How can I download metadata on a list of files

Posted in TCGA data on the CGC by Anjan Purkayastha Fri Sep 02 2016 19:11:05 GMT+0000 (UTC)·3·Viewed 998 times

I would like to download the metadata for a set of files. For example, I have four gene-wise read-count text files. Each file has associated metadata: platform type, reference genome, disease, investigation, etc. I would like to download the associated metadata for a set of files, as a text file, onto my local machine. Is it possible to do this? Thanks, Anjan
Erik Lehnert
Sep 6, 2016

Hi Anjan,

Currently, there is no way to do this with the GUI. However, it is possible to obtain file metadata using the API. You can find examples of how to obtain metadata using the Python API library here.

That said, we are investigating implementing this, as several users have expressed a desire for this functionality.

Maor Maor
June 5, 2018

For anyone who comes across this in the future:

A nearly perfect solution exists.
The API docs are found at:
https://docs.cancergenomicscloud.org/blog/programmatically-access-tcga-data-using-the-seven-bridges-cancer-genomics-cloud
https://docs.cancergenomicscloud.org/v1.0/docs/browse-datasets-via-the-datasets-api

The gist of it:

  • You need an API key, and you write a function that uses the key to make queries.
  • You need to know the dataset and entity schema so that you know which fields you can query. For example, for TCGA cases, issue a GET request to the URL $CGC_URL/datasets/tcga/v0/cases/schema

  • Say you want to get the disease type and primary site for SPECIFIC cases. In the response for the cases schema, you will notice the fields hasID, hasPrimarySite and hasDiseaseType.

  • Issue a POST request to $CGC_URL/query?offset=0 with the post data:
    query_body = {
        "entity": "cases",
        "fields": ["hasID","hasInvestigation","hasDiseaseType","hasPrimarySite"],
        "hasID": ["TCGA-CN-A63T","TCGA-CJ-4902"]
    }
    (This requests two specific cases and, for each case, returns the 4 fields we want.)

  • Now say you want the full metadata for a specific case. In the last response, each case has the field "hasID", which you requested and which looks like 'TCGA-XX-XXXX' (the logical ID). It also has a programmatic ID, the field "id" (always returned, whether or not you requested it).

  • A GET call to $CGC_URL/datasets/v0/tcga/cases/$PROGRAMATIC_ID
    will give you all the info on this specific case. Each entity response (such as a case) contains links to other related entities, so in the last JSON response you will see a link:
    $CGC_URL/datasets/v0/tcga/cases/$PROGRAMATIC_ID/files
    and a GET call to this will list all files for this specific case.
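
The steps above could be sketched in Python roughly as follows. This is only a minimal sketch using the standard library, not an official client. The function names (build_case_query, build_query_request, case_files_url, send) are made up for illustration, CGC_URL stands in for the $CGC_URL placeholder used above (the host is the one that appears in the reproduction example in this thread), and the "X-SBG-Auth-Token" header name is an assumption about how the API key is passed, so check it against the docs linked above.

```python
import json
import urllib.request

# Assumption: stands in for $CGC_URL; the host is taken from an example URL in this thread.
CGC_URL = "https://cgc-datasets-api.sbgenomics.com"
# Assumption: name of the header that carries the API key.
AUTH_HEADER = "X-SBG-Auth-Token"


def build_case_query(case_ids, fields):
    """Return the POST body for a cases query restricted to the given logical IDs."""
    return {"entity": "cases", "fields": list(fields), "hasID": list(case_ids)}


def build_query_request(base_url, token, body, offset=0):
    """Build (but do not send) the POST request to <base_url>/query?offset=N."""
    return urllib.request.Request(
        url=f"{base_url}/query?offset={offset}",
        data=json.dumps(body).encode("utf-8"),
        headers={AUTH_HEADER: token, "Content-Type": "application/json"},
        method="POST",
    )


def case_files_url(base_url, programmatic_id):
    """URL that lists all files of one case, given the case's programmatic "id"."""
    return f"{base_url}/datasets/v0/tcga/cases/{programmatic_id}/files"


def send(request):
    """Send a prepared request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would look something like send(build_query_request(CGC_URL, "YOUR_API_KEY", build_case_query(["TCGA-CN-A63T", "TCGA-CJ-4902"], ["hasID", "hasDiseaseType", "hasPrimarySite"]))), followed by a GET on case_files_url(...) for each "id" in the response. Building the request separately from sending it makes it easy to inspect the URL and body before anything goes over the network.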

Final notes:

  • My impression of the TCGA programmatic API is that it is pretty awesome and allows most operations. But documentation is severely lacking, and it takes some time to learn its capabilities. More examples are needed.
  • Some features are still buggy. A list of bugs that I found:
    • Paging when making queries with "hasID": list. It seems to always return the first page, regardless of offset=$NUM. I implemented paging by controlling the list I feed into the hasID field.
    • .svs slides are now available in the GDC portal as of May 21 2018 (previously they were only available in the GDC legacy archive). However, listing the files of a case through the API does not return .svs files. An example to reproduce:

Issue a GET call to:
https://cgc-datasets-api.sbgenomics.com/datasets/v0/tcga/cases/9FE336A8-08A7-4FE7-BF45-AFD6A8EB9C75/files

It returns 44 files, none of which are .svs files. However, on the website:
https://portal.gdc.cancer.gov/cases/9fe336a8-08a7-4fe7-bf45-afd6a8eb9c75
you can see that .svs files are available.
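
The paging workaround mentioned above (controlling the list fed into the hasID field) could be sketched like this. It is a rough illustration only: chunked, paged_query_bodies and the page_size parameter are hypothetical names, and the chunk size to use is an assumption, since the actual page size of the API is not stated in this thread.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def paged_query_bodies(case_ids, fields, page_size):
    """Yield one query body per chunk of case IDs; each is meant to be POSTed with offset=0."""
    for chunk in chunked(list(case_ids), page_size):
        yield {"entity": "cases", "fields": list(fields), "hasID": chunk}
```

Each body produced this way asks only for as many cases as fit in one response, so the offset parameter never needs to go above 0 and the paging bug is sidestepped.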


  