{"metadata":{"image":[],"title":"","description":""},"api":{"url":"","auth":"required","results":{"codes":[]},"settings":"","params":[]},"next":{"description":"","pages":[]},"title":"Fetch metadata from the PDC API","type":"basic","slug":"fetch-metadata-from-the-pdc-api","excerpt":"","body":"[block:callout]\n{\n  \"type\": \"info\",\n  \"title\": \"\",\n  \"body\": \"This tutorial provides an example of how to fetch metadata for data imported from the PDC. Learn how to [import data from the PDC](doc:import-from-the-pdc).\"\n}\n[/block]\n### Introduction\n\nThis short tutorial will show how you can use Data Cruncher and the PDC API to get the desired metadata for proteomic data [imported from the PDC](https://pdc.cancer.gov/pdc/browse). Data Cruncher is an interactive environment within the CGC that allows you perform further analyses of your data using JupyterLab or RStudio. The PDC API will retrieve the most up to date metadata for the files in your CGC projects. We will also show how you can find the corresponding genomic data from the TCGA GRCh38 dataset for the proteomic data from the PDC and how you can import the data into a project on the CGC.\n\nThe use case shown in the tutorial is for demonstration purposes, you can customize the code to retrieve metadata of you choice.\n\n### Prerequisites\n\n* An active account on the CGC.\n\n### Steps\n1. [Create a project on the CGC](#section-1-create-a-project-on-the-cgc).\n2. [Use Data Cruncher to retrieve metadata from the PDC API](#section-2-use-data-cruncher-to-extract-metadata-from-the-pdc-api).\n3. [Create iTRAQ4 to Case Submitter ID mapping for RMS files](#section-create-itraq4-to-case-submitter-id-mapping-for-rms-files)\n4. [Query the TCGA GRCh38 dataset using the CGC Datasets API](#section-create-itraq4-to-case-submitter-id-mapping-for-rms-files)\n\n### 1. Create a project on the CGC\n1. On the CGC home page, click **Projects**.\n2. In the bottom-right corner of the dropdown menu click**+ Create a project**. 
The project creation dialog opens.\n3. Name the project **PDC Metadata**.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/b56c16b-pdc-tutorial-1.png\",\n        \"pdc-tutorial-1.png\",\n        477,\n        482,\n        \"#e1e9ea\"\n      ]\n    }\n  ]\n}\n[/block]\n4. Keep the predefined values for other settings. To learn more about project settings, see [more details](doc:modify-project-settings).\n5. Click **Create**. The project is now created and you are taken to the [project dashboard](doc:manage-the-project-dashboard).\n\nThe next step is to create a Data Cruncher analysis and use it to retrieve PDC metadata.\n\n### 2. Use Data Cruncher to extract metadata from the PDC API\n\nWe've created an example analysis that shows how to fetch metadata interactively. This part of the tutorial adapts code and instructions originally available in the [PDC Clustergram example](https://pdc.cancer.gov/API_documentation/PDC_clustergram.html) from the PDC documentation.\n\n1. In the **PDC Metadata** project, click the **Interactive analysis** tab on the right.\n2. On the **Data Cruncher** card, click **Open**. You are taken to the Data Cruncher.\n3. In the top-right corner, click **Create new analysis**. If you don't have any previous analyses, you will see the **Create your first analysis** button in the center of the screen.\n4. In the **Analysis name** field, enter **PDC Metadata**.\n5. Select **JupyterLab** as the analysis environment.\n6. Click **Next**.\n7. Keep the predefined values for **Compute requirements** and click **Start the analysis**. The CGC will start acquiring an adequate instance for your analysis, which may take a few minutes. Once the analysis is ready, you will be notified.\n8. Run your analysis by clicking **Open in editor** in the top-right corner. JupyterLab opens.\n9. 
In the **Notebook** section on the JupyterLab home screen, select **Python 3**. You can now start entering the analysis code in the cells. Please note that each time a block of code is entered, it needs to be executed by clicking <img src=\"https://files.readme.io/463bd79-run.png\" width=\"auto\" align=\"inline\" style=\"margin:1px\"/> or by pressing Shift + Enter on the keyboard.\n\n10. Let's begin by importing the `requests` and `json` modules in Python:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"import requests\\nimport json\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n11. We will now add the function that will be used to query the PDC API:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"def query_pdc(query):\\n    URL = \\\"https://pdc.cancer.gov/graphql\\\"\\n    # Send the POST graphql query\\n    print('Sending query.')\\n    pdc_response = requests.post(URL, json={'query': query})\\n\\n    # Set up a data structure for the query result\\n    decoded = dict()\\n\\n    # Check the results\\n    if pdc_response.ok:\\n        # Decode the response\\n        decoded = pdc_response.json()\\n    else:\\n        # Response not OK, see error\\n        pdc_response.raise_for_status()\\n    return decoded\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n12. 
Let's create the query and use the function above to fetch all proteome case IDs from the **CPTAC2 Retrospective** project:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"cases = []\\nfor offset in range(0, 1000, 100):\\n    \\n    cases_query = \\\"\\\"\\\"{\\\"\\\"\\\" + \\\"\\\"\\\"getPaginatedUICase(project_name:\\\"CPTAC2 Retrospective\\\", \\n              analytical_fraction: \\\"Proteome\\\",\\n              primary_site: \\\"Breast\\\",\\n              disease_type: \\\"Breast Invasive Carcinoma\\\",\\n              limit: 100, offset: {})\\\"\\\"\\\".format(offset) + \\\"\\\"\\\"{\\n              uiCases {\\n                case_id\\n                project_name\\n\\n              }\\n            } \\n            }\\\"\\\"\\\"\\n    \\n    result = query_pdc(cases_query)\\n    \\n    for case in result['data']['getPaginatedUICase']['uiCases']:\\n        cases.append(case['case_id'])\\n\\n        \\nprint('\\\\nNumber of cases:' + str(len(set(cases))))\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThis outputs query status information and the number of returned cases:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"Sending query.\\nSending query.\\nSending query.\\nSending query.\\nSending query.\\nSending query.\\nSending query.\\nSending query.\\nSending query.\\nSending query.\\n\\nNumber of cases:108\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\n13. We will now import the `pandas` library that will allow us to format and output data:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"import pandas as pd\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nIf the library is not available, install it using the [pip command](https://packaging.python.org/tutorials/installing-packages/#use-pip-for-installing).\n\n14. 
Next, let's set up the **Submitter ID** parameter for the given study, which we will use to retrieve the corresponding set of metadata.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"study_submitter_id = \\\"S015-1\\\" # S015-1 is TCGA_Breast_Cancer_Proteome\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n15. Now we'll create a query for clinical metadata using the defined query parameters:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"metadata_query = '''\\n    {\\n        clinicalMetadata(study_submitter_id: \\\"''' + study_submitter_id + '''\\\") {\\n            aliquot_submitter_id\\n            morphology\\n            primary_diagnosis\\n            tumor_grade\\n            tumor_stage\\n        }\\n    }\\n    '''\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n16. Next, let's query the PDC API for the clinical metadata. The data is then converted to a pandas dataframe.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"decoded = query_pdc(metadata_query)\\nmatrix = decoded['data']['clinicalMetadata']\\nmetadata = pd.DataFrame(matrix, columns=matrix[0]).set_index('aliquot_submitter_id')\\nprint('Created a dataframe of these dimensions: {}'.format(metadata.shape))\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n17. 
Finally, we'll print out the metadata fetched from the PDC and converted into a pandas dataframe:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"print(metadata)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThe output should be:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"aliquot_submitter_id morphology primary_diagnosis \\\\ \\nTCGA-AO-A12B-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A12B-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A12D-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A12D-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A12D-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\n... ... ... \\nTCGA-AO-A0JE-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A0JJ-01A-31-A21W-30 8520/3 Lobular carcinoma, NOS \\nTCGA-AO-A0JL-01A-41-A21W-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A0JM-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \\nTCGA-AO-A126-01A-22-A21W-30 8500/3 Infiltrating duct carcinoma, NOS \\n\\n\\naliquot_submitter_id tumor_grade tumor_stage \\nTCGA-AO-A12B-01A-41-A21V-30 Not Reported stage iia \\nTCGA-AO-A12B-01A-41-A21V-30 Not Reported stage iia \\nTCGA-AO-A12D-01A-41-A21V-30 Not Reported stage iia \\nTCGA-AO-A12D-01A-41-A21V-30 Not Reported stage iia \\nTCGA-AO-A12D-01A-41-A21V-30 Not Reported stage iia \\n... ... ... \\nTCGA-AO-A0JE-01A-41-A21V-30 Not Reported stage iiia \\nTCGA-AO-A0JJ-01A-31-A21W-30 Not Reported stage iib \\nTCGA-AO-A0JL-01A-41-A21W-30 Not Reported stage iiia \\nTCGA-AO-A0JM-01A-41-A21V-30 Not Reported stage iib \\nTCGA-AO-A126-01A-22-A21W-30 Not Reported stage iia \\n\\n[152 rows x 4 columns]\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\nThe output above has been truncated for brevity. All 152 rows will be displayed in your analysis.\n\n### 3. 
Create iTRAQ4 to Case Submitter ID mapping for RMS files\n\nThe objective of this section of the tutorial is to demonstrate how to filter files (by type), fetch a file's metadata and use it to create iTRAQ4 to Case Submitter ID matches. In this case, we will be fetching data about Raw Mass Spectra (RMS) files using the same Study Submitter ID used during the course of this tutorial, `S015-1`.\n\n1. Create a query that will be used to fetch all data about RMS files from the study:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"rms_query = \\\"\\\"\\\"{filesPerStudy(study_submitter_id: \\\"S015-1\\\",\\n  data_category:\\\"Raw Mass Spectra\\\") {\\n  study_id\\n  study_name\\n  file_id\\n  data_category\\n}}\\\"\\\"\\\"\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n2. Execute the query:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"rms = query_pdc(rms_query)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n3. Parse returned data:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"### We now have all Raw Mass Spectra files for S015-1\\nrms_result = rms['data']['filesPerStudy']\\n\\n# Show first 5 files\\nfor file in rms_result[:5]:\\n    print(file)\\n    print('\\\\n')\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThe returned output should be:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '00064e82-5ceb-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\\n\\n\\n{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '0029431e-5cb0-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\\n\\n\\n{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '003914a0-5cd9-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\\n\\n\\n{'study_id': 
'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '00466270-5cef-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\\n\\n\\n{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '005db03e-5cdf-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\n4. Select the first RMS file from the returned data:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Pick first Raw Mass Spectra file from S015-1\\nfile_id = rms_result[0]['file_id']\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n5. Fetch metadata for the selected file ID:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"file_query=\\\"\\\"\\\"{\\\"\\\"\\\" +  \\\"\\\"\\\"fileMetadata(file_id: \\\"{}\\\")\\\"\\\"\\\".format(file_id) + \\\"\\\"\\\"{\\n           file_name\\n           file_size\\n           md5sum\\n           file_location\\n           file_submitter_id\\n           fraction_number\\n           experiment_type\\n           data_category\\n           file_type\\n           file_format\\n           plex_or_dataset_name\\n           analyte\\n           instrument\\n           aliquots { \\n               aliquot_id\\n               aliquot_submitter_id\\n               label\\n               sample_id\\n               sample_submitter_id\\n               case_id\\n               case_submitter_id\\n           }\\n       }\\n      }\\\"\\\"\\\"\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n6. Execute the metadata query:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"result = query_pdc(file_query)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n7. 
Create a dictionary that contains the iTRAQ4 to Case Submitter ID mapping:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"label_map = {}\\naliquots = result['data']['fileMetadata'][0]['aliquots']\\nfor aliquot in aliquots:\\n    print(aliquot)\\n    print(\\\"\\\\n\\\")\\n    label = aliquot['label']\\n    if aliquot['case_id']:\\n        case_submitter_id = aliquot['case_submitter_id']\\n        label_map[label] = case_submitter_id\\n    else:\\n        label_map[label] = 'Reference'\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThis code block returns the following output:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{'aliquot_id': '34317a3a-6429-11e8-bcf1-0a2705229b82', 'aliquot_submitter_id': 'TCGA-C8-A12V-01A-41-A21V-30', 'label': 'iTRAQ4 115', 'sample_id': '83929378-6420-11e8-bcf1-0a2705229b82', 'sample_submitter_id': 'TCGA-C8-A12V-01A', 'case_id': 'f49043f6-63d8-11e8-bcf1-0a2705229b82', 'case_submitter_id': 'TCGA-C8-A12V'}\\n\\n\\n{'aliquot_id': 'fd50c409-6428-11e8-bcf1-0a2705229b82', 'aliquot_submitter_id': 'TCGA-AO-A0JM-01A-41-A21V-30', 'label': 'iTRAQ4 114', 'sample_id': '35800901-6420-11e8-bcf1-0a2705229b82', 'sample_submitter_id': 'TCGA-AO-A0JM-01A', 'case_id': 'c10409cb-63d8-11e8-bcf1-0a2705229b82', 'case_submitter_id': 'TCGA-AO-A0JM'}\\n\\n\\n{'aliquot_id': '6a47a426-ec51-11e9-81b4-2a2ae2dbcce4', 'aliquot_submitter_id': 'Internal Reference', 'label': 'iTRAQ4 117', 'sample_id': '6a479058-ec51-11e9-81b4-2a2ae2dbcce4', 'sample_submitter_id': 'Internal Reference', 'case_id': '6a477ef6-ec51-11e9-81b4-2a2ae2dbcce4', 'case_submitter_id': 'Internal Reference'}\\n\\n\\n{'aliquot_id': '6dbd35fe-6428-11e8-bcf1-0a2705229b82', 'aliquot_submitter_id': 'TCGA-A8-A08G-01A-13-A21W-30', 'label': 'iTRAQ4 116', 'sample_id': '78f43d0f-641f-11e8-bcf1-0a2705229b82', 'sample_submitter_id': 'TCGA-A8-A08G-01A', 'case_id': '3bdfde9b-63d8-11e8-bcf1-0a2705229b82', 'case_submitter_id': 'TCGA-A8-A08G'}\",\n      \"language\": \"text\"\n    
}\n  ]\n}\n[/block]\n8. Let's print out the dictionary:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"label_map\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nHere's what the created dictionary looks like:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{'iTRAQ4 115': 'TCGA-C8-A12V',\\n 'iTRAQ4 114': 'TCGA-AO-A0JM',\\n 'iTRAQ4 117': 'Internal Reference',\\n 'iTRAQ4 116': 'TCGA-A8-A08G'}\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n### 4. Query the TCGA GRCh38 dataset using the CGC Datasets API\n\nThe final section of the tutorial aims to show how you can connect proteomic data obtained from the PDC to genomics data from the TCGA GRCh38 dataset using the `SubmitterId` property.\n\n1. Set up the API URL and your authentication token. Learn how to [get your authentication token](doc:get-your-authentication-token).\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Query TCGA GRCh38 dataset on Datasets API\\ndatasets_api_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/'\\n\\n# Link to Datasets API docs for obtaining AUTH TOKEN\\ntoken = 'MY_AUTH_TOKEN'\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nPlease make sure to replace `MY_AUTH_TOKEN` with your authentication token obtained from the CGC.\n\n2. Set up the query to find cases for the submitter IDs in the TCGA GRCh38 dataset on the CGC, using the Datasets API:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"query = {\\n    \\\"entity\\\": \\\"cases\\\",\\n    \\\"hasSubmitterId\\\": [\\\"TCGA-C8-A12V\\\", \\\"TCGA-AO-A0JM\\\", \\\"TCGA-A8-A08G\\\"]\\n}\\nheaders = {\\\"X-SBG-Auth-Token\\\": token}\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nIn this query, we are using Case Submitter IDs as the linking property to find the corresponding genomic data (Case UUIDs) for the obtained proteomic information from the PDC. \n\n3. 
Let's execute the query:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"result = requests.post(datasets_api_url + 'tcga_grch38/v0/query', json.dumps(query), headers=headers)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n4. We will now parse and print Case UUIDs from the result:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"cases = []\\n\\nfor case in result.json()[\\\"_embedded\\\"][\\\"cases\\\"]:\\n    cases.append(case['label'])\\n\\nprint(cases)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThis returns a list of three items:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"['44bec761-b603-49c0-8634-f6bfe0319bb1', '719082cc-1ebe-4a51-a659-85a59db1d77d', '7e1673f8-5758-4963-8804-d5e39f06205b']\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\n5. Now, let's fetch and list all experimental strategies and data types available on the CGC for the TCGA GRCh38 dataset:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# List all experimental strategies and data types\\nfiles_schema = requests.get(datasets_api_url + 'tcga_grch38/v0/files/schema', headers=headers)\\n\\n# Experimental Strategy\\nprint(\\\"Experimental strategy\\\")\\nprint(files_schema.json()['hasExperimentalStrategy'])\\n\\n# DataType\\nprint(\\\"\\\\n\\\")\\nprint(\\\"DataType\\\")\\nfiles_schema.json()['hasDataType']\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n6. 
We will now create a query that will count how many Gene Expression Quantification files there are for the previously selected Case IDs and the RNA-Seq experimental strategy:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"files_query = {\\n    \\\"entity\\\": \\\"files\\\",\\n    \\\"hasCase\\\": cases,\\n    \\\"hasExperimentalStrategy\\\": \\\"RNA-Seq\\\",\\n    \\\"hasDataType\\\": \\\"Gene Expression Quantification\\\",\\n    \\\"hasAccessLevel\\\": \\\"Open\\\"\\n}\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n7. Let's execute the query:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"result_files_count = requests.post(datasets_api_url + 'tcga_grch38/v0/query/total', json.dumps(files_query), headers=headers)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n8. Let's print out the result:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"result_files_count.text\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nIf everything went well, you should get the following output:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"'{\\\"total\\\": 9}'\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\n9. Now we'll fetch details for the nine files that match our query:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"result_files = requests.post(datasets_api_url + 'tcga_grch38/v0/query', json.dumps(files_query), headers=headers)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nAnd create lists containing file IDs and file names:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"file_ids = []\\nfile_names = []\\n\\nfor file in result_files.json()['_embedded']['files']:\\n    file_names.append(file[\\\"label\\\"])\\n    file_ids.append(file[\\\"id\\\"])\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n10. 
Let's print out the list containing parsed file names:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"file_names\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nYou should get a list of all nine file names:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"['54401c2a-7124-42b7-90d8-6267575bce51.FPKM-UQ.txt.gz',\\n '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM.txt.gz',\\n '4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM.txt.gz',\\n '54401c2a-7124-42b7-90d8-6267575bce51.htseq.counts.gz',\\n '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.htseq.counts.gz',\\n '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM-UQ.txt.gz',\\n '4d5b0ba8-64d8-404b-9a83-fc4111686afe.htseq.counts.gz',\\n '4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM-UQ.txt.gz',\\n '54401c2a-7124-42b7-90d8-6267575bce51.FPKM.txt.gz']\",\n      \"language\": \"text\"\n    }\n  ]\n}\n[/block]\n11. Finally, let's copy the files to a project on the CGC so they can be used for further analyses. We will start by importing the **sevenbridges-python** library that is available by default in Data Cruncher:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"import sevenbridges as sbg\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n12. Let's set up the needed parameters for the CGC API:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token=token)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n13. 
Select the destination project on the CGC:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"my_project_name = 'test'\\n\\n# Find your project\\nmy_project = [p for p in api.projects.query(limit=100).all() if p.name == my_project_name][0]\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nPlease make sure to replace `test` with the actual name of your project on the CGC.\n\nLet's confirm that we have selected an existing project:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"my_project\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThis should return an output similar to:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"<Project: id=rfranklin/test>\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n14. We will now do the actual copying of files to the defined project:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"for file in file_ids:\\n    f = api.files.get(file)\\n    f.copy(project=my_project)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nAnd let's verify that the files are in the project:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"my_files = api.files.query(limit = 100, project = my_project.id).all()\\n\\nfor file in my_files:\\n    print(file.name)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nIf everything went well, the result should include the following files:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM-UQ.txt.gz\\n411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM.txt.gz\\n411e5567-82fd-4cc8-99ea-5ed9bd3a198e.htseq.counts.gz\\n4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM-UQ.txt.gz\\n4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM.txt.gz\\n4d5b0ba8-64d8-404b-9a83-fc4111686afe.htseq.counts.gz\\n54401c2a-7124-42b7-90d8-6267575bce51.FPKM-UQ.txt.gz\\n54401c2a-7124-42b7-90d8-6267575bce51.FPKM.txt.gz\\n54401c2a-7124-42b7-90d8-6267575bce51.htseq.counts.gz\",\n      \"language\": \"text\"\n    }\n  
]\n}\n[/block]\nIf your project already contained some files, the returned list will also include those files along with the newly-copied ones.\n\nThe procedure above fetches and copies files that are classified as [Open Data](doc:dbgap-controlled-data-access#section-open-data) and are available for all CGC users. These are aggregated files and can be used for further analyses using Data Cruncher on the CGC. To obtain files containing Aligned Reads from the TCGA GRCh38 dataset, you will need to have access to [Controlled Data](doc:dbgap-controlled-data-access#section-controlled-data) through [dbGaP](https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login). If the account you are using to log in to the CGC has access to Controlled Data, the query to get Aligned Reads files (step 6 above) should be:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"files_query = {\\n    \\\"entity\\\": \\\"files\\\",\\n    \\\"hasCase\\\": cases,\\n    \\\"hasExperimentalStrategy\\\": \\\"RNA-Seq\\\",\\n    \\\"hasDataType\\\": \\\"Aligned Reads\\\",\\n    \\\"hasAccessLevel\\\": \\\"Controlled\\\"\\n}\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nThe rest of the procedure used to find and copy the files to your project on the CGC is the same as the one for Open Data described 
above.","updates":[],"order":5,"isReference":false,"hidden":false,"sync_unique":"","link_url":"","link_external":false,"_id":"5d4c1907c2f6510047fc859b","project":"55faf11ba62ba1170021a9a7","version":{"version":"1.0","version_clean":"1.0.0","codename":"","is_stable":true,"is_beta":true,"is_hidden":false,"is_deprecated":false,"categories":["55faf11ca62ba1170021a9ab","55faf8f4d0e22017005b8272","55faf91aa62ba1170021a9b5","55faf929a8a7770d00c2c0bd","55faf932a8a7770d00c2c0bf","55faf94b17b9d00d00969f47","55faf958d0e22017005b8274","55faf95fa8a7770d00c2c0c0","55faf96917b9d00d00969f48","55faf970a8a7770d00c2c0c1","55faf98c825d5f19001fa3a6","55faf99aa62ba1170021a9b8","55faf99fa62ba1170021a9b9","55faf9aa17b9d00d00969f49","55faf9b6a8a7770d00c2c0c3","55faf9bda62ba1170021a9ba","5604570090ee490d00440551","5637e8b2fbe1c50d008cb078","5649bb624fa1460d00780add","5671974d1b6b730d008b4823","5671979d60c8e70d006c9760","568e8eef70ca1f0d0035808e","56d0a2081ecc471500f1795e","56d4a0adde40c70b00823ea3","56d96b03dd90610b00270849","56fbb83d8f21c817002af880","573c811bee2b3b2200422be1","576bc92afb62dd20001cda85","5771811e27a5c20e00030dcd","5785191af3a10c0e009b75b0","57bdf84d5d48411900cd8dc0","57ff5c5dc135231700aed806","5804caf792398f0f00e77521","58458b4fba4f1c0f009692bb","586d3c287c6b5b2300c05055","58ef66d88646742f009a0216","58f5d52d7891630f00fe4e77","59a555bccdbd85001bfb1442","5a2a81f688574d001e9934f5","5b080c8d7833b20003ddbb6f","5c222bed4bc358002f21459a","5c22412594a2a5005cc9e919","5c41ae1c33592700190a291e","5c8a525e2ba7b2003f9b153c","5cbf14d58c79c700ef2b502e","5db6f03a6e187c006f667fa4"],"_id":"55faf11ba62ba1170021a9aa","releaseDate":"2015-09-17T16:58:03.490Z","createdAt":"2015-09-17T16:58:03.490Z","project":"55faf11ba62ba1170021a9a7","__v":46},"category":{"sync":{"isSync":false,"url":""},"pages":[],"title":"TUTORIALS","slug":"tutorials","order":1,"from_sync":false,"reference":false,"_id":"56fbb83d8f21c817002af880","version":"55faf11ba62ba1170021a9aa","createdAt":"2016-03-30T11:27:57.862Z","__v":0
,"project":"55faf11ba62ba1170021a9a7"},"user":"5767bc73bb15f40e00a28777","createdAt":"2019-08-08T12:43:51.105Z","__v":0,"parentDoc":null}

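The iTRAQ4-to-Case-Submitter-ID mapping from section 3 can be exercised without any API access by feeding it aliquot records shaped like the `fileMetadata` output shown above. This is a minimal sketch: the two sample records are abbreviated, and the record with `case_id` set to `None` is a hypothetical reference channel used to show the fallback branch:

```python
# Sample aliquot records shaped like the PDC fileMetadata output above
# (abbreviated; the None case_id is a hypothetical reference channel)
aliquots = [
    {'label': 'iTRAQ4 115', 'case_id': 'f49043f6-63d8-11e8-bcf1-0a2705229b82',
     'case_submitter_id': 'TCGA-C8-A12V'},
    {'label': 'iTRAQ4 117', 'case_id': None,
     'case_submitter_id': 'Internal Reference'},
]

label_map = {}
for aliquot in aliquots:
    # Aliquots without a case_id are treated as reference channels
    if aliquot['case_id']:
        label_map[aliquot['label']] = aliquot['case_submitter_id']
    else:
        label_map[aliquot['label']] = 'Reference'

print(label_map)  # {'iTRAQ4 115': 'TCGA-C8-A12V', 'iTRAQ4 117': 'Reference'}
```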
Fetch metadata from the PDC API


[block:callout] { "type": "info", "title": "", "body": "This tutorial provides an example of how to fetch metadata for data imported from the PDC. Learn how to [import data from the PDC](doc:import-from-the-pdc)." } [/block] ### Introduction This short tutorial will show how you can use Data Cruncher and the PDC API to get the desired metadata for proteomic data [imported from the PDC](https://pdc.cancer.gov/pdc/browse). Data Cruncher is an interactive environment within the CGC that allows you perform further analyses of your data using JupyterLab or RStudio. The PDC API will retrieve the most up to date metadata for the files in your CGC projects. We will also show how you can find the corresponding genomic data from the TCGA GRCh38 dataset for the proteomic data from the PDC and how you can import the data into a project on the CGC. The use case shown in the tutorial is for demonstration purposes, you can customize the code to retrieve metadata of you choice. ### Prerequisites * An active account on the CGC. ### Steps 1. [Create a project on the CGC](#section-1-create-a-project-on-the-cgc). 2. [Use Data Cruncher to retrieve metadata from the PDC API](#section-2-use-data-cruncher-to-extract-metadata-from-the-pdc-api). 3. [Create iTRAQ4 to Case Submitter ID mapping for RMS files](#section-create-itraq4-to-case-submitter-id-mapping-for-rms-files) 4. [Query the TCGA GRCh38 dataset using the CGC Datasets API](#section-create-itraq4-to-case-submitter-id-mapping-for-rms-files) ### 1. Create a project on the CGC 1. On the CGC home page, click **Projects**. 2. In the bottom-right corner of the dropdown menu click**+ Create a project**. Project creation dialog opens. 3. Name the project **PDC Metadata**. [block:image] { "images": [ { "image": [ "https://files.readme.io/b56c16b-pdc-tutorial-1.png", "pdc-tutorial-1.png", 477, 482, "#e1e9ea" ] } ] } [/block] 4. Keep the predefined values for other settings. 
If you want to learn more about project settings, see [more details](doc:modify-project-settings). 5. Click **Create**. The project is now created and you are taken to the [project dashboard](doc:manage-the-project-dashboard). The next step is to create a Data Cruncher analysis and use it to retrieve PDC metadata. ### 2. Use Data Cruncher to extract metadata from the PDC API We've created an example tutorial that will show you how to fetch metadata for an interactive analysis. This part of the tutorial incorporates and demonstrates a part of the code and instructions that are originally available in the [PDC Clustergram example](https://pdc.cancer.gov/API_documentation/PDC_clustergram.html) from PDC documentation.  1. In the **PDC Metadata** project, click the **Interactive analysis** tab on the right. 2. On the **Data Cruncher** card click **Open**. You are taken to the Data Cruncher. 3. In the top-right corner, click **Create new analysis**. If you don't have any previous analyses, you will see the **Create your first analysis button** in the center of the screen. 4. In the **Analysis name** field, enter **PDC Metadata**.  5. Select **JupyterLab** as the analysis environment. 6. Click **Next**. 7. Keep the predefined values for **Compute requirements** and click **Start the analysis**. The CGC will start acquiring an adequate instance for your analysis, which may take a few minutes. Once the analysis is ready, you will be notified. 8. Run your analysis by clicking **Open in editor** in the top-right corner. JupyterLab opens. 9. In the **Notebook** section on the JupyterLab home screen, select **Python 3**. You can now start entering the analysis code in the cells. Please note that each time a block of code is entered, it needs to be executed by clicking <img src="https://files.readme.io/463bd79-run.png" width="auto" align="inline" style="margin:1px"/> or by pressing Shift + Enter on the keyboard. 10. 
Let's begin by importing the `requests` and `json` modules in Python: [block:code] { "codes": [ { "code": "import requests\nimport json", "language": "python" } ] } [/block] 11. We will now add the function that will be used to query the PDC API: [block:code] { "codes": [ { "code": "def query_pdc(query):\n    URL = \"https://pdc.cancer.gov/graphql\"\n    # Send the POST graphql query\n    print('Sending query.')\n    pdc_response = requests.post(URL, json={'query': query})\n\n    # Set up a data structure for the query result\n    decoded = dict()\n\n    # Check the results\n    if pdc_response.ok:\n        # Decode the response\n        decoded = pdc_response.json()\n    else:\n        # Response not OK, see error\n        pdc_response.raise_for_status()\n    return decoded", "language": "python" } ] } [/block] 12. Let's create the query and use the function above to fetch all proteome case IDs from the **CPTAC2 Retrospective** project: [block:code] { "codes": [ { "code": "cases = []\nfor offset in range(0, 1000, 100):\n    cases_query = \"\"\"{\"\"\" + \"\"\"getPaginatedUICase(project_name:\"CPTAC2 Retrospective\",\n    analytical_fraction: \"Proteome\",\n    primary_site: \"Breast\",\n    disease_type: \"Breast Invasive Carcinoma\",\n    limit: 100, offset: {})\"\"\".format(offset) + \"\"\"{\n        uiCases {\n            case_id\n            project_name\n        }\n    }\n}\"\"\"\n\n    result = query_pdc(cases_query)\n\n    for case in result['data']['getPaginatedUICase']['uiCases']:\n        cases.append(case['case_id'])\n\nprint('\\nNumber of cases:' + str(len(set(cases))))", "language": "python" } ] } [/block] This outputs query status information and the number of returned cases: [block:code] { "codes": [ { "code": "Sending query.\nSending query.\nSending query.\nSending query.\nSending query.\nSending query.\nSending query.\nSending query.\nSending query.\nSending query.\n\nNumber of cases:108", "language": "text" } ] } [/block] 13. 
We will now import the `pandas` library that will allow us to format and output data: [block:code] { "codes": [ { "code": "import pandas as pd", "language": "python" } ] } [/block] If the library is not available, install it using the [pip command](https://packaging.python.org/tutorials/installing-packages/#use-pip-for-installing). 14. Next, let's set up the **Submitter ID** parameter for the given study, which we will use to retrieve the corresponding set of metadata. [block:code] { "codes": [ { "code": "study_submitter_id = \"S015-1\" # S015-1 is TCGA_Breast_Cancer_Proteome", "language": "python" } ] } [/block] 15. Now we'll create a query for clinical metadata using the defined query parameters: [block:code] { "codes": [ { "code": "metadata_query = '''\n {\n clinicalMetadata(study_submitter_id: \"''' + study_submitter_id + '''\") {\n aliquot_submitter_id\n morphology\n primary_diagnosis\n tumor_grade\n tumor_stage\n }\n }\n '''", "language": "python" } ] } [/block] 16. Next, let's query the PDC API for the clinical metadata. The data is then converted to a pandas dataframe. [block:code] { "codes": [ { "code": "decoded = query_pdc(metadata_query)\nmatrix = decoded['data']['clinicalMetadata']\nmetadata = pd.DataFrame(matrix, columns=matrix[0]).set_index('aliquot_submitter_id')\nprint('Created a dataframe of these dimensions: {}'.format(metadata.shape))", "language": "python" } ] } [/block] 17. 
Finally, we'll print out the metadata fetched from the PDC and converted into a pandas dataframe: [block:code] { "codes": [ { "code": "print(metadata)", "language": "python" } ] } [/block] The output should be: [block:code] { "codes": [ { "code": "aliquot_submitter_id morphology primary_diagnosis \\ \nTCGA-AO-A12B-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A12B-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A12D-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A12D-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A12D-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \n... ... ... \nTCGA-AO-A0JE-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A0JJ-01A-31-A21W-30 8520/3 Lobular carcinoma, NOS \nTCGA-AO-A0JL-01A-41-A21W-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A0JM-01A-41-A21V-30 8500/3 Infiltrating duct carcinoma, NOS \nTCGA-AO-A126-01A-22-A21W-30 8500/3 Infiltrating duct carcinoma, NOS \n\n\naliquot_submitter_id tumor_grade tumor_stage \nTCGA-AO-A12B-01A-41-A21V-30 Not Reported stage iia \nTCGA-AO-A12B-01A-41-A21V-30 Not Reported stage iia \nTCGA-AO-A12D-01A-41-A21V-30 Not Reported stage iia \nTCGA-AO-A12D-01A-41-A21V-30 Not Reported stage iia \nTCGA-AO-A12D-01A-41-A21V-30 Not Reported stage iia \n... ... ... \nTCGA-AO-A0JE-01A-41-A21V-30 Not Reported stage iiia \nTCGA-AO-A0JJ-01A-31-A21W-30 Not Reported stage iib \nTCGA-AO-A0JL-01A-41-A21W-30 Not Reported stage iiia \nTCGA-AO-A0JM-01A-41-A21V-30 Not Reported stage iib \nTCGA-AO-A126-01A-22-A21W-30 Not Reported stage iia \n\n[152 rows x 4 columns]", "language": "text" } ] } [/block] The output above has been truncated for brevity. All 152 rows will be displayed in your analysis. ### 3. 
Create iTRAQ4 to Case Submitter ID mapping for RMS files The objective of this section of the tutorial is to demonstrate how to filter files (by type), fetch a file's metadata and use it to create iTRAQ4 to Case Submitter ID matches. In this case, we will be fetching data about Raw Mass Spectra (RMS) files using the same Study Submitter ID used during the course of this tutorial, `S015-1`. 1. Create a query that will be used to fetch all data about RMS files from the study: [block:code] { "codes": [ { "code": "rms_query = \"\"\"{filesPerStudy(study_submitter_id: \"S015-1\",\n data_category:\"Raw Mass Spectra\") {\n study_id\n study_name\n file_id\n data_category\n}}\"\"\"", "language": "python" } ] } [/block] 2. Execute the query: [block:code] { "codes": [ { "code": "rms = query_pdc(rms_query)", "language": "python" } ] } [/block] 3. Parse returned data: [block:code] { "codes": [ { "code": "### We now have all Raw Mass Spectra files for S015-1\nrms_result = rms['data']['filesPerStudy']\n\n# Show first 5 files\nfor file in rms_result[:5]:\n print(file)\n print('\\n')", "language": "python" } ] } [/block] The returned output should be: [block:code] { "codes": [ { "code": "{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '00064e82-5ceb-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\n\n\n{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '0029431e-5cb0-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\n\n\n{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '003914a0-5cd9-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\n\n\n{'study_id': 'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '00466270-5cef-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}\n\n\n{'study_id': 
'b8da9eeb-57b8-11e8-b07a-00a098d917f8', 'study_name': 'TCGA_Breast_Cancer_Proteome', 'file_id': '005db03e-5cdf-11e9-849f-005056921935', 'data_category': 'Raw Mass Spectra'}", "language": "text" } ] } [/block] 4. Select the first RMS file from the returned data: [block:code] { "codes": [ { "code": "# Pick first Raw Mass Spectra file from S015-1\nfile_id = rms_result[0]['file_id']", "language": "python" } ] } [/block] 5. Fetch metadata for the selected file ID: [block:code] { "codes": [ { "code": "file_query=\"\"\"{\"\"\" + \"\"\"fileMetadata(file_id: \"{}\")\"\"\".format(file_id) + \"\"\"{\n file_name\n file_size\n md5sum\n file_location\n file_submitter_id\n fraction_number\n experiment_type\n data_category\n file_type\n file_format\n plex_or_dataset_name\n analyte\n instrument\n aliquots { \n aliquot_id\n aliquot_submitter_id\n label\n sample_id\n sample_submitter_id\n case_id\n case_submitter_id\n }\n }\n }\"\"\"", "language": "python" } ] } [/block] 6. Execute the metadata query: [block:code] { "codes": [ { "code": "result = query_pdc(file_query)", "language": "python" } ] } [/block] 7. 
Create a dictionary that contains the iTRAQ4 to Case Submitter ID mapping: [block:code] { "codes": [ { "code": "label_map = {}\naliquots = result['data']['fileMetadata'][0]['aliquots']\nfor aliquot in aliquots:\n    print(aliquot)\n    print(\"\\n\")\n    label = aliquot['label']\n    if aliquot['case_id']:\n        case_submitter_id = aliquot['case_submitter_id']\n        label_map[label] = case_submitter_id\n    else:\n        label_map[label] = 'Reference'", "language": "python" } ] } [/block] This code block returns the following output: [block:code] { "codes": [ { "code": "{'aliquot_id': '34317a3a-6429-11e8-bcf1-0a2705229b82', 'aliquot_submitter_id': 'TCGA-C8-A12V-01A-41-A21V-30', 'label': 'iTRAQ4 115', 'sample_id': '83929378-6420-11e8-bcf1-0a2705229b82', 'sample_submitter_id': 'TCGA-C8-A12V-01A', 'case_id': 'f49043f6-63d8-11e8-bcf1-0a2705229b82', 'case_submitter_id': 'TCGA-C8-A12V'}\n\n\n{'aliquot_id': 'fd50c409-6428-11e8-bcf1-0a2705229b82', 'aliquot_submitter_id': 'TCGA-AO-A0JM-01A-41-A21V-30', 'label': 'iTRAQ4 114', 'sample_id': '35800901-6420-11e8-bcf1-0a2705229b82', 'sample_submitter_id': 'TCGA-AO-A0JM-01A', 'case_id': 'c10409cb-63d8-11e8-bcf1-0a2705229b82', 'case_submitter_id': 'TCGA-AO-A0JM'}\n\n\n{'aliquot_id': '6a47a426-ec51-11e9-81b4-2a2ae2dbcce4', 'aliquot_submitter_id': 'Internal Reference', 'label': 'iTRAQ4 117', 'sample_id': '6a479058-ec51-11e9-81b4-2a2ae2dbcce4', 'sample_submitter_id': 'Internal Reference', 'case_id': '6a477ef6-ec51-11e9-81b4-2a2ae2dbcce4', 'case_submitter_id': 'Internal Reference'}\n\n\n{'aliquot_id': '6dbd35fe-6428-11e8-bcf1-0a2705229b82', 'aliquot_submitter_id': 'TCGA-A8-A08G-01A-13-A21W-30', 'label': 'iTRAQ4 116', 'sample_id': '78f43d0f-641f-11e8-bcf1-0a2705229b82', 'sample_submitter_id': 'TCGA-A8-A08G-01A', 'case_id': '3bdfde9b-63d8-11e8-bcf1-0a2705229b82', 'case_submitter_id': 'TCGA-A8-A08G'}", "language": "text" } ] } [/block] 8. 
Let's print out the dictionary: [block:code] { "codes": [ { "code": "label_map", "language": "python" } ] } [/block] Here's what the created dictionary looks like: [block:code] { "codes": [ { "code": "{'iTRAQ4 115': 'TCGA-C8-A12V',\n 'iTRAQ4 114': 'TCGA-AO-A0JM',\n 'iTRAQ4 117': 'Internal Reference',\n 'iTRAQ4 116': 'TCGA-A8-A08G'}", "language": "python" } ] } [/block] ### 4. Query the TCGA GRCh38 dataset using the CGC Datasets API The final section of the tutorial shows how you can connect proteomic data obtained from the PDC to genomic data from the TCGA GRCh38 dataset using the `SubmitterId` property. 1. Set up the API URL and your authentication token. Learn how to [get your authentication token](doc:get-your-authentication-token). [block:code] { "codes": [ { "code": "# Query TCGA GRCh38 dataset on Datasets API\ndatasets_api_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/'\n\n# Link to Datasets API docs for obtaining AUTH TOKEN\ntoken = 'MY_AUTH_TOKEN'", "language": "python" } ] } [/block] Please make sure to replace `MY_AUTH_TOKEN` with your authentication token obtained from the CGC. 2. Set up the query to find cases for the submitter IDs in the TCGA GRCh38 dataset on the CGC, using the Datasets API: [block:code] { "codes": [ { "code": "query = {\n    \"entity\": \"cases\",\n    \"hasSubmitterId\": [\"TCGA-C8-A12V\", \"TCGA-AO-A0JM\", \"TCGA-A8-A08G\"]\n}\nheaders = {\"X-SBG-Auth-Token\": token}", "language": "python" } ] } [/block] In this query, we are using Case Submitter IDs as the linking property to find the corresponding genomic data (Case UUIDs) for the proteomic information obtained from the PDC.  3. Let's execute the query: [block:code] { "codes": [ { "code": "result = requests.post(datasets_api_url + 'tcga_grch38/v0/query', json.dumps(query), headers=headers)", "language": "python" } ] } [/block] 4. We will now parse and print Case UUIDs from the result.  
[block:code] { "codes": [ { "code": "cases = []\n\nfor case in result.json()[\"_embedded\"][\"cases\"]:\n cases.append(case['label'])\n\nprint(cases)", "language": "python" } ] } [/block] This returns a list of three items: [block:code] { "codes": [ { "code": "['44bec761-b603-49c0-8634-f6bfe0319bb1', '719082cc-1ebe-4a51-a659-85a59db1d77d', '7e1673f8-5758-4963-8804-d5e39f06205b']", "language": "text" } ] } [/block] 5. Now, let's fetch and list all experimental strategies and data types available on the CGC for the TCGA GRCh38 dataset: [block:code] { "codes": [ { "code": "# List all experimental strategies and Data types\nfiles_schema = requests.get(datasets_api_url + 'tcga_grch38/v0/files/schema', headers=headers)\n\n# Experimental Strategy\nprint(\"Experimental strategy\")\nprint(files_schema.json()['hasExperimentalStrategy'])\n\n# DataType\nprint(\"\\n\")\nprint(\"DataType\")\nfiles_schema.json()['hasDataType']", "language": "python" } ] } [/block] 6. We will now create a query that will count how many Gene Expression Quantification files there are for the previously selected Case IDs and the RNA-Seq experimental strategy: [block:code] { "codes": [ { "code": "files_query = {\n \"entity\": \"files\",\n \"hasCase\": cases,\n \"hasExperimentalStrategy\": \"RNA-Seq\",\n \"hasDataType\": \"Gene Expression Quantification\",\n \"hasAccessLevel\": \"Open\"\n}", "language": "python" } ] } [/block] 7. Let's execute the query: [block:code] { "codes": [ { "code": "result_files_count = requests.post(datasets_api_url + 'tcga_grch38/v0/query/total', json.dumps(files_query), headers=headers)", "language": "python" } ] } [/block] 8. Let's print out the result: [block:code] { "codes": [ { "code": "result_files_count.text", "language": "python" } ] } [/block] If everything went well, you should get the following output: [block:code] { "codes": [ { "code": "'{\"total\": 9}'", "language": "text" } ] } [/block] 9. 
Now we'll fetch details for the nine files that match our query: [block:code] { "codes": [ { "code": "result_files = requests.post(datasets_api_url + 'tcga_grch38/v0/query', json.dumps(files_query), headers=headers)", "language": "python" } ] } [/block] And create lists containing file IDs and file names: [block:code] { "codes": [ { "code": "file_ids = []\nfile_names = []\n\nfor file in result_files.json()['_embedded']['files']:\n file_names.append(file[\"label\"])\n file_ids.append(file[\"id\"])", "language": "python" } ] } [/block] 10. Let's print out the list containing parsed file names: [block:code] { "codes": [ { "code": "file_names", "language": "python" } ] } [/block] You should get a list of all nine file names: [block:code] { "codes": [ { "code": "['54401c2a-7124-42b7-90d8-6267575bce51.FPKM-UQ.txt.gz',\n '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM.txt.gz',\n '4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM.txt.gz',\n '54401c2a-7124-42b7-90d8-6267575bce51.htseq.counts.gz',\n '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.htseq.counts.gz',\n '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM-UQ.txt.gz',\n '4d5b0ba8-64d8-404b-9a83-fc4111686afe.htseq.counts.gz',\n '4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM-UQ.txt.gz',\n '54401c2a-7124-42b7-90d8-6267575bce51.FPKM.txt.gz']", "language": "text" } ] } [/block] 11. Finally, let's copy the files to a project on the CGC so they can be used for further analyses. We will start by importing the **sevenbridges-python** library that is available by default in Data Cruncher: [block:code] { "codes": [ { "code": "import sevenbridges as sbg", "language": "python" } ] } [/block] 12. Let's set up the needed parameters for the CGC API: [block:code] { "codes": [ { "code": "api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token=token)", "language": "python" } ] } [/block] 13. 
Select the destination project on the CGC: [block:code] { "codes": [ { "code": "my_project_name = 'test'\n\n# Find your project\nmy_project = [p for p in api.projects.query(limit=100).all() if p.name == my_project_name][0]", "language": "python" } ] } [/block] Please make sure to replace `test` with the actual name of your project on the CGC. Let's confirm that we have selected an existing project: [block:code] { "codes": [ { "code": "my_project", "language": "python" } ] } [/block] This should return an output similar to: [block:code] { "codes": [ { "code": "<Project: id=rfranklin/test>", "language": "python" } ] } [/block] 14. We will now do the actual copying of files to the defined project: [block:code] { "codes": [ { "code": "for file in file_ids:\n f = api.files.get(file)\n f.copy(project=my_project)", "language": "python" } ] } [/block] And let's verify that the files are in the project: [block:code] { "codes": [ { "code": "my_files = api.files.query(limit = 100, project = my_project.id).all()\n\nfor file in my_files:\n print(file.name)", "language": "python" } ] } [/block] If everything went well, the result should include the following files: [block:code] { "codes": [ { "code": "411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM-UQ.txt.gz\n411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM.txt.gz\n411e5567-82fd-4cc8-99ea-5ed9bd3a198e.htseq.counts.gz\n4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM-UQ.txt.gz\n4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM.txt.gz\n4d5b0ba8-64d8-404b-9a83-fc4111686afe.htseq.counts.gz\n54401c2a-7124-42b7-90d8-6267575bce51.FPKM-UQ.txt.gz\n54401c2a-7124-42b7-90d8-6267575bce51.FPKM.txt.gz\n54401c2a-7124-42b7-90d8-6267575bce51.htseq.counts.gz", "language": "text" } ] } [/block] If your project already contained some files, the returned list will also include those files along with the newly-copied ones. 
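Optionally, you can organize the copied files per sample. The nine file names share a UUID prefix (three files per case), so a short snippet can group them. This is an illustrative sketch, not part of the tutorial's required steps, using the file name list from the output above:

```python
from collections import defaultdict

# File names copied in the steps above (taken from the step 10 output)
file_names = [
    '54401c2a-7124-42b7-90d8-6267575bce51.FPKM-UQ.txt.gz',
    '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM.txt.gz',
    '4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM.txt.gz',
    '54401c2a-7124-42b7-90d8-6267575bce51.htseq.counts.gz',
    '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.htseq.counts.gz',
    '411e5567-82fd-4cc8-99ea-5ed9bd3a198e.FPKM-UQ.txt.gz',
    '4d5b0ba8-64d8-404b-9a83-fc4111686afe.htseq.counts.gz',
    '4d5b0ba8-64d8-404b-9a83-fc4111686afe.FPKM-UQ.txt.gz',
    '54401c2a-7124-42b7-90d8-6267575bce51.FPKM.txt.gz',
]

# Group the files by the UUID part of the name, i.e. per sample
by_sample = defaultdict(list)
for name in file_names:
    by_sample[name.split('.')[0]].append(name)

for sample, files in sorted(by_sample.items()):
    print(sample, '->', len(files), 'files')
```

Each group should contain one FPKM, one FPKM-UQ and one htseq.counts file for the same sample.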
The procedure above fetches and copies files that are classified as [Open Data](doc:dbgap-controlled-data-access#section-open-data) and are available for all CGC users. These are aggregated files and can be used for further analyses using Data Cruncher on the CGC. To obtain files containing Aligned Reads from the TCGA GRCh38 dataset, you will need to have access to [Controlled Data](doc:dbgap-controlled-data-access#section-controlled-data) through [dbGaP](https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login). If the account you are using to log in to the CGC has access to Controlled Data, the query to get Aligned Reads files (step 6 above) should be: [block:code] { "codes": [ { "code": "files_query = {\n \"entity\": \"files\",\n \"hasCase\": cases,\n \"hasExperimentalStrategy\": \"RNA-Seq\",\n \"hasDataType\": \"Aligned Reads\",\n \"hasAccessLevel\": \"Controlled\"\n}", "language": "python" } ] } [/block] The rest of the procedure used to find and copy the files to your project on the CGC is the same as the one for Open Data described above.
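Note that the Open Data and Controlled Data queries differ only in the `hasDataType` and `hasAccessLevel` values. If you plan to run several such variants, you could wrap the query body in a small helper; `build_files_query` below is a hypothetical convenience function, not part of the Datasets API:

```python
def build_files_query(cases, strategy, data_type, access_level):
    """Build a Datasets API files query body like the ones used in this tutorial."""
    return {
        "entity": "files",
        "hasCase": cases,
        "hasExperimentalStrategy": strategy,
        "hasDataType": data_type,
        "hasAccessLevel": access_level,
    }

# Case UUIDs returned in step 4 above
cases = ['44bec761-b603-49c0-8634-f6bfe0319bb1',
         '719082cc-1ebe-4a51-a659-85a59db1d77d',
         '7e1673f8-5758-4963-8804-d5e39f06205b']

# Open Data variant (step 6 above)
open_query = build_files_query(cases, "RNA-Seq",
                               "Gene Expression Quantification", "Open")

# Controlled Data variant
controlled_query = build_files_query(cases, "RNA-Seq",
                                     "Aligned Reads", "Controlled")
```

Either query body can then be passed to the `tcga_grch38/v0/query` endpoint exactly as in step 7 above.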