Fetch metadata from the PDC metadata file


[block:callout]
{
  "type": "info",
  "title": "",
  "body": "This tutorial provides an example of how to fetch metadata for data imported from the PDC. Learn how to [import data from the PDC](doc:import-from-the-pdc)."
}
[/block]

### Introduction

This short tutorial shows how to use Data Cruncher and the `PDC_metadata.json` file from the Public Reference Files gallery to get the desired metadata for proteomic data [imported from the PDC](doc:import-from-the-pdc). The metadata file contains the metadata for data available on the PDC. Its purpose is to make PDC metadata directly available on the CGC, without additional downloading, conversion and uploading. The use case shown in the tutorial is for demonstration purposes; you can customize the code to retrieve the metadata of your choice.

### Prerequisites

* An active account on the CGC.

### Steps
1. [Create a project on the CGC](#section-1-create-a-project-on-the-cgc).
2. [Add the metadata file to your project](#section-2-add-the-metadata-file-to-your-project).
3. [Use Data Cruncher to extract metadata from the file](#section-3-use-data-cruncher-to-extract-metadata-from-the-file).

### 1. Create a project on the CGC
1. On the CGC home page, click **Projects**.
2. In the bottom-right corner of the dropdown menu, click **+ Create a project**. The project creation dialog opens.
3. Name the project **PDC Metadata**.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/9f0aae6-pdc-tutorial-1.png",
        "pdc-tutorial-1.png",
        477,
        482,
        "#e1e9ea"
      ]
    }
  ]
}
[/block]

4. Keep the predefined values for other settings. If you want to learn more about project settings, see [more details](doc:modify-project-settings).
5. Click **Create**. The project is now created and you are taken to the [project dashboard](doc:manage-the-project-dashboard).

The next step is to add the metadata file to the project.

### 2. Add the metadata file to your project
1. While in the **PDC Metadata** project, click the **Files** tab.
2. Click **+ Add files**.
3. In the search box, enter `PDC_metadata.json`. The file is displayed in the file list.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/7d0a8d4-pdc-tutorial-2-new.png",
        "pdc-tutorial-2-new.png",
        484,
        268,
        "#3b5b76"
      ]
    }
  ]
}
[/block]
4. Select the checkbox next to the file.
5. Click **Copy to project** in the top-right corner.
6. (Optional) Add or remove [file tags](doc:tag-your-files).
7. Click **Copy**. The file is now available in your project.

### 3. Use Data Cruncher to extract metadata from the file

Data Cruncher is an interactive environment within the CGC that allows you to perform further analyses of your data using JupyterLab or RStudio. In this tutorial, we will use Python code inside a Jupyter notebook to fetch metadata from the PDC metadata file that is already available on the CGC.
1. In the **PDC Metadata** project, click the **Interactive analysis** tab on the right.
2. On the **Data Cruncher** card, click **Open**. You are taken to the Data Cruncher.
3. In the top-right corner, click **Create new analysis**. If you don't have any previous analyses, you will see the **Create your first analysis** button in the center of the screen.
4. In the **Analysis name** field, enter **PDC Metadata**.
5. Select **JupyterLab** as the analysis environment.
6. Click **Next**.
7. Keep the predefined values for **Compute requirements** and click **Start the analysis**. The CGC will start acquiring an adequate instance for your analysis, which may take a few minutes. Once the analysis is ready, you will be notified.
8. Run your analysis by clicking **Open in editor** in the top-right corner. JupyterLab opens.
9. In the **Notebook** section on the JupyterLab home screen, select **Python 3**. You can now start entering the analysis code in the cells.
Please note that each block of code needs to be executed after it is entered, either by clicking <i class="fa fa-play" aria-hidden="true"></i> or by pressing Shift + Enter on the keyboard.
10. Let's begin by importing the JSON module in Python:
[block:code]
{
  "codes": [
    {
      "code": "import json",
      "language": "python"
    }
  ]
}
[/block]
11. We will now load the metadata file:
[block:code]
{
  "codes": [
    {
      "code": "with open('/sbgenomics/project-files/PDC_metadata.json') as pdc_json:\n    pdc = json.load(pdc_json)",
      "language": "python"
    }
  ]
}
[/block]
12. We will assume that we have analyzed a specific raw file from the **CPTAC3 Discovery** project for which we know the file ID:
[block:code]
{
  "codes": [
    {
      "code": "file_id = '44beb078-f945-11e8-953d-005056921935'\npdc[file_id]",
      "language": "python"
    }
  ]
}
[/block]
This outputs the metadata structure of the file (shortened to save space):
[block:code]
{
  "codes": [
    {
      "code": "{'file_id': '44beb078-f945-11e8-953d-005056921935',\n 'study_id': ['c935c587-0cd1-11e9-a064-0a9c39d33490'],\n 'submitter_id_name': ['CPTAC UCEC Discovery Study - Proteome'],\n 'file_name': '01CPTAC_UCEC_W_PNNL_20170922_B1S1_f01.raw',\n ...\n 'aliquots': [{'aliquot_id': '1fd8187c-1272-11e9-afb9-0a9c39d33490',\n   'aliquot_submitter_id': 'CPT0080300003',\n   'label': 'tmt10_130c',\n   'sample_id': '26cc88f8-1259-11e9-afb9-0a9c39d33490',\n   'sample_submitter_id': 'C3L-01248-01',\n   'case_id': '619df278-118a-11e9-afb9-0a9c39d33490',\n   ...\n   'case': {'case_id': '619df278-118a-11e9-afb9-0a9c39d33490',\n    'case_submitter_id': 'C3L-01248',\n    'project_submitter_id': 'PJ-CPTAC3',\n    ...\n    'diagnoses': [{'diagnosis_id': 'adb6a5bd-0f5a-11e9-a064-0a9c39d33490',\n      'diagnosis_submitter_id': 'C3L-01248-DIAG',\n      ...],\n    'demographics': [{'demographic_id': '47867d02-0f56-11e9-a064-0a9c39d33490',\n      'demographic_submitter_id': 'C3L-01248-DEMO',\n      'ethnicity': 'not reported',\n      ...],\n    'sample': {'sample_id': '26cc88f8-1259-11e9-afb9-0a9c39d33490',\n     'gdc_sample_id': None,\n     'gdc_project_id': None,\n     ...}}},\n     ...",
      "language": "json"
    }
  ]
}
[/block]
As you can see, there are multiple levels of nested objects:

* The first level contains general information about the **file**, such as `file_name`, `project_name` etc.
* The second level contains **aliquot** information (`aliquot_id`, `label`...).
* The third level, located under **aliquot**, contains **case** information (`case_id`, `case_submitter_id`...).
* Finally, **diagnoses**, **demographics** and **sample** information is contained within **case** and represents the fourth level.

13. With this structure in mind, we can write the following code:
[block:code]
{
  "codes": [
    {
      "code": "label_map = {}\naliquots = pdc[file_id]['aliquots']\nfor aliquot in aliquots:\n    label = aliquot['label']\n    if aliquot['case_id']:\n        case_id = aliquot['case_id']\n        label_map[label] = case_id\n    else:\n        label_map[label] = 'Reference'",
      "language": "python"
    }
  ]
}
[/block]
We get a dictionary that maps each **TMT10** label to a **Case ID**; aliquots without a case ID are marked as `Reference`. Let's print out the dictionary:
[block:code]
{
  "codes": [
    {
      "code": "label_map",
      "language": "python"
    }
  ]
}
[/block]
Here's what the created dictionary looks like:
[block:code]
{
  "codes": [
    {
      "code": "{'tmt10_130c': '619df278-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_127n': 'b2a27cfc-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_129c': '5498017e-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_128c': '66ee6e22-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_131': 'b0effb8f-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_127c': 'b2a27cfc-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_130n': 'a7378d52-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_126': 'Reference',\n 'tmt10_129n': '5498017e-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_128n': '66ee6e22-118a-11e9-afb9-0a9c39d33490'}",
      "language": "json"
    }
  ]
}
[/block]
In this example, we used a specific `file_id` from the **CPTAC3 Discovery** project to get the metadata of interest.
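The same nested structure can be walked one level deeper, for example to read a field from **demographics** for each label. The following is a minimal sketch, not part of the original tutorial: `sample_record` is a hand-written, trimmed stand-in shaped like the output in step 12, the helper name `ethnicity_by_label` is hypothetical, and the shape of the reference aliquot (no `case` attached) is an assumption. In the notebook you would pass `pdc[file_id]` instead of `sample_record`.

```python
# Trimmed stand-in for one entry of the metadata file, shaped like the
# output shown in step 12. In the notebook, use pdc[file_id] instead.
sample_record = {
    'file_id': '44beb078-f945-11e8-953d-005056921935',
    'aliquots': [
        {
            'label': 'tmt10_130c',
            'case_id': '619df278-118a-11e9-afb9-0a9c39d33490',
            'case': {
                'case_submitter_id': 'C3L-01248',
                'demographics': [{'ethnicity': 'not reported'}],
            },
        },
        # Assumption: a reference channel has no case attached.
        {'label': 'tmt10_126', 'case_id': None, 'case': None},
    ],
}


def ethnicity_by_label(record):
    """Map each aliquot label to the ethnicity values found under its case."""
    result = {}
    for aliquot in record.get('aliquots', []):
        # Fall back to empty containers when a level is missing or None.
        case = aliquot.get('case') or {}
        demographics = case.get('demographics') or []
        result[aliquot['label']] = [d.get('ethnicity') for d in demographics]
    return result


ethnicity_map = ethnicity_by_label(sample_record)
print(ethnicity_map)
```

Using `.get()` with fallbacks at each level keeps the loop from raising `KeyError` or `TypeError` when an aliquot lacks case or demographic information.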
Now, let's try to fetch all proteome case IDs from the **CPTAC2 Confirmatory** project:
[block:code]
{
  "codes": [
    {
      "code": "confirmatory_proteome_cases = []\nfor file_id in pdc:\n    if pdc[file_id]['project_name'] == 'CPTAC2 Confirmatory' and 'Proteome' in pdc[file_id]['analytical_fraction']:\n        aliquots = pdc[file_id]['aliquots']\n        for aliquot in aliquots:\n            if aliquot['case_id'] and aliquot['case_id'] not in confirmatory_proteome_cases:\n                confirmatory_proteome_cases.append(aliquot['case_id'])",
      "language": "python"
    }
  ]
}
[/block]
Let's check the length of the created list:
[block:code]
{
  "codes": [
    {
      "code": "len(confirmatory_proteome_cases)",
      "language": "python"
    }
  ]
}
[/block]
This returns `338`, which is the same as the value obtained by querying the PDC Data Portal when the **CPTAC2 Confirmatory** project is selected:
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/f9cc53b-pdc-tutorial-3.png",
        "pdc-tutorial-3.png",
        1200,
        487,
        "#e5e9eb"
      ]
    }
  ]
}
[/block]
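If you need the same kind of query for other projects or analytical fractions, the loop above can be wrapped in a reusable function. This is only a sketch under stated assumptions: the function name `collect_case_ids` and the `toy_metadata` dictionary (with placeholder file and case IDs) are hypothetical and are not real PDC values, and `analytical_fraction` is assumed to be a string or a list, as implied by the `in` check in the tutorial code. In the notebook you would pass the loaded `pdc` dictionary instead.

```python
def collect_case_ids(metadata, project_name, fraction):
    """Return the unique case IDs, in first-seen order, for matching files."""
    seen = set()      # fast membership checks
    case_ids = []     # preserves the order in which cases first appear
    for record in metadata.values():
        if record.get('project_name') != project_name:
            continue
        # `in` works for both a string and a list of fractions.
        if fraction not in (record.get('analytical_fraction') or ''):
            continue
        for aliquot in record.get('aliquots', []):
            case_id = aliquot.get('case_id')
            if case_id and case_id not in seen:
                seen.add(case_id)
                case_ids.append(case_id)
    return case_ids


# Hypothetical miniature metadata; all IDs are placeholders.
toy_metadata = {
    'file-1': {'project_name': 'CPTAC2 Confirmatory',
               'analytical_fraction': 'Proteome',
               'aliquots': [{'case_id': 'case-a'}, {'case_id': None}]},
    'file-2': {'project_name': 'CPTAC2 Confirmatory',
               'analytical_fraction': 'Proteome',
               'aliquots': [{'case_id': 'case-a'}, {'case_id': 'case-b'}]},
    'file-3': {'project_name': 'CPTAC3 Discovery',
               'analytical_fraction': 'Proteome',
               'aliquots': [{'case_id': 'case-d'}]},
}

toy_cases = collect_case_ids(toy_metadata, 'CPTAC2 Confirmatory', 'Proteome')
print(toy_cases)
```

Tracking seen IDs in a set avoids rescanning the result list for every aliquot, which matters when iterating over the whole metadata file.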