
Fetch metadata from the PDC CPTAC3 metadata file


[block:callout]
{
  "type": "info",
  "title": "",
  "body": "This tutorial provides an example of how to fetch metadata for data imported from the PDC. Learn how to [import data from the PDC](doc:import-from-the-pdc)."
}
[/block]

### Introduction

This short tutorial shows how to use Data Cruncher and the `PDC_CPTAC3_metadata.json` file from the Public Reference Files gallery to get metadata for proteomic data [imported from the PDC](doc:import-from-the-pdc). The file contains the metadata for CPTAC3 data available on the PDC. Its purpose is to make this metadata directly available on the CGC, so that you don't have to download, convert, and upload it yourself. The use case shown in this tutorial is for demonstration purposes; you can customize the code to retrieve the metadata of your choice.

### Prerequisites

* An active account on the CGC.

### Steps
1. [Create a project on the CGC](#section-1-create-a-project-on-the-cgc).
2. [Add the metadata file to your project](#section-2-add-the-metadata-file-to-your-project).
3. [Use Data Cruncher to extract metadata from the file](#section-3-use-data-cruncher-to-extract-metadata-from-the-file).

### 1. Create a project on the CGC
1. On the CGC home page, click **Projects**.
2. In the bottom-right corner of the dropdown menu, click **+ Create a project**. The project creation dialog opens.
3. Name the project **PDC Metadata**.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/9f0aae6-pdc-tutorial-1.png",
        "pdc-tutorial-1.png",
        477,
        482,
        "#e1e9ea"
      ]
    }
  ]
}
[/block]

4. Keep the predefined values for the other settings. To learn more about project settings, see [more details](doc:modify-project-settings).
5. Click **Create**. The project is created and you are taken to the [project dashboard](doc:manage-the-project-dashboard).

The next step is to add the metadata file to the project.

### 2. Add the metadata file to your project
1. While in the **PDC Metadata** project, click the **Files** tab.
2. Click **+ Add files**.
3. In the search box, enter `PDC_CPTAC3_metadata.json`. The file is displayed in the file list.
[block:image]
{
  "images": [
    {
      "image": [
        "https://files.readme.io/727b92d-pdc-tutorial-2.png",
        "pdc-tutorial-2.png",
        492,
        331,
        "#eeeeec"
      ]
    }
  ]
}
[/block]
4. Select the checkbox next to the file.
5. Click **Copy to project** in the top-right corner.
6. (Optional) Add or remove [file tags](doc:tag-your-files).
7. Click **Copy**. The file is now available in your project.

### 3. Use Data Cruncher to extract metadata from the file

Data Cruncher is an interactive environment within the CGC that allows you to perform further analyses of your data using JupyterLab or RStudio. In this tutorial, we will use Python code inside a Jupyter notebook to fetch metadata from the PDC CPTAC3 metadata file that is already available on the CGC.
1. In the **PDC Metadata** project, click the **Interactive analysis** tab on the right.
2. On the **Data Cruncher** card, click **Open**. You are taken to the Data Cruncher.
3. In the top-right corner, click **Create new analysis**. If you don't have any previous analyses, you will see the **Create your first analysis** button in the center of the screen.
4. In the **Analysis name** field, enter **PDC CPTAC3 Metadata**.
5. Select **JupyterLab** as the analysis environment.
6. Click **Next**.
7. Keep the predefined values for **Compute requirements** and click **Start the analysis**. The CGC will start acquiring an adequate instance for your analysis, which may take a few minutes. Once the analysis is ready, you will be notified.
8. Run your analysis by clicking **Open in editor** in the top-right corner. JupyterLab opens.
9. In the **Notebook** section on the JupyterLab home screen, select **Python 3**. You can now start entering the analysis code in the cells. Note that each block of code you enter needs to be executed by clicking <i class="fa fa-play" aria-hidden="true"></i> or by pressing Shift + Enter on the keyboard.
10. Let's begin by importing the JSON module in Python:
[block:code]
{
  "codes": [
    {
      "code": "import json",
      "language": "python"
    }
  ]
}
[/block]
11. We will now load the metadata file:
[block:code]
{
  "codes": [
    {
      "code": "# Project files are available under /sbgenomics/project-files/\nwith open('/sbgenomics/project-files/PDC_CPTAC3_metadata.json') as cptac_json:\n    cptac = json.load(cptac_json)",
      "language": "python"
    }
  ]
}
[/block]
12. Let's assume that we have analyzed a specific raw file for which we know the file ID, and we want to associate its aliquot labels with the corresponding case IDs:
[block:code]
{
  "codes": [
    {
      "code": "# ID of the raw file whose aliquots we want to look up\nfile_id = '44beb078-f945-11e8-953d-005056921935'\n\n# Map each aliquot label to its case ID\nlabel_map = {}\naliquots = cptac[file_id]['aliquots']\nfor aliquot in aliquots:\n    label = aliquot['label']\n    if 'sample' in aliquot:\n        case_id = aliquot['sample']['case']['case_id']\n        label_map[label] = case_id\n    else:\n        # Aliquots without a 'sample' entry are marked as 'Reference'\n        label_map[label] = 'Reference'",
      "language": "python"
    }
  ]
}
[/block]
13. The result is a dictionary that maps TMT10 labels to case IDs. Let's print it out:
[block:code]
{
  "codes": [
    {
      "code": "label_map",
      "language": "python"
    }
  ]
}
[/block]
Here's what the resulting dictionary looks like:
[block:code]
{
  "codes": [
    {
      "code": "{'tmt10_130c': '619df278-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_127n': 'b2a27cfc-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_129c': '5498017e-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_128c': '66ee6e22-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_131': 'b0effb8f-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_127c': 'b2a27cfc-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_130n': 'a7378d52-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_126': 'Reference',\n 'tmt10_129n': '5498017e-118a-11e9-afb9-0a9c39d33490',\n 'tmt10_128n': '66ee6e22-118a-11e9-afb9-0a9c39d33490'}",
      "language": "python"
    }
  ]
}
[/block]
This was a basic example of how to retrieve and associate metadata for data imported from the PDC. To better understand the structure of the metadata file and gather other information, you can print out the whole object for this file ID:
[block:code]
{
  "codes": [
    {
      "code": "cptac[file_id]",
      "language": "python"
    }
  ]
}
[/block]
As you can see, there are multiple levels of nested objects:

* The first level contains general information about the _**File**_, such as `file_name`, `project_name`, etc.
* The second level contains _**Aliquot**_ information (`aliquot_id`, `label`, ...).
* The third level contains _**Sample**_ information, nested under the aliquot (`sample_id`, `sample_submitter_id`, ...).
* The fourth level contains _**Case**_ information (`case_id`, `disease_type`, ...).
* Finally, the fifth level contains _**Demographics**_ and _**Diagnoses**_ information, both nested under _Case_.
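
To make the nesting more concrete, here is a minimal sketch that walks the first four levels listed above and flattens one file record into one row per aliquot. It rests on a few assumptions: `flatten_aliquots` is a hypothetical helper written for this illustration, not part of any CGC or PDC library, and `sample_record` is a made-up, heavily trimmed stand-in for `cptac[file_id]` that uses placeholder values and only the key names mentioned in this tutorial.

```python
def flatten_aliquots(record):
    """Return one dict per aliquot with selected fields from each nesting level."""
    rows = []
    for aliquot in record.get('aliquots', []):
        # Aliquots without a 'sample' key (treated as 'Reference' earlier)
        # fall back to empty dicts, so their sample and case fields are None.
        sample = aliquot.get('sample', {})
        case = sample.get('case', {})
        rows.append({
            'file_name': record.get('file_name'),      # level 1: File
            'label': aliquot.get('label'),             # level 2: Aliquot
            'sample_id': sample.get('sample_id'),      # level 3: Sample
            'case_id': case.get('case_id'),            # level 4: Case
            'disease_type': case.get('disease_type'),  # level 4: Case
        })
    return rows


# Made-up, trimmed stand-in for cptac[file_id]; every value is a placeholder.
sample_record = {
    'file_name': 'example.raw',
    'project_name': 'Example project',
    'aliquots': [
        {'aliquot_id': 'aliquot-1', 'label': 'tmt10_126'},
        {
            'aliquot_id': 'aliquot-2',
            'label': 'tmt10_127n',
            'sample': {
                'sample_id': 'sample-2',
                'case': {'case_id': 'case-2', 'disease_type': 'Example disease'},
            },
        },
    ],
}

rows = flatten_aliquots(sample_record)
for row in rows:
    print(row)
```

In the notebook, you could call `flatten_aliquots(cptac[file_id])` instead, assuming the real record uses the same key names. Because the helper reads every field with `dict.get`, a missing key yields `None` rather than raising a `KeyError`.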