{"_id":"58f5d5567891630f00fe4e78","version":{"_id":"55faf11ba62ba1170021a9aa","project":"55faf11ba62ba1170021a9a7","__v":38,"createdAt":"2015-09-17T16:58:03.490Z","releaseDate":"2015-09-17T16:58:03.490Z","categories":["55faf11ca62ba1170021a9ab","55faf8f4d0e22017005b8272","55faf91aa62ba1170021a9b5","55faf929a8a7770d00c2c0bd","55faf932a8a7770d00c2c0bf","55faf94b17b9d00d00969f47","55faf958d0e22017005b8274","55faf95fa8a7770d00c2c0c0","55faf96917b9d00d00969f48","55faf970a8a7770d00c2c0c1","55faf98c825d5f19001fa3a6","55faf99aa62ba1170021a9b8","55faf99fa62ba1170021a9b9","55faf9aa17b9d00d00969f49","55faf9b6a8a7770d00c2c0c3","55faf9bda62ba1170021a9ba","5604570090ee490d00440551","5637e8b2fbe1c50d008cb078","5649bb624fa1460d00780add","5671974d1b6b730d008b4823","5671979d60c8e70d006c9760","568e8eef70ca1f0d0035808e","56d0a2081ecc471500f1795e","56d4a0adde40c70b00823ea3","56d96b03dd90610b00270849","56fbb83d8f21c817002af880","573c811bee2b3b2200422be1","576bc92afb62dd20001cda85","5771811e27a5c20e00030dcd","5785191af3a10c0e009b75b0","57bdf84d5d48411900cd8dc0","57ff5c5dc135231700aed806","5804caf792398f0f00e77521","58458b4fba4f1c0f009692bb","586d3c287c6b5b2300c05055","58ef66d88646742f009a0216","58f5d52d7891630f00fe4e77","59a555bccdbd85001bfb1442"],"is_deprecated":false,"is_hidden":false,"is_beta":true,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"parentDoc":null,"user":"5767bc73bb15f40e00a28777","category":{"_id":"58f5d52d7891630f00fe4e77","project":"55faf11ba62ba1170021a9a7","version":"55faf11ba62ba1170021a9aa","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2017-04-18T08:58:21.978Z","from_sync":false,"order":31,"slug":"data-cruncher","title":"DATA 
CRUNCHER"},"project":"55faf11ba62ba1170021a9a7","__v":0,"updates":[],"next":{"pages":[],"description":""},"createdAt":"2017-04-18T08:59:02.245Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"settings":"","results":{"codes":[]},"auth":"required","params":[],"url":""},"isReference":false,"order":1,"body":"## Overview\n\nData Cruncher is an interactive analysis tool on the CGC for exploring and mining data using Jupyter notebooks.\n\n## Objective\n\nThe aim of this guide is to show you how to create and run your first analysis in Data Cruncher, using a real-life example of filtering VCF files based on the alternative allele read depth (AD) and the read depth (DP) ratio.\n\n## Procedure\n\n* **[ 1 ]** Access Data Cruncher from a project on the Platform.\n* **[ 2 ]** Create and set up an analysis.\n* **[ 3 ]** Enter and execute code within the analysis to get results.\n\n## Prerequisites\n* You need to download this [sample generic VCF file](https://igor.sbgenomics.com/staticfiles/TEST.bam.vcf) and [upload it to the project on the CGC](doc:upload-to-the-cgc) in which you want to execute the analysis.\n* You need execute permissions in the project to be able to run the analysis.\n\n### [ 1 ] Access Data Cruncher\n\n1. Open the project on the CGC that contains the uploaded VCF file.\n2. From the project's dashboard, click the **Interactive Analysis** tab.\nThe list of available interactive analysis tools opens. \n3. On the **Data Cruncher** card click **Open**.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/0739319-cruncher_card.png\",\n        \"cruncher_card.png\",\n        293,\n        441,\n        \"#eeebec\"\n      ]\n    }\n  ]\n}\n[/block]\nThis takes you to the Data Cruncher home page.\n\n### [ 2 ] Create and set up your first analysis\nIf there are no previous analyses, the main Data Cruncher screen will be blank.\n1. 
Click **Create your first analysis**.\nThe **Create new analysis wizard** is displayed.\n2. On the first screen, enter **VCF Filtering** in the **Analysis name** field.\n3. Click **Next**.\n4. Select the instance for the analysis.\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/efb167a-cruncher_quickstart_1.png\",\n        \"cruncher_quickstart_1.png\",\n        572,\n        418,\n        \"#ecedec\"\n      ]\n    }\n  ]\n}\n[/block]\nThe list displays the instances along with their disk size, number of vCPUs and memory (shown in brackets). The default instance is **c3.2xlarge** that has **160 GB** of SSD storage, **8 vCPUs** and **15 GB** of RAM. \n[block:callout]\n{\n  \"type\": \"info\",\n  \"body\": \"Any instance will be stopped after 30 minutes of inactivity within the analysis that is running on the instance. This also includes stopping the analysis and saving all files that meet the automatic saving criteria or have been selected to be saved as project files. Files that do not meet the criteria and are not manually saved to the project will be lost.\"\n}\n[/block]\n5. Click **Next**.\n6. Define the automatic saving criteria:\n * **Ignore the following file types** - Files that have the listed extensions will never be automatically saved when the analysis is stopped. We will keep the suggested ignored file types, `.log, .zip`.\n * **Ignore files larger than** - Files bigger than the specified size will not be automatically saved when the analysis is stopped.\n7. Click **Start the analysis**.\nThe CGC will start acquiring an adequate instance for your analysis, which may take a few minutes. Once an instance is ready, you will be notified.\n\n### [ 3 ] Start the analysis\nOnce the CGC has acquired an instance for your analysis, you can open the editor and run your analysis.\n\n1. Click **Open in editor**.\nThe editor opens in a new window.\n2. On the welcome dialog, click **Notebook**.\n3. 
Enter the notebook details:\n * **File Name** - Name the notebook **VCF_Filtering**.\n * **Kernel** - This is the “computational engine” that executes the code contained in a notebook. Select **Python 2**.\n4. Click **Create**.\nYour notebook is now ready. You can start entering the code and the additional text.\n5. In the first cell paste the following code:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"import pandas as pd\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n6. Click <img src=\"https://files.readme.io/cdb570a-run_trimmed.png\"\nwidth=\"auto\" align=\"inline\" style=\"margin:1px\"/> on the toolbar. This executes the current cell and creates a new one below.\n7. Click **Code** on the toolbar and select **Markdown** from the dropdown list. This changes the cell type to Markdown, so we can add a title.\n8. Paste the following text into the cell:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Constants and Functions\",\n      \"language\": \"markdown\"\n    }\n  ]\n}\n[/block]\n9. Click <img src=\"https://files.readme.io/cdb570a-run_trimmed.png\"\nwidth=\"auto\" align=\"inline\" style=\"margin:1px\"/>.\n10. 
Paste the following code into the next cell:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"vcf_column_names = [\\\"CHROM\\\", \\\"POS\\\", \\\"ID\\\", \\\"REF\\\", \\\"ALT\\\", \\\"QUAL\\\", \\\"FILTER\\\", \\\"INFO\\\", \\\"FORMAT\\\", \\\"NORMAL\\\", \\\"TUMOR\\\"]\\n \\ndef read_vcf_df(vcf_path, vcf_column_names):\\n    return pd.read_csv(vcf_path, comment=\\\"#\\\", header=None, names = vcf_column_names , sep=\\\"\\\\t\\\")\\n \\ndef ad_dp_calc(x):       \\n    ref = x[0]\\n    alts = x[1].split(',')\\n    gt = x[2].split(':')\\n    if len(gt) == 1:\\n        return 0\\n    dp = float(gt[0].replace(\\\",\\\",\\\".\\\"))\\n    a = float(gt[4].replace(\\\",\\\",\\\".\\\"))\\n    c = float(gt[5].replace(\\\",\\\",\\\".\\\"))\\n    g = float(gt[6].replace(\\\",\\\",\\\".\\\"))\\n    t = float(gt[7].replace(\\\",\\\",\\\".\\\"))\\n    ad = 0\\n    for alt in alts:\\n        if alt == \\\"A\\\":\\n            ad += a\\n        elif alt == \\\"C\\\":\\n            ad += c\\n        elif alt == \\\"G\\\":\\n            ad += g\\n        elif alt == \\\"T\\\":\\n            ad += t\\n    return float(ad) / (dp + len(alts))\\n \\ndef read_vcf_header(file_name):\\n    header = \\\"\\\"\\n    with open(file_name) as f:\\n        for line in f:\\n            if line[0] == \\\"#\\\":\\n                header += line\\n            else:\\n                break\\n    return header\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n11. Click <img src=\"https://files.readme.io/cdb570a-run_trimmed.png\"\nwidth=\"auto\" align=\"inline\" style=\"margin:1px\"/>.\n12. Change the type of the blank cell to markdown and paste the following text:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# AD/DP VCF Filtering Description\\n \\nWe are **filtering VCF files** based on the alternative allele read depth (**AD**) and the read depth (**DP**) ratio. 
We are discarding all the variants that don't pass the criteria of **AD/DP > 0.15**.\",\n      \"language\": \"markdown\"\n    }\n  ]\n}\n[/block]\n13. Click <img src=\"https://files.readme.io/cdb570a-run_trimmed.png\"\nwidth=\"auto\" align=\"inline\" style=\"margin:1px\"/>.\n14. Change the type of the blank cell to markdown and paste the following text:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# VCF Filtering\",\n      \"language\": \"markdown\"\n    }\n  ]\n}\n[/block]\n15. Click <img src=\"https://files.readme.io/cdb570a-run_trimmed.png\"\nwidth=\"auto\" align=\"inline\" style=\"margin:1px\"/>.\n16. Paste the following code:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"vcf_name = \\\"PATH_TO_VCF_FILE\\\"\\nvcf = read_vcf_df(vcf_name, vcf_column_names)\\ntitle_name = \\\"Distribution of AD/DP Ratios for FPs SNVs GRAL\\\"\\nvcf[\\\"AD_DP\\\"] = vcf[[\\\"REF\\\",\\\"ALT\\\",\\\"TUMOR\\\"]].apply(lambda x: ad_dp_calc(x), axis=1)\\nvcf = vcf[vcf[\\\"AD_DP\\\"] >= 0.15 ]\\ndel vcf[\\\"AD_DP\\\"]\\nheader = read_vcf_header(vcf_name)\\nwith open(\\\"REALIGNED_TUMOR.bam.filtered.vcf\\\",\\\"w\\\") as f:\\n    f.write(header)\\n    vcf.to_csv(f, sep=\\\"\\\\t\\\", header=False, index=False )\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nOn line 1 above, there is a placeholder named `PATH_TO_VCF_FILE` that we need to replace with the actual path to the uploaded file:\n\n a. Open the **Project Files** tab.\n b. Click **TEST.bam.vcf**. This will copy the file path to the clipboard.\n c. Paste the copied path instead of `PATH_TO_VCF_FILE` on line 1 of the code block above, making sure to keep the quotation marks.\n17. Click <img src=\"https://files.readme.io/cdb570a-run_trimmed.png\"\nwidth=\"auto\" align=\"inline\" style=\"margin:1px\"/>.\nThe analysis now executes. 
A new file named `REALIGNED_TUMOR.bam.filtered.vcf` is created and you can see it under the **Files** tab in the analysis editor.","excerpt":"","slug":"data-cruncher-quickstart","type":"basic","title":"Data Cruncher Quickstart"}

# Data Cruncher Quickstart


## Overview

Data Cruncher is an interactive analysis tool on the CGC for exploring and mining data using Jupyter notebooks.

## Objective

This guide shows you how to create and run your first analysis in Data Cruncher, using a real-life example of filtering VCF files based on the ratio of the alternative allele read depth (AD) to the read depth (DP).

## Procedure

* **[ 1 ]** Access Data Cruncher from a project on the Platform.
* **[ 2 ]** Create and set up an analysis.
* **[ 3 ]** Enter and execute code within the analysis to get results.

## Prerequisites

* Download this [sample generic VCF file](https://igor.sbgenomics.com/staticfiles/TEST.bam.vcf) and [upload it to the project on the CGC](doc:upload-to-the-cgc) in which you want to execute the analysis.
* You need execute permissions in the project to be able to run the analysis.

### [ 1 ] Access Data Cruncher

1. Open the project on the CGC that contains the uploaded VCF file.
2. From the project's dashboard, click the **Interactive Analysis** tab. The list of available interactive analysis tools opens.
3. On the **Data Cruncher** card, click **Open**.

    ![cruncher_card.png](https://files.readme.io/0739319-cruncher_card.png)

    This takes you to the Data Cruncher home page.

### [ 2 ] Create and set up your first analysis

If there are no previous analyses, the main Data Cruncher screen will be blank.

1. Click **Create your first analysis**. The **Create new analysis** wizard is displayed.
2. On the first screen, enter **VCF Filtering** in the **Analysis name** field.
3. Click **Next**.
4. Select the instance for the analysis.

    ![cruncher_quickstart_1.png](https://files.readme.io/efb167a-cruncher_quickstart_1.png)

    The list displays the available instances along with their disk size, number of vCPUs, and memory (shown in brackets).
    The default instance is **c3.2xlarge**, which has **160 GB** of SSD storage, **8 vCPUs**, and **15 GB** of RAM.

    > **Note:** An instance is stopped after 30 minutes of inactivity in the analysis running on it. Stopping the instance also stops the analysis and saves all files that meet the automatic saving criteria or have been selected to be saved as project files. Files that do not meet the criteria and are not manually saved to the project will be lost.

5. Click **Next**.
6. Define the automatic saving criteria:
    * **Ignore the following file types** - Files with the listed extensions will never be automatically saved when the analysis is stopped. We will keep the suggested ignored file types, `.log, .zip`.
    * **Ignore files larger than** - Files bigger than the specified size will not be automatically saved when the analysis is stopped.
7. Click **Start the analysis**. The CGC starts acquiring an adequate instance for your analysis, which may take a few minutes. You will be notified once the instance is ready.

### [ 3 ] Start the analysis

Once the CGC has acquired an instance for your analysis, you can open the editor and run your analysis.

1. Click **Open in editor**. The editor opens in a new window.
2. On the welcome dialog, click **Notebook**.
3. Enter the notebook details:
    * **File Name** - Name the notebook **VCF_Filtering**.
    * **Kernel** - This is the "computational engine" that executes the code contained in a notebook. Select **Python 2**.
4. Click **Create**. Your notebook is now ready, and you can start entering code and accompanying text.
5. In the first cell, paste the following code:

    ```python
    import pandas as pd
    ```

6. Click <img src="https://files.readme.io/cdb570a-run_trimmed.png" width="auto" align="inline" style="margin:1px"/> on the toolbar. This executes the current cell and creates a new one below it.
7. Click **Code** on the toolbar and select **Markdown** from the dropdown list. This changes the cell type to Markdown, so we can add a title.
8. Paste the following text into the cell:

    ```markdown
    # Constants and Functions
    ```

9. Click <img src="https://files.readme.io/cdb570a-run_trimmed.png" width="auto" align="inline" style="margin:1px"/>.
10. Paste the following code into the next cell:

    ```python
    vcf_column_names = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "NORMAL", "TUMOR"]

    def read_vcf_df(vcf_path, vcf_column_names):
        # Skip header lines (starting with "#") and load the VCF body as a DataFrame
        return pd.read_csv(vcf_path, comment="#", header=None, names=vcf_column_names, sep="\t")

    def ad_dp_calc(x):
        # x holds (REF, ALT, TUMOR); returns the AD/DP ratio for the variant
        ref = x[0]
        alts = x[1].split(',')
        gt = x[2].split(':')
        if len(gt) == 1:
            # No per-base counts in the sample field
            return 0
        dp = float(gt[0].replace(",", "."))
        a = float(gt[4].replace(",", "."))
        c = float(gt[5].replace(",", "."))
        g = float(gt[6].replace(",", "."))
        t = float(gt[7].replace(",", "."))
        # Sum the read counts of the alternative alleles
        ad = 0
        for alt in alts:
            if alt == "A":
                ad += a
            elif alt == "C":
                ad += c
            elif alt == "G":
                ad += g
            elif alt == "T":
                ad += t
        return float(ad) / (dp + len(alts))

    def read_vcf_header(file_name):
        # Collect the leading "#" header lines so they can be written back out
        header = ""
        with open(file_name) as f:
            for line in f:
                if line[0] == "#":
                    header += line
                else:
                    break
        return header
    ```

11. Click <img src="https://files.readme.io/cdb570a-run_trimmed.png" width="auto" align="inline" style="margin:1px"/>.
12. Change the type of the blank cell to Markdown and paste the following text:

    ```markdown
    # AD/DP VCF Filtering Description

    We are **filtering VCF files** based on the ratio of the alternative allele read depth (**AD**) to the read depth (**DP**). We are discarding all variants that do not meet the criterion **AD/DP ≥ 0.15**.
    ```

13. Click <img src="https://files.readme.io/cdb570a-run_trimmed.png" width="auto" align="inline" style="margin:1px"/>.
14. Change the type of the blank cell to Markdown and paste the following text:

    ```markdown
    # VCF Filtering
    ```

15. Click <img src="https://files.readme.io/cdb570a-run_trimmed.png" width="auto" align="inline" style="margin:1px"/>.
16. Paste the following code:

    ```python
    vcf_name = "PATH_TO_VCF_FILE"
    vcf = read_vcf_df(vcf_name, vcf_column_names)
    title_name = "Distribution of AD/DP Ratios for FPs SNVs GRAL"
    vcf["AD_DP"] = vcf[["REF", "ALT", "TUMOR"]].apply(lambda x: ad_dp_calc(x), axis=1)
    vcf = vcf[vcf["AD_DP"] >= 0.15]
    del vcf["AD_DP"]
    header = read_vcf_header(vcf_name)
    with open("REALIGNED_TUMOR.bam.filtered.vcf", "w") as f:
        f.write(header)
        vcf.to_csv(f, sep="\t", header=False, index=False)
    ```

    On line 1 above, replace the placeholder `PATH_TO_VCF_FILE` with the actual path to the uploaded file:

    a. Open the **Project Files** tab.
    b. Click **TEST.bam.vcf**. This copies the file path to the clipboard.
    c. Paste the copied path in place of `PATH_TO_VCF_FILE` on line 1 of the code block above, making sure to keep the quotation marks.
17. Click <img src="https://files.readme.io/cdb570a-run_trimmed.png" width="auto" align="inline" style="margin:1px"/>. The analysis now executes. A new file named `REALIGNED_TUMOR.bam.filtered.vcf` is created, and you can see it under the **Files** tab in the analysis editor.
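If you want to sanity-check the filtering logic before running it on real data, the same steps can be reproduced on a tiny in-memory example. The sketch below is illustrative and not part of the guide: the two-variant VCF body and the layout of the TUMOR sample field (DP first, followed by per-base read counts for A, C, G, and T in positions 5 through 8) are made-up stand-ins that match what `ad_dp_calc` expects.

```python
import io

import pandas as pd

vcf_column_names = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "NORMAL", "TUMOR"]

def ad_dp_calc(x):
    # x holds (REF, ALT, TUMOR); mirrors the guide's AD/DP computation
    alts = x.iloc[1].split(',')
    gt = x.iloc[2].split(':')
    if len(gt) == 1:
        return 0
    dp = float(gt[0])                         # assumed read depth (DP)
    a, c, g, t = (float(v) for v in gt[4:8])  # assumed per-base read counts
    counts = {"A": a, "C": c, "G": g, "T": t}
    ad = sum(counts.get(alt, 0) for alt in alts)
    return ad / (dp + len(alts))

# Hypothetical two-variant VCF body (tab-separated, header lines omitted)
vcf_text = (
    "chr1\t100\t.\tA\tT\t50\tPASS\t.\tFMT\t.\t100:0:0:0:0:0:0:30\n"  # AD/DP = 30/101
    "chr1\t200\t.\tC\tG\t50\tPASS\t.\tFMT\t.\t100:0:0:0:0:0:5:0\n"   # AD/DP = 5/101
)
vcf = pd.read_csv(io.StringIO(vcf_text), header=None,
                  names=vcf_column_names, sep="\t")
vcf["AD_DP"] = vcf[["REF", "ALT", "TUMOR"]].apply(ad_dp_calc, axis=1)
filtered = vcf[vcf["AD_DP"] >= 0.15]
print(filtered[["CHROM", "POS", "ALT", "AD_DP"]])
```

Only the first variant (AD/DP ≈ 0.297) survives the 0.15 threshold; the second (AD/DP ≈ 0.050) is dropped. Note that `ad_dp_calc` returns 0 when the sample field contains no per-base counts, so such variants are always filtered out.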