{"_id":"58458c4c29c0970f00e844a8","category":{"_id":"58458b4fba4f1c0f009692bb","project":"55faf11ba62ba1170021a9a7","version":"55faf11ba62ba1170021a9aa","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-12-05T15:44:15.650Z","from_sync":false,"order":6,"slug":"datasets-hub","title":"DATASETS HUB"},"project":"55faf11ba62ba1170021a9a7","parentDoc":null,"__v":1,"user":"5613e4f8fdd08f2b00437620","version":{"_id":"55faf11ba62ba1170021a9aa","project":"55faf11ba62ba1170021a9a7","__v":37,"createdAt":"2015-09-17T16:58:03.490Z","releaseDate":"2015-09-17T16:58:03.490Z","categories":["55faf11ca62ba1170021a9ab","55faf8f4d0e22017005b8272","55faf91aa62ba1170021a9b5","55faf929a8a7770d00c2c0bd","55faf932a8a7770d00c2c0bf","55faf94b17b9d00d00969f47","55faf958d0e22017005b8274","55faf95fa8a7770d00c2c0c0","55faf96917b9d00d00969f48","55faf970a8a7770d00c2c0c1","55faf98c825d5f19001fa3a6","55faf99aa62ba1170021a9b8","55faf99fa62ba1170021a9b9","55faf9aa17b9d00d00969f49","55faf9b6a8a7770d00c2c0c3","55faf9bda62ba1170021a9ba","5604570090ee490d00440551","5637e8b2fbe1c50d008cb078","5649bb624fa1460d00780add","5671974d1b6b730d008b4823","5671979d60c8e70d006c9760","568e8eef70ca1f0d0035808e","56d0a2081ecc471500f1795e","56d4a0adde40c70b00823ea3","56d96b03dd90610b00270849","56fbb83d8f21c817002af880","573c811bee2b3b2200422be1","576bc92afb62dd20001cda85","5771811e27a5c20e00030dcd","5785191af3a10c0e009b75b0","57bdf84d5d48411900cd8dc0","57ff5c5dc135231700aed806","5804caf792398f0f00e77521","58458b4fba4f1c0f009692bb","586d3c287c6b5b2300c05055","58ef66d88646742f009a0216","58f5d52d7891630f00fe4e77"],"is_deprecated":false,"is_hidden":false,"is_beta":true,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"updates":["5888bf6752d5b70f004e33fb"],"next":{"pages":[],"description":""},"createdAt":"2016-12-05T15:48:28.014Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params
":[],"url":""},"isReference":false,"order":0,"body":"[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"On this page:\",\n  \"body\": \"* [Overview](#overview)\\n* [The Cancer Genome Atlas (TCGA)](#tcga)\\n * [TCGA](#section-tcga) \\n * [TCGA Resources](#tcga-resources)\\n * [TCGA GRCh38](#section-tcga-grch38) \\n * [TCGA GRCh38 Resources](#tcga-grch38-resources)\\n* [Cancer Cell Line Encyclopedia (CCLE)](#ccle)\\n * [CCLE Resources](#ccle-resources)\\n* [Simons Genome Diversity Project (SGDP) dataset](#sgdp)\\n * [SGDP Resources](#sgdp-resources)\\n* [Get started](#get-started)\"\n}\n[/block]\n<a name=\"overview\"></a>\n##Overview\n\nThe CGC hosts large genomics datasets along with the tools to query, filter, and browse them. The resulting data can be added to your projects and analyzed with your private data to address your research questions.\n\nThere are two types of datasets on the CGC: \"harmonized\" and \"legacy\". This terminology follows the convention of the NCI Genomic Data Commons (GDC). In 2016, the GDC started hosting and distributing previously generated data from The Cancer Genome Atlas (TCGA). Additionally, for all submitted sequence data (FASTQs and BAM alignment files), the GDC generated new alignments (BAM files) to the latest human reference genome, GRCh38, using standard workflows. Using these alignments, the GDC generated derived data, including normal and tumor variant and mutation calls, gene and miRNA expression, and splice junction quantification data. The GDC refers to this process of data generation through standard workflows as data harmonization.\n\nDatasets on the CGC that are aligned to GRCh38 are labeled \"harmonized\"; those that are not aligned to GRCh38 are labeled \"legacy\". Legacy datasets remain fully supported. 
\n\nBelow, learn more about datasets on the CGC and access resources describing each dataset's data and metadata.\n\n<a name=\"tcga\"></a>\n##The Cancer Genome Atlas (TCGA)\n\nTCGA is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. Data collection for TCGA began in 2006 as a joint effort by the <a href=\"https://www.cancer.gov/\" target=\"blank\">National Cancer Institute (NCI)</a>, <a href=\"https://www.genome.gov/\" target=\"blank\">National Human Genome Research Institute (NHGRI)</a>, <a href=\"https://www.nih.gov/\" target=\"blank\">the National Institutes of Health (NIH)</a>, and the <a href=\"http://www.hhs.gov/\" target=\"blank\">U.S. Department of Health and Human Services</a>. The CGC provides powerful methods to query and reproducibly analyze TCGA data by itself or in conjunction with your own data.\n\nTCGA on the CGC includes both <a href=\"https://wiki.nci.nih.gov/display/TCGA/Open+Access+and+Controlled+Access+Data\" target=\"blank\">Open and Controlled Data</a>. While all data in TCGA is stripped of direct identifiers, DNA information is inherently unique to an individual. Two types of data access ‘tiers’ (Open Data and Controlled Data) have been put in place to balance the desire to make TCGA data as widely available as possible while ensuring that the rights of study participants are well protected. You can access TCGA Open Data on the CGC as soon as you sign up and agree to data use policies. 
In addition, you can obtain access to Controlled Data through the NIH via the <a href=\"https://www.ncbi.nlm.nih.gov/gap\" target=\"blank\">Database of Genotypes and Phenotypes (dbGaP) site</a>.\n\nThere are two iterations of the TCGA dataset on the CGC:\n  * [TCGA](#section-tcga): this is the \"legacy\" version of the dataset\n  * [TCGA GRCh38](#section-tcga-grch38): this is the \"harmonized\" version of the dataset\n\nLearn more about their differences below.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###TCGA\n\nTCGA is a \"legacy\" dataset which contains TCGA data from the original genome build produced by CGHub. This dataset was imported before the GDC completed its harmonized data model. In contrast, the CGC also hosts a harmonized version of TCGA, [TCGA GRCh38](#section-tcga-grch38), as discussed below.\n\nThe TCGA dataset is termed \"legacy\" in accordance with the GDC labeling convention because its data is not aligned to GRCh38. This legacy dataset is fully supported. Note that the metadata fields available for the legacy TCGA dataset are different from those available for [TCGA GRCh38](#section-tcga-grch38).\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"tcga-resources\"></a>\n###TCGA Resources\n* [Required permissions to access TCGA data](tcga-data-access)\n* [TCGA data](doc:tcga-data) \n* [TCGA metadata schema](doc:tcga-metadata) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###TCGA GRCh38\n\nTCGA GRCh38 is a \"harmonized\" dataset which contains BAM files derived from TCGA FASTQs which have been re-aligned to GRCh38. Note that TCGA GRCh38 does not contain the FASTQs themselves. We've harmonized our TCGA data to sync with the GDC's harmonized data model so users can access the same data using similar search terms.\n \nIn contrast, the CGC also hosts a non-harmonized version of TCGA, named [TCGA](#section-tcga), discussed above. 
Note that the metadata fields available for TCGA GRCh38 are different from those available for the legacy TCGA dataset on the CGC.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"tcga-grch38-resources\"></a>\n###TCGA GRCh38 Resources\n* [Required permissions to access TCGA data](tcga-data-access)\n* [TCGA GRCh38 data](doc:tcga-grch38-data) \n* [TCGA GRCh38 metadata](doc:tcga-grch38-metadata)\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n\n<a name=\"ccle\"></a>\n##Cancer Cell Line Encyclopedia (CCLE)\n\nThe Cancer Cell Line Encyclopedia (CCLE) is a project performing detailed genetic and pharmacologic characterization of a large number of human cancer cell lines. Cell lines are permanently established cell cultures derived from patients that will proliferate indefinitely given appropriate fresh medium and space. The CCLE is the result of a collaboration between the <a href=\"https://www.broadinstitute.org/\" target=\"blank\">Broad Institute</a>, the <a href=\"https://www.nibr.com/\" target=\"blank\">Novartis Institutes for Biomedical Research</a>, and the <a href=\"https://www.gnf.nibr.com/\" target=\"blank\">Genomics Institute of the Novartis Research Foundation</a>.\n\nCCLE is a \"legacy\" dataset on the CGC which contains Open Access sequencing data (in the form of reads aligned to the hg19 reference genome) for nearly 1000 cancer cell line samples. The CCLE dataset is termed \"legacy\" in accordance with the GDC labeling convention because its data is not aligned to GRCh38. This legacy dataset is fully supported. The CGC hosts the CCLE dataset in the form of a read-only public project which contains cell line samples as available from CGHub on May 11, 2016. 
You have automatic access to all CCLE data on the CGC.\n\n<a name=\"ccle-resources\"></a>\n###CCLE Resources\n* [CCLE data](doc:ccle-data) \n* [CCLE public project](doc:ccle) \n* [CCLE metadata schema](doc:ccle-metadata) \n\n<a name=\"sgdp\"></a>\n##Simons Genome Diversity Project (SGDP) dataset\n\nThe <a href=\"https://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/\" target=\"blank\">Simons Genome Diversity Project (SGDP) dataset</a> contains complete genome sequences from more than one hundred diverse human populations. It is the largest dataset of diverse, high-quality human genome sequences ever reported. To represent as much anthropological, linguistic, and cultural diversity as possible, the dataset includes many deeply divergent human populations that are not well represented in other datasets. \n\nSGDP is available on the CGC as a read-only public project which contains Open Access whole genome sequencing data for 279 samples. You have automatic access to all SGDP data on the CGC. SGDP data is available for use in your analyses, but it is not currently accessible via the Data Browser.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"sgdp-resources\"></a>\n###SGDP Resources\n* [SGDP data](doc:sgdp-data) \n* [Simons Genome Diversity Project (SGDP) dataset](doc:simons-genome-diversity-project-sgdp-dataset) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"get-started\"></a>\n##Get started\n\n1. [Start from a broad overview](browse-datasets) of any dataset available on the CGC via the visual interface.\n2. [Refine your results with a query](query-datasets) issued on the visual interface or programmatically.\n3. [Access data](access-data-from-datasets) for further analysis in your CGC project.","excerpt":"","slug":"about-datasets","type":"basic","title":"ABOUT DATASETS"}