{"_id":"58458c4c29c0970f00e844a8","category":{"_id":"58458b4fba4f1c0f009692bb","project":"55faf11ba62ba1170021a9a7","version":"55faf11ba62ba1170021a9aa","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-12-05T15:44:15.650Z","from_sync":false,"order":6,"slug":"datasets-hub","title":"DATASETS HUB"},"project":"55faf11ba62ba1170021a9a7","githubsync":"","parentDoc":null,"__v":5,"user":"5613e4f8fdd08f2b00437620","version":{"_id":"55faf11ba62ba1170021a9aa","project":"55faf11ba62ba1170021a9a7","__v":40,"createdAt":"2015-09-17T16:58:03.490Z","releaseDate":"2015-09-17T16:58:03.490Z","categories":["55faf11ca62ba1170021a9ab","55faf8f4d0e22017005b8272","55faf91aa62ba1170021a9b5","55faf929a8a7770d00c2c0bd","55faf932a8a7770d00c2c0bf","55faf94b17b9d00d00969f47","55faf958d0e22017005b8274","55faf95fa8a7770d00c2c0c0","55faf96917b9d00d00969f48","55faf970a8a7770d00c2c0c1","55faf98c825d5f19001fa3a6","55faf99aa62ba1170021a9b8","55faf99fa62ba1170021a9b9","55faf9aa17b9d00d00969f49","55faf9b6a8a7770d00c2c0c3","55faf9bda62ba1170021a9ba","5604570090ee490d00440551","5637e8b2fbe1c50d008cb078","5649bb624fa1460d00780add","5671974d1b6b730d008b4823","5671979d60c8e70d006c9760","568e8eef70ca1f0d0035808e","56d0a2081ecc471500f1795e","56d4a0adde40c70b00823ea3","56d96b03dd90610b00270849","56fbb83d8f21c817002af880","573c811bee2b3b2200422be1","576bc92afb62dd20001cda85","5771811e27a5c20e00030dcd","5785191af3a10c0e009b75b0","57bdf84d5d48411900cd8dc0","57ff5c5dc135231700aed806","5804caf792398f0f00e77521","58458b4fba4f1c0f009692bb","586d3c287c6b5b2300c05055","58ef66d88646742f009a0216","58f5d52d7891630f00fe4e77","59a555bccdbd85001bfb1442","5a2a81f688574d001e9934f5","5b080c8d7833b20003ddbb6f"],"is_deprecated":false,"is_hidden":false,"is_beta":true,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"updates":["5888bf6752d5b70f004e33fb","5a398eb7467a790034961bec","5a4642b03f866700300d97b3","5a6f83bd9b29600012a75988","5a92daa420cacd00127d563c"],"next":{"pages":[],"desc
ription":""},"createdAt":"2016-12-05T15:48:28.014Z","link_external":false,"link_url":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":0,"body":"<a name=\"overview\"></a>\n##Overview\n\nThe CGC hosts large, multi-omic datasets along with the tools to query, filter, and browse them. The resulting data can be added to your projects and analyzed alongside your private data to address your research questions.\n\nConsistent with terminology used by the Genomic Data Commons (GDC), datasets on the CGC are divided into two categories: \"harmonized\" and \"legacy\". In 2016, the GDC started hosting and distributing previously generated data from The Cancer Genome Atlas (TCGA). Additionally, for all submitted sequence data (FASTQs and BAM alignment files), the GDC generated new alignments (BAM files) to the latest human reference genome, GRCh38, using standard workflows. Using these alignments, the GDC generated derived data, including normal and tumor variant and mutation calls, gene and miRNA expression profiles, and splice junction quantification data. The GDC refers to this process of data generation through standard workflows as data harmonization.\nDatasets on the CGC that are aligned to GRCh38 or that use a similar data model as the GRCh38 datasets from the GDC are labeled \"harmonized\". Datasets on the CGC that are not aligned to GRCh38 or that use a different data model are labeled \"legacy\". \"Legacy\" datasets remain fully supported. \n\nBelow, learn more about datasets on the CGC and access resources describing each dataset's data and metadata.\n\n<a name=\"tcga\"></a>\n##The Cancer Genome Atlas (TCGA)\n\nTCGA is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. 
Data collection for TCGA began in 2006 as a joint effort by the <a href=\"https://www.cancer.gov/\" target=\"blank\">National Cancer Institute (NCI)</a>, <a href=\"https://www.genome.gov/\" target=\"blank\">National Human Genome Research Institute (NHGRI)</a>, <a href=\"https://www.nih.gov/\" target=\"blank\">the National Institutes of Health (NIH)</a>, and the <a href=\"http://www.hhs.gov/\" target=\"blank\">U.S. Department of Health and Human Services</a>. The CGC provides powerful methods to query and reproducibly analyze TCGA data by itself or in conjunction with your own data.\n\nTCGA data on the CGC includes both <a href=\"https://wiki.nci.nih.gov/display/TCGA/Open+Access+and+Controlled+Access+Data\" target=\"blank\">Open and Controlled Data</a>. While all TCGA data is stripped of direct identifiers, DNA information is inherently unique to an individual. Two data access ‘tiers’ (Open Data and Controlled Data) balance the goal of making TCGA data as widely available as possible with the need to protect the rights of study participants. You can access TCGA Open Data on the CGC as soon as you sign up and agree to the TCGA data use and publication policies. In addition, you can obtain access to Controlled Data through the NIH via the <a href=\"https://www.ncbi.nlm.nih.gov/gap\" target=\"blank\">Database of Genotypes and Phenotypes (dbGaP) site</a>.\n\nThere are two iterations of the TCGA dataset on the CGC:\n  * [TCGA](#section-tcga): the \"legacy\" version of the dataset\n  * [TCGA GRCh38](#section-tcga-grch38): the \"harmonized\" version of the dataset\n\nLearn more about their differences below.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###TCGA\n\nTCGA is a \"legacy\" dataset that contains TCGA data from the original genome build produced by CGHub. This dataset was imported before the GDC completed its harmonized data model. 
In addition, the CGC hosts the harmonized version of TCGA, [TCGA GRCh38](#section-tcga-grch38), as discussed below.\n\nThe TCGA dataset is termed \"legacy\" in accordance with the GDC labeling convention because its sequence data was not aligned to GRCh38. Note that the metadata fields available for the legacy TCGA dataset are different from those available for [TCGA GRCh38](#section-tcga-grch38).\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"tcga-resources\"></a>\n###TCGA Resources\n* [Required permissions to access TCGA data](doc:dbgap-controlled-data-access)\n* [TCGA data](doc:tcga-data) \n* [TCGA metadata schema](doc:tcga-metadata) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###TCGA GRCh38\n\nTCGA GRCh38 is a \"harmonized\" dataset that contains BAM files derived from TCGA FASTQs that have been re-aligned to GRCh38. Note that TCGA GRCh38 does not contain the FASTQs themselves. We've aligned our TCGA data to the GDC's harmonized data model so users can access the same data using similar search terms.\n \nThe CGC also hosts a non-harmonized (\"legacy\") version of TCGA, named [TCGA](#section-tcga), as discussed above. 
Note that the metadata fields available for TCGA GRCh38 are different from those available for the legacy TCGA dataset on the CGC.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###TCGA GRCh38 Resources\n* [Required permissions to access TCGA data](doc:dbgap-controlled-data-access)\n* [TCGA GRCh38 data](doc:tcga-grch38-data) \n* [TCGA GRCh38 metadata](doc:tcga-grch38-metadata)\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"cptac\"></a>\n##Clinical Proteomic Tumor Analysis Consortium (CPTAC) \n\nThe [Clinical Proteomic Tumor Analysis Consortium (CPTAC)](https://proteomics.cancer.gov/programs/cptac) is a comprehensive and coordinated effort to accelerate understanding of the molecular basis of cancer through the application of robust, quantitative, proteomic technologies and workflows.\n\nThe CPTAC analyzes cancer biospecimens from genomics initiatives such as [The Cancer Genome Atlas (TCGA)](https://cancergenome.nih.gov/) by mass spectrometry to characterize and quantify their constituent proteins, or “proteome”. These mass spectrometry data are provided in four file formats, including raw mass spectrometry spectra in vendor-specific file formats and processed peptide spectrum match (PSM) data.\n\n<a name=\"cptac-resources\"></a>\n###CPTAC Resources\n* [CPTAC data](doc:cptac-data) \n* [CPTAC metadata](doc:cptac-metadata)\n* [CPTAC public project](doc:the-clinical-proteomic-tumor-analysis-consortium-cptac-project)  \n\n<a name=\"tcia\"></a>\n##The Cancer Imaging Archive (TCIA) dataset\n\n[The Cancer Imaging Archive (TCIA)](http://www.cancerimagingarchive.net/) contains radiological imaging data generated as part of [The Cancer Genome Atlas (TCGA)](http://cancergenome.nih.gov/) with the aim of connecting cancer phenotypes to genotypes by providing matched clinical imaging and genomic analysis data. \n\nTCIA includes Open Access radiological images that represent 21 types of cancer detailed in TCGA. 
These images are stored in the standard [DICOM](https://en.wikipedia.org/wiki/DICOM) format.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"tcia-resources\"></a>\n###TCIA Resources\n* [TCIA data](doc:tcia-data)\n* [TCIA metadata](doc:tcia-metadata) \n* [TCIA public project](doc:the-cancer-imaging-archive-tcia-project) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##ICGC dataset\n\nThe [International Cancer Genome Consortium (ICGC)](https://icgc.org/) coordinates a global network of research groups that aims to generate and publicly release comprehensive catalogues of genomic, transcriptomic, and epigenomic information across 50 different cancer types and/or subtypes of clinical and societal importance.\n\nICGC data is available through several distributed repositories. Through the CGC, authorized users can access all data hosted in ICGC's [AWS-Virginia repository](https://dcc.icgc.org/repositories?filters=%7B%22file%22:%7B%22repoName%22:%7B%22is%22:%5B%22AWS%20-%20Virginia%22%5D%7D,%22study%22:%7B%22is%22:%5B%22PCAWG%22%5D%7D%7D%7D&files=%7B%22from%22:1,%22size%22:25%7D), which includes whole genome sequencing and RNA sequencing data generated as part of the [PanCancer Analysis of Whole Genomes (PCAWG) Study](https://dcc.icgc.org/pcawg) and analyzed using a common set of alignment and variant calling workflows. \n\nNote that all ICGC data is [Controlled Data](doc:dbgap-controlled-data-access#section-controlled-data).\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###ICGC Resources\n\n* [ICGC data](doc:icgc-data) \n* [ICGC metadata](doc:icgc-metadata) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"ccle\"></a>\n##Cancer Cell Line Encyclopedia (CCLE)\n\nThe Cancer Cell Line Encyclopedia (CCLE) is a project performing detailed genetic and pharmacologic characterization of a large number of immortalized human cancer cell lines.  
The CCLE is the result of a collaboration between the <a href=\"https://www.broadinstitute.org/\" target=\"blank\">Broad Institute</a>, the <a href=\"https://www.nibr.com/\" target=\"blank\">Novartis Institutes for Biomedical Research</a>, and the <a href=\"https://www.gnf.nibr.com/\" target=\"blank\">Genomics Institute of the Novartis Research Foundation</a>.\n\nCCLE is referred to as a \"legacy\" dataset on the CGC in accordance with the GDC labeling convention for datasets not aligned to GRCh38. It contains Open Access sequencing data (in the form of reads aligned to the hg19 reference genome) for nearly 1000 cancer cell line samples, obtained from CGHub on May 11, 2016.\n\n<a name=\"ccle-resources\"></a>\n###CCLE Resources\n* [CCLE data](doc:ccle-data) \n* [CCLE metadata schema](doc:ccle-metadata) \n* [CCLE public project](doc:ccle) \n\n<a name=\"sgdp\"></a>\n##Simons Genome Diversity Project (SGDP) dataset\n\nThe <a href=\"https://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project-dataset/\" target=\"blank\">Simons Genome Diversity Project (SGDP) dataset</a> contains complete genome sequences from more than one hundred diverse human populations. It is the largest dataset of diverse, high-quality human genome sequences ever reported. To represent as much anthropological, linguistic, and cultural diversity as possible, the dataset includes many deeply divergent human populations that are not well represented in other datasets. \n\nSGDP is available on the CGC as a read-only public project that contains Open Access whole genome sequencing data for 279 samples. 
Note that SGDP data is available for use in your analyses but is not currently accessible via the Data Browser.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"sgdp-resources\"></a>\n###SGDP Resources\n* [SGDP data](doc:sgdp-data) \n* [SGDP public project](doc:simons-genome-diversity-project-sgdp-dataset) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##Personal Genome Project UK (PGP-UK) pilot dataset\n\nThe Personal Genome Project UK (PGP-UK) was founded in 2013 by Professor Stephan Beck at University College London (UCL) and has been established as a research project to provide open access multi-omic data to advance and accelerate personalised genomics and medicine. The PGP-UK is recruiting and sequencing healthy participants from the UK using an open-consent recruitment protocol.\n\nFor the PGP-UK pilot study, ten participants and three Genome Donors are actively engaged as citizen scientists with the project. UCL and Seven Bridges are collaborating to make the initial (pilot) set of PGP-UK data available to academic researchers.\n\nThe PGP-UK pilot dataset is available on the CGC as a read-only [public project](https://cgc.sbgenomics.com/u/sevenbridges/personal-genome-project-uk-pgp-uk/) that contains [Open Access](#section-licensing-information) multi-omics profiling data for thirteen participants. The participants were profiled using whole genome sequencing (WGS) of DNA from whole blood, whole genome bisulphite sequencing (WGBS) of DNA from whole blood, deep and shallow sequencing of RNA from whole blood using RNA-seq, and DNA methylation array profiling of both whole blood and saliva using the HumanMethylation450 BeadChip from Illumina. 
Note that PGP-UK data is available for use in your analyses, but is not currently accessible via the Data Browser.\n\n###Licensing information\nAll of the data have been made available by the PGP-UK under the [CC-0 license](https://creativecommons.org/publicdomain/zero/1.0/) or an equivalent public domain license and can also be downloaded directly from the [PGP-UK Data Portal](https://www.personalgenomes.org.uk/data/). The data available in the PGP-UK [public project](https://cgc.sbgenomics.com/u/sevenbridges/personal-genome-project-uk-pgp-uk/) on the CGC were downloaded from the [PGP-UK Data Portal](https://www.personalgenomes.org.uk/data/) on 18 March 2018.\n\n###PGP-UK Resources\n* [PGP-UK data](doc:pgp-uk-data)\n* [PGP-UK metadata](doc:pgp-uk-metadata)\n* [PGP-UK public project](doc:personal-genome-project-uk-pgp-uk-pilot-dataset)\n\n<a name=\"Get-started\"></a>\n##Get started\n\n1. [Start from a broad overview](browse-datasets) of any dataset available on the CGC via the visual interface.\n2. [Refine your results with a query](query-datasets) issued on the visual interface or programmatically.\n3. [Access data](access-data-from-datasets) for further analysis in your CGC project.","excerpt":"","slug":"about-datasets","type":"basic","title":"ABOUT DATASETS"}