ABOUT DATASETS

Overview

The CGC hosts large, multi-omic datasets along with the tools to query, filter, and browse them. The resulting data can be added to your projects and analyzed alongside your private data to address your research questions.

Consistent with terminology used by the Genomic Data Commons (GDC), datasets on the CGC are divided into two categories: "harmonized" and "legacy". In 2016, the GDC started hosting and distributing previously generated data from The Cancer Genome Atlas (TCGA). Additionally, for all submitted sequence data (FASTQs and BAM alignment files), the GDC generated new alignments (BAM files) to the latest human reference genome, GRCh38, using standard workflows. Using these alignments, the GDC generated derived data, including normal and tumor variant and mutation calls, gene and miRNA expression profiles, and splice junction quantification data. The GDC refers to this process of data generation through standard workflows as data harmonization.

Datasets on the CGC that are aligned to GRCh38 or that use a data model similar to that of the GDC's GRCh38 datasets are labeled "harmonized". Datasets on the CGC that are not aligned to GRCh38 or that use a different data model are labeled "legacy". "Legacy" datasets remain fully supported.

Below, learn more about datasets on the CGC and access resources describing each dataset's data and metadata.

The Cancer Genome Atlas (TCGA)

TCGA is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. Data collection for TCGA began in 2006 as a joint effort by the National Cancer Institute (NCI), National Human Genome Research Institute (NHGRI), the National Institutes of Health (NIH), and the U.S. Department of Health and Human Services. The CGC provides powerful methods to query and reproducibly analyze TCGA data by itself or in conjunction with your own data.

TCGA data on the CGC includes both Open and Controlled Data. While all TCGA data is stripped of direct identifiers, DNA information is inherently unique to an individual. Two types of data access ‘tiers’ (Open Data and Controlled Data) have been put in place to balance the desire to make TCGA data as widely available as possible while ensuring that the rights of study participants are well protected. You can access TCGA Open Data on the CGC as soon as you sign up and agree to the TCGA data use and publication policies. In addition, you can obtain access to Controlled Data through the NIH via the Database of Genotypes and Phenotypes (dbGaP) site.

There are two iterations of the TCGA dataset on the CGC:

  • TCGA: the "legacy" version of the dataset
  • TCGA GRCh38: the "harmonized" version of the dataset

Learn more about their differences below.

TCGA

TCGA is a "legacy" dataset that contains TCGA data from the original genome build produced by CGHub. This dataset was imported before the GDC completed its harmonized data model. In addition, the CGC hosts the harmonized version of TCGA, TCGA GRCh38, as discussed below.

The TCGA dataset is termed "legacy" in accordance with the GDC labeling convention because its sequence data was not aligned to GRCh38. Note that the metadata fields available for the legacy TCGA dataset are different from those available for TCGA GRCh38.

TCGA Resources

TCGA GRCh38

TCGA GRCh38 is a "harmonized" dataset that contains BAM files derived from TCGA FASTQs that have been re-aligned to GRCh38. Note that TCGA GRCh38 does not contain the FASTQs themselves. We've aligned our TCGA data to the GDC's harmonized data model so users can access the same data using similar search terms.

The CGC also hosts a non-harmonized ("legacy") version of TCGA, named TCGA, as discussed above. Note that the metadata fields available for TCGA GRCh38 are different from those available for the TCGA dataset available on the CGC.

TCGA GRCh38 Resources

Clinical Proteomic Tumor Analysis Consortium (CPTAC)

The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a comprehensive and coordinated effort to accelerate understanding of the molecular basis of cancer through the application of robust, quantitative, proteomic technologies and workflows.

The CPTAC analyzes cancer biospecimens from genomics initiatives such as The Cancer Genome Atlas (TCGA) by mass spectrometry to characterize and quantify their constituent proteins or “proteome”. These mass spectrometry data are present in four different file formats including raw mass spectrometry spectra in vendor-specific file formats and processed peptide spectrum match (PSM) data.

CPTAC Resources

The Cancer Imaging Archive (TCIA) dataset

The Cancer Imaging Archive (TCIA) contains radiological imaging data generated as part of The Cancer Genome Atlas (TCGA) with the aim of connecting cancer phenotypes to genotypes by providing matched clinical imaging and genomic analysis data.

TCIA includes Open Access radiological images that represent 21 types of cancer detailed in TCGA. These images are stored in a standard DICOM format.

TCIA Resources

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

The Therapeutically Applicable Research to Generate Effective Treatments (TARGET) dataset was generated by a diverse consortium of investigators to facilitate the discovery of genetic changes that underlie the development and progression of childhood cancers for which there currently are limited treatment options. The initiative is jointly managed within the National Cancer Institute (NCI) by the Office of Cancer Genomics (OCG) and the Cancer Therapy Evaluation Program (CTEP) and builds upon the resources of several other high-profile NCI cancer genomics programs, including The Cancer Genome Atlas (TCGA), the Cancer Genome Characterization Initiative (CGCI), and the Strategic Partnership to Evaluate Cancer Signatures (SPECS).

TARGET GRCh38 Resources

ICGC dataset

The International Cancer Genome Consortium (ICGC) coordinates a global network of research groups that aims to generate and publicly release comprehensive catalogues of genomic, transcriptomic, and epigenomic information across 50 different cancer types and/or subtypes of clinical and societal importance.

ICGC data is available through several distributed repositories. Through the CGC, authorized users can access all data hosted in ICGC's AWS-Virginia repository, which includes whole genome sequencing and RNA sequencing data generated as part of the PanCancer Analysis of Whole Genomes (PCAWG) Study and analyzed using a common set of alignment and variant calling workflows.

Note that all ICGC data is Controlled Data.

ICGC Resources

Cancer Cell Line Encyclopedia (CCLE)

The Cancer Cell Line Encyclopedia (CCLE) is a project performing detailed genetic and pharmacologic characterization of a large number of immortalized human cancer cell lines. The CCLE is the result of a collaboration between the Broad Institute, the Novartis Institutes for Biomedical Research, and the Genomics Institute of the Novartis Research Foundation.

CCLE is referred to as a "legacy" dataset on the CGC in accordance with the GDC labeling convention for datasets not aligned to GRCh38. It contains Open Access sequencing data (in the form of reads aligned to the hg19 reference genome) for nearly 1000 cancer cell line samples. The data was obtained from CGHub on May 11, 2016.

CCLE Resources

Simons Genome Diversity Project (SGDP) dataset

The Simons Genome Diversity Project (SGDP) dataset contains complete genome sequences from more than one hundred diverse human populations. It is the largest dataset of diverse, high-quality human genome sequences ever reported. To represent as much anthropological, linguistic, and cultural diversity as possible, the dataset includes many deeply divergent human populations that are not well represented in other datasets.

SGDP is available on the CGC as a read-only public project that contains Open Access whole genome sequencing data for 279 samples. Note that SGDP data is available for use in your analyses but is not currently accessible via the Data Browser.

SGDP Resources

Personal Genome Project UK (PGP-UK) pilot dataset

The Personal Genome Project UK (PGP-UK) was founded in 2013 by Professor Stephan Beck at University College London (UCL) and has been established as a research project to provide open access multi-omic data to advance and accelerate personalised genomics and medicine. The PGP-UK is recruiting and sequencing healthy participants from the UK using an open-consent recruitment protocol.

For the PGP-UK pilot study, ten participants and three Genome Donors are actively engaged as citizen scientists with the project. UCL and Seven Bridges are collaborating to make the initial (pilot) set of PGP-UK data available to academic researchers.

The PGP-UK pilot dataset is available on the CGC as a read-only public project that contains Open Access multi-omics profiling data for thirteen participants. The participants have been profiled using whole genome sequencing (WGS) of DNA from whole blood; whole genome bisulphite sequencing (WGBS) of DNA methylation from whole blood; deep and shallow sequencing of RNA from whole blood using RNA-seq; and DNA methylation array profiling of both whole blood and saliva using the HumanMethylation450 BeadChip from Illumina. Note that PGP-UK data is available for use in your analyses but is not currently accessible via the Data Browser.

Licensing information

All of the data has been made available by the PGP-UK under the CC-0 license or an equivalent public domain license and can also be downloaded directly from the PGP-UK Data Portal. The data available in the PGP-UK public project on the CGC was downloaded from the PGP-UK Data Portal on 18 March 2018.

PGP-UK Resources

GDC Datasets Update Policy

Seven Bridges is committed to providing CGC users with up-to-date versions of the datasets that are available from the NCI Genomic Data Commons (GDC). Therefore, we have a clearly formulated set of rules that apply to updates of GDC datasets that are available through the CGC:

  • We aim to update the data on the CGC, aligning the datasets available through the CGC with the current GDC data release, within 30 days of the release by the GDC.
  • If a GDC data release includes redaction of files from a dataset, the affected files will be available on the CGC for an additional 30 days. After that, you will need to contact the GDC for information on how to retain access to redacted files.
  • Re-running queries executed in the past may return slightly different results due to updates in the datasets from the GDC. This is expected: datasets are dynamic, version updates can introduce file updates or redactions, and queries return the most up-to-date version of files. This applies both to queries made through the Data Browser and to queries made through the Datasets API.
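
Because a re-run query can return a different set of files, it can help to keep the file identifiers from an earlier run and compare them with the new results. The following is a minimal sketch of that comparison in Python. The function name diff_query_results and the file IDs are hypothetical and used only for illustration; the sketch assumes you have already saved the identifiers returned by each run as plain lists of strings.

```python
def diff_query_results(previous_ids, current_ids):
    """Compare file IDs returned by the same query at two points in time."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        # Files that appear only in the re-run (for example, newly released files).
        "added": sorted(current - previous),
        # Files that appear only in the earlier run (for example, redacted files).
        "removed": sorted(previous - current),
        # Files returned by both runs.
        "unchanged": sorted(previous & current),
    }


# Hypothetical file IDs saved from an earlier run and from a re-run.
earlier = ["file-001", "file-002", "file-003"]
rerun = ["file-002", "file-003", "file-004"]

changes = diff_query_results(earlier, rerun)
print("Added:", changes["added"])
print("Removed:", changes["removed"])
```

Note that the comparison only detects files that were added or removed between runs. A file that was updated in place under the same identifier would be reported as unchanged.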

Get started

  1. Start from a broad overview of any dataset available on the CGC via the visual interface.
  2. Refine your results with a query issued on the visual interface or programmatically.
  3. Access data for further analysis in your CGC project.
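
As a rough illustration of what issuing a query programmatically (step 2) might look like, the sketch below builds a JSON query and prepares an HTTP POST request in Python without sending it. Everything specific here is an assumption made for illustration and is not taken from this page: the URL is a placeholder, and the header name, the "entity" key, and the "disease_type" filter are hypothetical. Consult the Datasets API documentation for the actual endpoint, authentication header, and field names.

```python
import json
import urllib.request

# Placeholder endpoint: this is NOT the real Datasets API URL.
QUERY_URL = "https://example.invalid/datasets/query"


def build_query(entity, filters):
    """Combine an entity type and filter criteria into one query payload.

    The "entity" key and the filter names are hypothetical.
    """
    payload = dict(filters)
    payload["entity"] = entity
    return payload


def build_request(payload, token, url=QUERY_URL):
    """Prepare (but do not send) a POST request carrying the JSON payload."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Hypothetical header name for the authentication token.
            "X-Auth-Token": token,
        },
        method="POST",
    )


# Hypothetical query: cases filtered by a disease type.
payload = build_query("cases", {"disease_type": "Breast Invasive Carcinoma"})
request = build_request(payload, token="YOUR-AUTH-TOKEN")

# Sending is left out on purpose; with a real endpoint and token it would be:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
print(request.get_method(), request.full_url)
print(json.dumps(payload, indent=2))
```

Keeping query construction separate from sending makes it easy to save the exact payload alongside your results, which helps when you later re-run the same query against an updated dataset.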