QUERY DATASETS

Overview

Build queries to locate a subset of data within a dataset for further analysis. Queries use metadata properties and their values to filter datasets for specific entities. The CGC provides the following three query methods:

  • Method one: the Data Browser - Use the Data Browser to query datasets via an intuitive visual interface. The Data Browser's autosuggest and search functionalities facilitate querying while alleviating the need to learn new terminology.
  • Method two: SPARQL - Programmatically issue a SPARQL query to the Seven Bridges public SPARQL endpoint.
  • Method three: the Datasets API - Use the Datasets API to programmatically browse and query datasets using API requests written in JSON. The Datasets API is suitable for queries containing numerous parameters since the query is formatted as a concise dictionary.

Comparing the three query methods

In this section, we query the TCGA dataset for file entities and use metadata properties to narrow the results to files that come from RNA-Seq analyses of cases with a vital status of "alive", a gender of "female", and a diagnosis of "breast cancer". The same query is shown three times below, once for each of the three query methods.

Method one: the Data Browser

Use the Data Browser to construct a query in the CGC visual interface.

The query shown below starts from the Case entity and provides specific values for its metadata properties of Disease type, Gender, and Vital status. For example, Gender has a value of FEMALE. The query below also contains a File entity with specific values for Access level, Experimental strategy, and Data type. The File entity designates that we are looking for files for the cases that match the specified criteria.

Below the query, refreshable count cards reveal that 972 cases and 4,349 files match the query criteria.

Learn more about the Data Browser.


Method two: SPARQL

Programmatically issue a SPARQL query to the Seven Bridges public SPARQL endpoint.

The query below returns the values for variables in the select clause (case_id, file_name, file, file_id, vital_status, and days_to_follow) when they meet the conditions specified in the where clause.

The where clause designates a TCGA case that has a disease type of Breast Invasive Carcinoma, a gender of FEMALE, and a vital status of Alive. The query further filters for cases that have a case_id, days_to_follow, and a file. This file must have a file_name, a file_id, an access level of Open, an experimental strategy of RNA-Seq, and a data type of Gene expression.

Learn more about SPARQL.

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>
 
select distinct ?case_id ?file_name ?file ?file_id ?vital_status ?days_to_follow
where
{
  ?case a tcga:Case .
  ?case rdfs:label ?case_id .
    
  ?case tcga:hasDiseaseType ?dt .
  ?dt rdfs:label 'Breast Invasive Carcinoma' .
 
  ?case tcga:hasGender ?gender.
  ?gender rdfs:label 'FEMALE' .
 
  ?case tcga:hasVitalStatus ?vs .
  ?vs rdfs:label 'Alive' .
    
  ?case tcga:hasDaysToLastFollowUp ?days_to_follow .
 
  ?case tcga:hasFile ?file .
  
  ?file rdfs:label ?file_name .
  ?file tcga:hasFileID ?file_id .
    
  ?file tcga:hasAccessLevel ?ac .
  ?ac rdfs:label 'Open' .
    
  ?file tcga:hasExperimentalStrategy ?es .
  ?es rdfs:label 'RNA-Seq'.
    
  ?file tcga:hasDataType ?dat.
  ?dat rdfs:label 'Gene expression'
}
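As a rough illustration of issuing such a query programmatically, the Python sketch below builds a SPARQL Protocol POST request and flattens a JSON result set into rows. The endpoint URL is a hypothetical placeholder, since this section does not give the address of the Seven Bridges endpoint, and the JSON results format is assumed to be supported. The helper names build_sparql_request and extract_rows are illustrative only, and the query is a shortened version of the one above.

```python
import urllib.parse
import urllib.request

# Hypothetical placeholder: replace with the actual SPARQL endpoint URL.
SPARQL_ENDPOINT = "https://example.org/sparql"

# Shortened form of the query above (disease, vital status, and file filters omitted).
QUERY = """
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>

select distinct ?case_id ?file_name
where
{
  ?case a tcga:Case .
  ?case rdfs:label ?case_id .
  ?case tcga:hasGender ?gender .
  ?gender rdfs:label 'FEMALE' .
  ?case tcga:hasFile ?file .
  ?file rdfs:label ?file_name .
}
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a POST request with a form-encoded 'query' field."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # Assumption: the endpoint can return SPARQL JSON results.
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def extract_rows(results):
    """Flatten a SPARQL JSON result document into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


request = build_sparql_request(SPARQL_ENDPOINT, QUERY)
```

To run the query, pass the request to urllib.request.urlopen, parse the response body with json.load, and hand the parsed document to extract_rows.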

Method three: the Datasets API

Query datasets programmatically via a JSON request using the Datasets API.

The query below searches for members of the files entity with the following metadata properties: an access level of Open, a data type of Gene expression, and an experimental strategy of RNA-Seq. These files are for cases with a disease type of Breast Invasive Carcinoma, a gender of FEMALE, and a vital status of Alive.

Learn more about the Datasets API.

POST /datasets/tcga/v0/query HTTP/1.1
Host: cgc-datasets-api.sbgenomics.com
X-SBG-Auth-Token: 3210a98c1db9304ea9d9273156740f74

{
    "entity": "files",
    "hasAccessLevel" : "Open",
    "hasDataType" : "Gene expression",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasCase": {
        "hasDiseaseType" : "Breast Invasive Carcinoma",
        "hasGender" : "FEMALE",
        "hasVitalStatus" : "Alive"
    }
}
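The same request can be assembled from a script. The Python sketch below is a minimal illustration, not an official client: it builds the URL from the Host header and path shown above (the https scheme is an assumption), adds a Content-Type header that the request above does not show, and uses a placeholder in place of a real authentication token. The helper name build_query_request is hypothetical.

```python
import json
import urllib.request

# Assembled from the Host header and request path above; the https scheme is assumed.
QUERY_URL = "https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/query"


def build_query_request(auth_token, query):
    """Build (but do not send) a POST request carrying the query dictionary as JSON."""
    body = json.dumps(query).encode("utf-8")
    headers = {
        "X-SBG-Auth-Token": auth_token,
        # Assumption: not shown in the request above, but conventional for JSON bodies.
        "Content-Type": "application/json",
    }
    return urllib.request.Request(QUERY_URL, data=body, headers=headers, method="POST")


# The query from the request above, expressed as a Python dictionary.
query = {
    "entity": "files",
    "hasAccessLevel": "Open",
    "hasDataType": "Gene expression",
    "hasExperimentalStrategy": "RNA-Seq",
    "hasCase": {
        "hasDiseaseType": "Breast Invasive Carcinoma",
        "hasGender": "FEMALE",
        "hasVitalStatus": "Alive",
    },
}

# "YOUR_AUTH_TOKEN" is a placeholder; substitute your own token.
request = build_query_request("YOUR_AUTH_TOKEN", query)
```

Because the query is an ordinary dictionary, adding or removing a filter is a matter of editing a key; the nested hasCase dictionary holds the case-level criteria. To send the request, pass it to urllib.request.urlopen and parse the response with json.load.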

Resources