{"_id":"584592a09f79b51900b0768c","__v":0,"project":"55faf11ba62ba1170021a9a7","user":"5613e4f8fdd08f2b00437620","category":{"_id":"58458b4fba4f1c0f009692bb","project":"55faf11ba62ba1170021a9a7","version":"55faf11ba62ba1170021a9aa","__v":0,"sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-12-05T15:44:15.650Z","from_sync":false,"order":6,"slug":"datasets-hub","title":"DATASETS HUB"},"parentDoc":null,"version":{"_id":"55faf11ba62ba1170021a9aa","project":"55faf11ba62ba1170021a9a7","__v":37,"createdAt":"2015-09-17T16:58:03.490Z","releaseDate":"2015-09-17T16:58:03.490Z","categories":["55faf11ca62ba1170021a9ab","55faf8f4d0e22017005b8272","55faf91aa62ba1170021a9b5","55faf929a8a7770d00c2c0bd","55faf932a8a7770d00c2c0bf","55faf94b17b9d00d00969f47","55faf958d0e22017005b8274","55faf95fa8a7770d00c2c0c0","55faf96917b9d00d00969f48","55faf970a8a7770d00c2c0c1","55faf98c825d5f19001fa3a6","55faf99aa62ba1170021a9b8","55faf99fa62ba1170021a9b9","55faf9aa17b9d00d00969f49","55faf9b6a8a7770d00c2c0c3","55faf9bda62ba1170021a9ba","5604570090ee490d00440551","5637e8b2fbe1c50d008cb078","5649bb624fa1460d00780add","5671974d1b6b730d008b4823","5671979d60c8e70d006c9760","568e8eef70ca1f0d0035808e","56d0a2081ecc471500f1795e","56d4a0adde40c70b00823ea3","56d96b03dd90610b00270849","56fbb83d8f21c817002af880","573c811bee2b3b2200422be1","576bc92afb62dd20001cda85","5771811e27a5c20e00030dcd","5785191af3a10c0e009b75b0","57bdf84d5d48411900cd8dc0","57ff5c5dc135231700aed806","5804caf792398f0f00e77521","58458b4fba4f1c0f009692bb","586d3c287c6b5b2300c05055","58ef66d88646742f009a0216","58f5d52d7891630f00fe4e77"],"is_deprecated":false,"is_hidden":false,"is_beta":true,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-12-05T16:15:28.233Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"settings":"","results":{"codes":[]},"auth":"required","params":[],"url":""},"isReferenc
e":false,"order":19,"body":"[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"On this page:\",\n  \"body\": \"* [Overview](#section-overview)\\n* [Comparing the three query methods](#section-comparing-the-three-query-methods)\\n   * [Method one: the Data Browser](#section-method-one-the-data-browser)\\n   * [Method two: SPARQL](#section-method-two-sparql)\\n   * [Method three: the Datasets API](#section-method-three-the-datasets-api)\\n* [Resources](#section-resources)\"\n}\n[/block]\n##Overview\n\nBuild queries to locate a subset of data within a dataset for further analysis. Queries use metadata properties and their values to filter datasets for specific entities. The CGC provides the following three query methods:\n\n  * [Method one: the Data Browser](#section-method-one-the-data-browser) - Use the Data Browser to query datasets via an intuitive visual interface. The Data Browser's autosuggest and search features make querying easier and reduce the need to learn new terminology.\n  * [Method two: SPARQL](#section-method-two-sparql) - Use the Seven Bridges public SPARQL endpoint to programmatically browse and query datasets. SPARQL queries are particularly efficient for returning large volumes of results. However, this method requires familiarity with the query language SPARQL as well as each dataset's metadata ontology.\n  * [Method three: the Datasets API](#section-method-three-the-datasets-api) - Use the Datasets API to programmatically browse and query datasets using API requests written in JSON. The Datasets API is well suited to queries with many parameters because the query is formatted as a concise dictionary. However, this method is less efficient at returning large volumes of results than SPARQL. 
\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##Comparing the three query methods\nIn this section, we query the TCGA dataset for file entities and use metadata properties to filter the results to files that come from RNA-Seq analyses of cases that have the vital status \"alive,\" a gender of \"female,\" and a diagnosis of \"breast cancer\". This query is reproduced three times below, once in each of the three query methods.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###Method one: the Data Browser\n\nUse the Data Browser to construct a query in the CGC visual interface.\n\nThe query shown below starts from the **Case** entity and provides specific values for its metadata properties of **Disease type**, **Gender**, and **Vital status**. For example, **Gender** has a value of **FEMALE**. The query below also contains a **File** entity with specific values for **Access level**, **Experimental strategy**, and **Data type**. The **File** entity designates that we are looking for files for the cases that match the specified criteria.\n\nBelow the query, refreshable count cards reveal that 972 cases and 4,349 files match the query criteria.\n\nLearn more about the [Data Browser](doc:about-the-data-browser).\n[block:image]\n{\n  \"images\": [\n    {\n      \"image\": [\n        \"https://files.readme.io/56e1904-Screen_Shot_2016-11-15_at_4.06.54_PM.png\",\n        \"Screen Shot 2016-11-15 at 4.06.54 PM.png\",\n        1034,\n        705,\n        \"#0e4d8d\"\n      ]\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###Method two: SPARQL\n\nProgrammatically issue a SPARQL query to the Seven Bridges public SPARQL endpoint.\n\nThe query below returns the values for variables in the select clause (`case_id`, `file_name`, `file`, `file_id`, `vital_status`, and `days_to_follow`) when they meet the conditions specified in the where clause.\n\nThe where clause designates a TCGA case which has a disease type of `Breast 
Invasive Carcinoma`, a gender of `FEMALE`, and a vital status of `Alive`. The query further filters for cases which have a `case_id`, `days_to_follow`, and a `file`. This `file` must have a `file_name`, a `file_id`, an access level of `Open`, an experimental strategy of `RNA-Seq`, and a data type of `Gene expression`.\n\nLearn more about [SPARQL](doc:about-sparql).\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\\nprefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>\\n \\nselect distinct ?case_id ?file_name ?file ?file_id ?vital_status ?days_to_follow\\nwhere\\n{\\n  ?case a tcga:Case .\\n  ?case rdfs:label ?case_id .\\n    \\n  ?case tcga:hasDiseaseType ?dt .\\n  ?dt rdfs:label 'Breast Invasive Carcinoma' .\\n \\n  ?case tcga:hasGender ?gender.\\n  ?gender rdfs:label 'FEMALE' .\\n \\n  ?case tcga:hasVitalStatus ?vs .\\n  ?vs rdfs:label 'Alive' .\\n    \\n  ?case tcga:hasDaysToLastFollowUp ?days_to_follow .\\n \\n  ?case tcga:hasFile ?file .\\n  \\n  ?file rdfs:label ?file_name .\\n  ?file tcga:hasFileID ?file_id .\\n    \\n  ?file tcga:hasAccessLevel ?ac .\\n  ?ac rdfs:label 'Open' .\\n    \\n  ?file tcga:hasExperimentalStrategy ?es .\\n  ?es rdfs:label 'RNA-Seq'.\\n    \\n  ?file tcga:hasDataType ?dat.\\n  ?dat rdfs:label 'Gene expression'\\n}\",\n      \"language\": \"text\",\n      \"name\": \"Sample SPARQL query\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###Method three: the Datasets API\n\nQuery datasets programmatically via a JSON request using the Datasets API.\n\nThe query below searches for members of the `files` entity with the following metadata properties: an access level of `Open`, a data type of `Gene expression`, and an experimental strategy of `RNA-Seq`. 
These `files` are for `cases` with a disease type of `Breast Invasive Carcinoma`, a gender of `FEMALE`, and a vital status of `Alive`.\n\nLearn more about the [Datasets API](doc:about-the-datasets-api).\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"POST datasets/tcga/v0/query HTTP/1.1\\nHost: cgc-datasets-api.sbgenomics.com\\nX-SBG-Auth-Token: 7942f56901534434a054dafc3813bc96\",\n      \"language\": \"http\",\n      \"name\": \"Sample Datasets API request\"\n    }\n  ]\n}\n[/block]\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"{\\n    \\\"entity\\\": \\\"files\\\",\\n    \\\"hasAccessLevel\\\" : \\\"Open\\\",\\n    \\\"hasDataType\\\" : \\\"Gene expression\\\",\\n    \\\"hasExperimentalStrategy\\\": \\\"RNA-Seq\\\",\\n    \\\"hasCase\\\": {\\n        \\\"hasDiseaseType\\\" : \\\"Breast Invasive Carcinoma\\\",\\n        \\\"hasGender\\\" : \\\"FEMALE\\\",\\n        \\\"hasVitalStatus\\\" : \\\"Alive\\\"\\n    }\\n}\",\n      \"language\": \"json\",\n      \"name\": \"Request body\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##Resources\n* [About the Data Browser](doc:about-the-data-browser) \n* [About SPARQL](doc:about-sparql) \n* [About the Datasets API](doc:about-the-datasets-api) \n* [About metadata for datasets](doc:about-metadata-for-datasets) \n\n<div align=\"right\"><a href=\"#top\">top</a></div>","excerpt":"","slug":"query-datasets","type":"basic","title":"QUERY DATASETS"}
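As a rough illustration of Method three, the sample Datasets API request and its request body could be assembled in Python using only the standard library. This is a sketch under stated assumptions, not an official client: the names `QUERY_URL`, `build_query_body`, and `build_query_request` are hypothetical, the `https` scheme is assumed (the sample shows only the `Host` header and path), and the `Content-Type` header is an assumption that does not appear in the sample request.

```python
import json
import urllib.request

# Host and path come from the sample request on this page.
# Assumption: the service is reached over https.
QUERY_URL = "https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/query"


def build_query_body():
    """Return the sample request body as a Python dictionary."""
    return {
        "entity": "files",
        "hasAccessLevel": "Open",
        "hasDataType": "Gene expression",
        "hasExperimentalStrategy": "RNA-Seq",
        # Nested filter: properties of the case each file belongs to.
        "hasCase": {
            "hasDiseaseType": "Breast Invasive Carcinoma",
            "hasGender": "FEMALE",
            "hasVitalStatus": "Alive",
        },
    }


def build_query_request(auth_token):
    """Build, but do not send, the POST request for the sample query."""
    payload = json.dumps(build_query_body()).encode("utf-8")
    return urllib.request.Request(
        QUERY_URL,
        data=payload,
        method="POST",
        headers={
            "X-SBG-Auth-Token": auth_token,
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
    )
```

To send the query, the returned object could be passed to `urllib.request.urlopen`. The response format is not shown on this page, so the sketch stops at building the request.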