{"_id":"57bb4eb86436180e006ea447","__v":0,"user":{"_id":"5613e4f8fdd08f2b00437620","username":"","name":"Emile Young"},"initVersion":{"_id":"55faf11ba62ba1170021a9aa","version":"1.0"},"project":"55faf11ba62ba1170021a9a7","createdAt":"2016-08-22T19:12:56.759Z","changelog":[],"body":"<a name=\"top\"></a>\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"On this page:\",\n  \"body\": \"* [Overview](#overview)\\n* [Prerequisites](#prerequisites)\\n* [Method one: the Datasets API](#datasets-api)\\n * [Query using the Datasets API](#query-via-datasets)\\n * [Access TCGA data using the CGC API](#access-tcga)\\n   * [Set up your authentication token](#authentication-token)\\n   * [Initialize the sevenbridges-python library](#initialize)\\n   * [Access TCGA data](#access-data)\\n* [Method two: the SPARQL console](#sparql)\\n * [Query using a SPARQL query](#query-via-sparql)\\n * [Access TCGA data using the CGC API](#access-tcga-2)\\n   * [Access TCGA data](#access-data-2)\\n* [Conclusion](#conclusion)\"\n}\n[/block]\nTCGA is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. Seven Bridges has created a unified metadata ontology from the diverse cancer studies, made this data available, and provided compute infrastructure to facilitate customized analyses on the [Cancer Genomics Cloud (the CGC)](http://www.cancergenomicscloud.org/). The CGC provides powerful methods to query and reproducibly analyze TCGA data - alone or in conjunction with your own data.\n\nWe continue to develop new methods of interacting with data on the CGC, however, we also appreciate that sometimes it is useful to be able to analyze data locally, or in an AWS environment that you have configured yourself. 
While the CGC has undergone thorough testing and is certified as a FISMA-moderate system, if you wish to analyze data in alternative locations, you must take the appropriate steps to ensure your computing environment is secure and compliant with [current best practices](http://www.ncbi.nlm.nih.gov/projects/gap/pdf/dbgap_2b_security_procedures.pdf). If you plan to download large numbers of files for local analysis, we recommend using the download utilities available from the [Genomic Data Commons](https://gdc.nci.nih.gov/) which have been specifically optimized for this purpose.\n\nIn this tutorial, we describe two ways that you can programmatically access TCGA data.\n\n<a name=\"overview\"></a>\n##Overview\nWe will demonstrate how you can use either the [Datasets API](http://docs.cancergenomicscloud.org/docs/datasets-api-overview) or the [SPARQL console](https://opensparql.sbgenomics.com/#/console) to find all open access gene expression files obtained from RNA-Seq analysis of living female Breast Cancer patients.\n\nThe Datasets API and SPARQL endpoint both allow you to query a number of TCGA entities, including:\n\n  * analytes\n  * radiation therapies\n  * drug therapies\n  * follow ups\n  * portions\n  * aliquots\n  * samples\n  * slides\n  * new tumor events\n  * files\n\nAdditionally, a SPARQL query can return metadata fields, which lets you access and manipulate properties like metadata values. This gives you more flexibility with your query. The Datasets API, on the other hand, is well-suited for browsing TCGA data. You can learn more about the TCGA metadata ontology [here](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc)\n\nThis tutorial includes Python snippets. You can simply read the tutorial below. 
Or, to run the code contained in this blog post, see the accompanying Jupyter notebooks for:\n\n  * the [Datasets API](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS_via_DatasetsAPI.ipynb) method\n  * the [SPARQL console](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS.ipynb) method\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"prerequisites\"></a>\n##Prerequisites\n\nBefore you begin this tutorial, you should:\n1. **Set up your CGC account.** If you haven't already done so, navigate to https://cgc.sbgenomics.com/ and follow these [directions](doc:sign-up-for-the-cgc) to register for the CGC. This tutorial uses Open Data, which is available to all CGC users. The same approach can be used by approved researchers to access Controlled Data. Learn more about TCGA data access here.\n2. **Install the Seven Bridges' API Python library.** This tutorial uses the library `sevenbridges-python`. Learn how to [install it](announcing-the-release-of-seven-bridges-api-clients-in-r-and-python) before continuing.\n3. **Obtain your authentication token.** You'll use your authentication token to encode your user credentials when interacting with the CGC programmatically. Learn how to [access your authentication token](http://docs.cancergenomicscloud.org/docs/get-your-authentication-token). It is important to store your authentication token in a safe place as it can be used to access your account. The time and location your token was last used is shown on the developer dashboard. If for any reason you believe your token has been compromised, you can regenerate it at any time.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"datasets-api\"></a>\n##Method one: the Datasets API\n\nIn this section, we'll query TCGA using the Datasets API. Then, we'll access the results of our query using the CGC API. We've formatted this section to contain explanations as well as Python snippets. 
You can always follow along on the [Jupyter notebook](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS_via_DatasetsAPI.ipynb) for this method.\n\nAlternatively, you can query TCGA using a SPARQL query, as demonstrated in [Method Two](#sparql) below. The same query is issued in both methods.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"query-via-datasets\"></a>\n###Query using the Datasets API\nThe Datasets API is an API designed around the TCGA data structure and focused on search functionality. You can use the Datasets API to browse TCGA using API requests written in JSON. Queries made using the Datasets API return entities and are particularly suitable for browsing TCGA data.\n\nWe'll write a Python script to issue our query against TCGA using the Datasets API. Since the Datasets API isn't included in our Python library, `sevenbridges-python`, we will use two Python modules, `json` and `requests`, to interact with it instead. We'll use these modules to write a wrapper around the API request.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"import json\\nfrom requests import request\",\n      \"language\": \"python\",\n      \"name\": \"Import Python modules\"\n    }\n  ]\n}\n[/block]\nBelow, we define a simple function to send and receive JSON from the API using the correctly formatted HTTP calls. 
The necessary imports are handled above.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"def api_call(path, method='GET', query=None, data=None, token=None):\\n     \\n    base_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/'\\n     \\n    data = json.dumps(data) if isinstance(data, dict) \\\\\\n    or isinstance(data,list) else None\\n               \\n    headers = {\\n        'X-SBG-Auth-Token': token,\\n        'Accept': 'application/json',\\n        'Content-type': 'application/json',\\n    }\\n     \\n    response = request(method, base_url + path, params=query, \\\\\\n                       data=data, headers=headers)\\n    response_dict = response.json() if response.json() else {}\\n \\n    # Integer division so that every 2xx status counts as success on Python 2 and 3\\n    if response.status_code // 100 != 2:\\n        print(response_dict)\\n        print('Error Code: %i.' % (response_dict['code']))\\n        print(response_dict['more_info'])\\n        raise Exception('Server responded with status code %s.' \\\\\\n                        % response.status_code)\\n    return response_dict\",\n      \"language\": \"python\",\n      \"name\": \"Define an API call wrapper\"\n    }\n  ]\n}\n[/block]\nThen, provide your authentication token, as shown below.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"auth_token = 'insert your auth token here'\",\n      \"language\": \"python\",\n      \"name\": \"Provide authentication token\"\n    }\n  ]\n}\n[/block]\nNow, we can define a query in JSON for TCGA data based on its [metadata](doc:tcga-metadata-on-the-cgc).\n\nWe want to find **female**, **Breast Cancer** patients (**cases**) who are **alive** and the associated **files** which are **open-access**, provide **Gene expression**, and came from the **experimental strategy** of **RNA-seq**. 
We will assign an exact value to the above properties.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"query_body = {\\n    \\\"entity\\\": \\\"files\\\",\\n    \\\"hasAccessLevel\\\" : \\\"Open\\\",\\n    \\\"hasDataType\\\" : \\\"Gene expression\\\",\\n    \\\"hasExperimentalStrategy\\\": \\\"RNA-Seq\\\",\\n    \\\"hasCase\\\": {\\n        \\\"hasDiseaseType\\\" : \\\"Breast Invasive Carcinoma\\\",\\n        \\\"hasGender\\\" : \\\"FEMALE\\\",\\n        \\\"hasVitalStatus\\\" : \\\"Alive\\\"\\n    }\\n}\",\n      \"language\": \"python\",\n      \"name\": \"Query body\"\n    }\n  ]\n}\n[/block]\nThe call below returns a dictionary containing the total number of records.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"total = api_call(method='POST', path ='query/total',\\n                 token=auth_token, data=query_body)\",\n      \"language\": \"python\",\n      \"name\": \"Query total\"\n    }\n  ]\n}\n[/block]\nNow, let's create an initial list of all records, 100 at a time. In the example below, this list is named `files_in_query`. Use this initial list to catalogue the data returned by the query.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# A __future__ import must be the first statement; it is only needed on Python 2\\nfrom __future__ import division\\nfrom math import ceil\\n \\nfiles_in_query = []\\n \\nloops = int(ceil(total['total']/100))\\n \\nfor ii in range(0,loops):\\n    files_in_query.append(api_call(method='POST',\\n                                   path =(\\\"query?offset=%i\\\" % (100*ii)),\\n                                   token=auth_token, data=query_body))\\n    print(\\\"%3.1f percent of files added\\\" % (100*(ii+1)/loops))\\n     \\n# NOTE: each item in files_in_query is a page of up to 100 files from the query. 
Example below:\\nprint('\\\\n \\\\n')\\nprint(files_in_query[0]['_embedded']['files'][0])\\nprint(files_in_query[1]['_embedded']['files'][0])\",\n      \"language\": \"python\",\n      \"name\": \"Create a list of all records\"\n    }\n  ]\n}\n[/block]\nWe've now successfully compiled a list of file ids!\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"access-tcga\"></a>\n\n###Access TCGA data using the CGC API\n\nIn this section, we will use the CGC API to access TCGA data. Since we are using the CGC API (as opposed to the Datasets API in the previous step), we can use the `sevenbridges-python` binding library to simplify our interaction with the API. You should have already installed this library as described under the **Prerequisites** section. You may also wish to take a look at the [library Quickstart guide](http://sevenbridges-python.readthedocs.io/en/latest/quickstart/#authentication-and-configuration) before moving forward. Before initializing the library, we recommend creating a config file to store your authentication token for use by the CGC API.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"authentication-token\"></a>\n**Set up your authentication token**\n\nSince we're now using the CGC API, we need to provide our authentication credentials. You can authenticate by storing your credentials in a config file, `$HOME/.sbgrc`. 
Enter your credentials in the config file, as shown below, replacing the last line with your authentication token:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"[cgc]\\napi-url = 'https://cgc-api.sbgenomics.com/v2'\\nauth-token = insert auth token here\",\n      \"language\": \"python\",\n      \"name\": \"Store your credentials\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"initialize\"></a>\n**Initialize the `sevenbridges-python` library**\n\nImport the `api` class from the official `sevenbridges-python` bindings and initialize the `api` object so the API knows our credentials.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"access-data\"></a>\n**Access TCGA data**\n\nLoop through the first ten files in the first item of the `files_in_query` list from above using the `id` key.\n\nWe will now do the following with these ids:\n1. Create a list of files on the CGC. From this point, it would be possible to take action on the CGC. For instance, you can [use a bioinformatics workflow or tool on these files and start an analysis](http://docs.cancergenomicscloud.org/docs/datasets-api-overview#section-interact-with-tcga-data).\n2. (optional) Generate a list of access links.\n3. Access each of the ten files in this list. They will be saved to the Downloads folder in your local directory. 
You could also modify this script to specify an alternative location for the files.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# 1) Generate a list of file objects from the first page of query results\\nfile_list = []\\nfor f in files_in_query[0]['_embedded']['files'][0:10]:\\n    file_list.append(api.files.get(id = f['id']))\\n    print(file_list[-1].name) \\n   \\n# (BRANCH-POINT) Do something AWESOME with these files on the CGC\\n \\n \\n# 2) (optional) Generate a list of download links\\ndl_list = []\\nfor f in file_list:\\n    dl_list.append(f.download_info())\\n \\n     \\n# 3) Download each of the files in the list to a downloads folder in your local directory.\\nimport os\\n \\ndl_dir = 'downloads'\\ntry:\\n    os.stat(dl_dir)\\nexcept OSError:\\n    # The folder does not exist yet, so create it\\n    os.mkdir(dl_dir)\\n \\nfor f in file_list:\\n    f.download(path = (\\\"%s/%s\\\" % (dl_dir, f.name)))\",\n      \"language\": \"python\",\n      \"name\": \"Access data\"\n    }\n  ]\n}\n[/block]\nThat's it! You've successfully located TCGA data satisfying your query using the Datasets API and accessed it for further analysis using the CGC API. You can learn more about [querying TCGA with the Datasets API](doc:query-tcga-via-the-datasets-api) on our Knowledge Center.\n\nAlternatively, you can [try Method two](#sparql) to query TCGA data from the SPARQL endpoint.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"sparql\"></a>\n##Method two: the SPARQL console\nIn this section, we'll query TCGA using a SPARQL query. Then, we'll access the results of our query using the CGC API. We've formatted this section to contain explanations as well as Python snippets. You can always follow along on the [Jupyter notebook](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS.ipynb) for this method.\n\nAlternatively, you can query TCGA using the [Datasets API](#datasets-api), as demonstrated in Method One above. 
The same query is issued in either method.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"query-via-sparql\"></a>\n###Query using a SPARQL query\n\nYou can query TCGA data using the query language SPARQL (recursively short for SPARQL Protocol and RDF Query Language). Seven Bridges has made a public SPARQL endpoint to which you can send these queries. In addition to returning entities, the SPARQL query can also return properties such as TCGA metadata fields.\n\nSince we'll use a Python script to query TCGA data with a SPARQL query, we'll need to import four modules. The first two, <a href=\"https://docs.python.org/2/library/urllib.html\" target=\"blank\">urllib</a> and <a href=\"https://rdflib.github.io/sparqlwrapper/\" target=\"blank\">SPARQLWrapper</a>, are used to check the OpenSPARQL endpoint and to construct the SPARQL object.\n\nWe'll also need two other modules: `json` and `requests`. We'll use these to write a wrapper around the CGC API request in the next step.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Needed to query an RDF database of hosted TCGA data\\ntry:\\n    from urllib.request import urlopen  # Python 3\\nexcept ImportError:\\n    from urllib import urlopen  # Python 2\\nimport SPARQLWrapper as spark\",\n      \"language\": \"python\",\n      \"name\": \"Import\"\n    }\n  ]\n}\n[/block]\nThis tutorial relies on a public endpoint. First, ensure the endpoint is currently operational. 
Then, initialize the SPARQL object as below.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Check SPARQL endpoint\\ntry:\\n    rc = urlopen(\\\"https://opensparql.sbgenomics.com\\\").getcode()\\nexcept Exception:\\n    rc = 0\\nif rc != 200:\\n    print(\\n        \\\"\\\"\\\"script relies on sparql endpoint\\n        (https://opensparql.sbgenomics.com/)\\n        which is currently not responding.\\n        Can not continue, exiting.\\\"\\\"\\\")\\n    raise KeyboardInterrupt\\nelse:\\n    print(\\\"Endpoint is operational, we are good to go!\\\")\\n     \\n     \\n# Initialize SPARQL object\\nsparql_endpoint = \\\"https://opensparql.sbgenomics.com/blazegraph/namespace/tcga_metadata_kb/sparql\\\"\\nsparql = spark.SPARQLWrapper(sparql_endpoint)\",\n      \"language\": \"python\",\n      \"name\": \"Check and initialize SPARQL\"\n    }\n  ]\n}\n[/block]\nNow, we can define a query for TCGA data based on its [metadata](doc:tcga-metadata-on-the-cgc).\n\nWe want to search for **female**, **Breast Cancer** patients (**cases**) who are **alive** and the associated **files** which are **open-access**, provide **Gene expression**, and came from an **experimental strategy** of **RNA-seq**. We will assign an exact value to the above properties.\n\nThe query language used consists of [RDF triple patterns](http://docs.cancergenomicscloud.org/v1.0/docs/query-tcga-metadata-programmatically#section-sparql-semantics) containing a subject, predicate, and object. The query below leaves a few objects unspecified, such as `?days_to_follow`. Unlike specified objects such as `'Alive'` in `?vs rdfs:label 'Alive'`, unspecified objects like `?days_to_follow` in `?case tcga:hasDaysToLastFollowUp ?days_to_follow` simply have to exist to be returned by the query. Specified objects, however, must match a specific value. Unspecified objects are thus returned by the query and can subsequently be analyzed in Python. 
For example, we can directly display the distribution of Days to Last Followup (`?days_to_follow`) for all **cases** returned by this query.\n\nWe also include the unspecified object `?id` in `?file tcga:hasFileID ?id` so that this information is returned by the query. We will need it for accessing the file in the next step.\nBelow, we set the query and execute it. The query results are stored in an object named `results`.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Create the query above as a block-string\\nquery = \\\"\\\"\\\"\\n    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\\n    prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>\\n \\n    select distinct ?case_id ?file_name ?file ?id ?vital_status ?days_to_follow\\n    where\\n    {\\n      ?case a tcga:Case .\\n      ?case rdfs:label ?case_id .\\n       \\n      ?case tcga:hasDiseaseType ?dt .\\n      ?dt rdfs:label 'Breast Invasive Carcinoma' .\\n \\n      ?case tcga:hasGender ?gender.\\n      ?gender rdfs:label 'FEMALE' .\\n   \\n      ?case tcga:hasVitalStatus ?vs .\\n      ?vs rdfs:label 'Alive' .\\n       \\n      ?case tcga:hasDaysToLastFollowUp ?days_to_follow .\\n \\n      ?case tcga:hasFile ?file .\\n     \\n      ?file rdfs:label ?file_name .\\n      ?file tcga:hasFileId ?id .\\n       \\n      ?file tcga:hasAccessLevel ?ac .\\n      ?ac rdfs:label 'Open' .\\n       \\n      ?file tcga:hasExperimentalStrategy ?es .\\n      ?es rdfs:label 'RNA-Seq'.\\n       \\n      ?file tcga:hasDataType ?dat.\\n      ?dat rdfs:label 'Gene expression'\\n    }\\n\\\"\\\"\\\"\\n \\nsparql.setQuery(query)              # Define query on the wrapper\\nsparql.setReturnFormat(spark.JSON)  # We want server to return JSON to use\\nresults = sparql.query().convert()  # Convert results to Python object\",\n      \"language\": \"python\",\n      \"name\": \"Query body\"\n    }\n  ]\n}\n[/block]\nNow, we can find properties which are actionable in Python from the `results` object. 
Below, we extract two examples of properties, `UUID` and `Days to last followup`, which we can analyze in Python. Using this option, we can conduct further analysis on the data based on its metadata without downloading the files.\n\nNext, we pull out the properties needed to download the data: the file name and the file id. The query above does not return a file path, so none is extracted here.\n\nFinally, we print out summary stats about the query and list the first 10 results.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Information (potentially actionable) about the query results\\nuuid_list = [result['case_id']['value'] for result in results['results']['bindings']]\\nday_to_follow_list = \\\\\\n[result['days_to_follow']['value'] for result in results['results']['bindings']]\\n \\n# Information for downloading files within the query\\nfile_names = [result['file_name']['value'] for result in results['results']['bindings']]\\nfile_ids = [result['file']['value'].split('/')[-1] for result in results['results']['bindings']]\\n \\n# Print some information about the query results\\nprint(\\\"Query returned %i results, printing the first 10:\\\" % (len(uuid_list)))\\nfor ii in range(0,min(10, len(uuid_list))):\\n    print(\\\"Case UUID %s had %s days to last followup \\\\n\\\" \\\\\\n         % (uuid_list[ii], day_to_follow_list[ii]))\",\n      \"language\": \"python\",\n      \"name\": \"List results\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"access-data-2\"></a>\n###Access queried TCGA data using the CGC API\nSince we are using the CGC API (as opposed to the SPARQL endpoint in the previous step), we can use the `sevenbridges-python` binding library to simplify our interaction with the API (details above).\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"access-tcga-2\"></a>\n**Access TCGA data**\n\nNow, we loop through the first ten entries of 
the `file_ids` list from above.\n\nWe'll do the following with these ids:\n1. Create a list of files on the CGC. From this point, it would be possible to take action on the CGC. For instance, you can [use a bioinformatics workflow or tool on these files and start an analysis](http://docs.cancergenomicscloud.org/docs/datasets-api-overview#section-interact-with-tcga-data).\n2. (optional) Generate a list of access links.\n3. Access each of the ten files in this list. They will be saved to the `downloads` folder in your local directory. You could also modify this script to specify an alternative location for the files.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# 1) Generate a list of file objects from the file_ids list\\nfile_list = []\\nfor f_id in file_ids[0:10]:\\n    file_list.append(api.files.get(id = f_id))\\n    print(file_list[-1].name)  \\n# (BRANCH-POINT) Do something AWESOME with these files on the CGC\\n  \\n  \\n# 2) (optional) Generate a list of download links\\ndl_list = []\\nfor f in file_list:\\n    dl_list.append(f.download_info())\\n  \\n      \\n# 3) Download each of the files in the list to a _downloads_ folder in your local directory.\\nimport os\\n  \\ndl_dir = 'downloads'\\ntry:\\n    os.stat(dl_dir)\\nexcept OSError:\\n    # The folder does not exist yet, so create it\\n    os.mkdir(dl_dir)\\n  \\nfor f in file_list:\\n    f.download(path = (\\\"%s/%s\\\" % (dl_dir, f.name)))\",\n      \"language\": \"python\",\n      \"name\": \"Access data\"\n    }\n  ]\n}\n[/block]\nThat's it! You've now successfully filtered TCGA data using a SPARQL query and accessed the data for further analysis. You can find more [examples of SPARQL queries](doc:sample-sparql-queries) on our Knowledge Center.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n<a name=\"conclusion\"></a>\n##Conclusion\n\nCongratulations! 
You've learned to query TCGA data hosted on AWS using either the Datasets API or a SPARQL query and then access this data using the CGC API.\n\nNow, you have several options. For instance, you can use the file ids you obtained to [interact with the data you've obtained on the CGC](doc:files). Conversely, you can take the data you've accessed and use it for an analytical tool not stored on the CGC. Or, if you used a SPARQL query, you can access metadata parameters (such as `disease`, `days_to_death`, and `radiation_therapy`) which you can use in your own script, e.g. computing a survival analysis in Python.\n\nThe next move is yours: take the data to the analytical tool and environment of your choice.\n <div align=\"right\"><a href=\"#top\">top</a></div>","slug":"programmatically-access-tcga-data-using-the-seven-bridges-cancer-genomics-cloud","title":"Programmatically Access TCGA Data using the Seven Bridges Cancer Genomics Cloud"}
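The conclusion mentions using metadata returned by a SPARQL query in your own script. As a minimal sketch of that idea, the helper below summarizes the `days_to_follow` values from an object shaped like the one returned by `sparql.query().convert()`. The function name `summarize_days` and the sample rows are made up for illustration; real values come from the endpoint and may need different cleaning.

```python
from statistics import median


def summarize_days(results, key='days_to_follow'):
    """Summarize one numeric field from SPARQL JSON results."""
    values = []
    for row in results['results']['bindings']:
        raw = row.get(key, {}).get('value')
        try:
            values.append(float(raw))
        except (TypeError, ValueError):
            continue  # skip missing or non-numeric entries
    if not values:
        return None
    return {
        'count': len(values),
        'min': min(values),
        'median': median(values),
        'max': max(values),
    }


# Made-up rows in the same shape as the `results` object in the tutorial.
sample = {'results': {'bindings': [
    {'days_to_follow': {'value': '120'}},
    {'days_to_follow': {'value': '30'}},
    {'days_to_follow': {'value': 'n/a'}},
    {'days_to_follow': {'value': '300'}},
]}}
summary = summarize_days(sample)
print(summary)
```

Metadata values arrive as strings, so the sketch converts them to numbers and silently drops anything that does not parse. Whether dropping is appropriate depends on your analysis.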

Programmatically Access TCGA Data using the Seven Bridges Cancer Genomics Cloud


<a name="top"></a> [block:callout] { "type": "warning", "title": "On this page:", "body": "* [Overview](#overview)\n* [Prerequisites](#prerequisites)\n* [Method one: the Datasets API](#datasets-api)\n * [Query using the Datasets API](#query-via-datasets)\n * [Access TCGA data using the CGC API](#access-tcga)\n * [Set up your authentication token](#authentication-token)\n * [Initialize the sevenbridges-python library](#initialize)\n * [Access TCGA data](#access-data)\n* [Method two: the SPARQL console](#sparql)\n * [Query using a SPARQL query](#query-via-sparql)\n * [Access TCGA data using the CGC API](#access-tcga-2)\n * [Access TCGA data](#access-data-2)\n* [Conclusion](#conclusion)" } [/block] TCGA is one of the world’s largest cancer genomics data collections, including more than eleven thousand patients, representing 33 cancers, and over half a million total files. Seven Bridges has created a unified metadata ontology from the diverse cancer studies, made this data available, and provided compute infrastructure to facilitate customized analyses on the [Cancer Genomics Cloud (the CGC)](http://www.cancergenomicscloud.org/). The CGC provides powerful methods to query and reproducibly analyze TCGA data - alone or in conjunction with your own data. We continue to develop new methods of interacting with data on the CGC, however, we also appreciate that sometimes it is useful to be able to analyze data locally, or in an AWS environment that you have configured yourself. While the CGC has undergone thorough testing and is certified as a FISMA-moderate system, if you wish to analyze data in alternative locations, you must take the appropriate steps to ensure your computing environment is secure and compliant with [current best practices](http://www.ncbi.nlm.nih.gov/projects/gap/pdf/dbgap_2b_security_procedures.pdf). 
If you plan to download large numbers of files for local analysis, we recommend using the download utilities available from the [Genomic Data Commons](https://gdc.nci.nih.gov/) which have been specifically optimized for this purpose. In this tutorial, we describe two ways that you can programmatically access TCGA data. <a name="overview"></a> ##Overview We will demonstrate how you can use either the [Datasets API](http://docs.cancergenomicscloud.org/docs/datasets-api-overview) or the [SPARQL console](https://opensparql.sbgenomics.com/#/console) to find all open access gene expression files obtained from RNA-Seq analysis of living female Breast Cancer patients. The Datasets API and SPARQL endpoint both allow you to query a number of TCGA entities, including: * analytes * radiation therapies * drug therapies * follow ups * portions * aliquots * samples * slides * new tumor events * files Additionally, a SPARQL query can return metadata fields, which lets you access and manipulate properties like metadata values. This gives you more flexibility with your query. The Datasets API, on the other hand, is well-suited for browsing TCGA data. You can learn more about the TCGA metadata ontology [here](http://docs.cancergenomicscloud.org/docs/tcga-metadata-on-the-cgc) This tutorial includes Python snippets. You can simply read the tutorial below. Or, to run the code contained in this blog post, see the accompanying Jupyter notebooks for: * the [Datasets API](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS_via_DatasetsAPI.ipynb) method * the [SPARQL console](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS.ipynb) method <div align="right"><a href="#top">top</a></div> <a name="prerequisites"></a> ##Prerequisites Before you begin this tutorial, you should: 1. 
**Set up your CGC account.** If you haven't already done so, navigate to https://cgc.sbgenomics.com/ and follow these [directions](doc:sign-up-for-the-cgc) to register for the CGC. This tutorial uses Open Data, which is available to all CGC users. The same approach can be used by approved researchers to access Controlled Data. Learn more about TCGA data access here. 2. **Install the Seven Bridges' API Python library.** This tutorial uses the library `sevenbridges-python`. Learn how to [install it](announcing-the-release-of-seven-bridges-api-clients-in-r-and-python) before continuing. 3. **Obtain your authentication token.** You'll use your authentication token to encode your user credentials when interacting with the CGC programmatically. Learn how to [access your authentication token](http://docs.cancergenomicscloud.org/docs/get-your-authentication-token). It is important to store your authentication token in a safe place as it can be used to access your account. The time and location your token was last used is shown on the developer dashboard. If for any reason you believe your token has been compromised, you can regenerate it at any time. <div align="right"><a href="#top">top</a></div> <a name="datasets-api"></a> ##Method one: the Datasets API In this section, we'll query TCGA using the Datasets API. Then, we'll access the results of our query using the CGC API. We've formatted this section to contain explanations as well as Python snippets. You can always follow along on the [Jupyter notebook](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS_via_DatasetsAPI.ipynb) for this method. Alternatively, you can query TCGA using a SPARQL query, as demonstrated in [Method Two](#sparql) below. The same query is issued both methods. <div align="right"><a href="#top">top</a></div> <a name="query-via-datasets"></a> ###Query using the Datasets API The Datasets API is an API designed around the TCGA data structure and focused on search functionality. 
You can use the Datasets API to browse TCGA using API requests written in JSON. Queries made using the Datasets API return entities and are particularly suitable for browsing TCGA data. We'll write a Python script to issue our query into TCGA using the Datasets API. Since the Datasets API isn't included in our Python library, `sevenbridges-python`, we will use two Python modules, `json` and `requests`, to interact with it instead. We'll use these modules to write a wrapper around the API request. [block:code] { "codes": [ { "code": "import json\nfrom requests import request", "language": "python", "name": "Import Python modules" } ] } [/block] Below, we define a simple function to send and receive JSON from the API using the correctly formatted HTTP calls. The necessary imports are handled above. [block:code] { "codes": [ { "code": "def api_call(path, method='GET', query=None, data=None, token=None):\n \n base_url = 'https://cgc-datasets-api.sbgenomics.com/datasets/tcga/v0/'\n \n data = json.dumps(data) if isinstance(data, dict) \\\n or isinstance(data,list) else None\n \n headers = {\n 'X-SBG-Auth-Token': token,\n 'Accept': 'application/json',\n 'Content-type': 'application/json',\n }\n \n response = request(method, base_url + path, params=query, \\\n data=data, headers=headers)\n response_dict = response.json() if response.json() else {}\n \n if response.status_code / 100 != 2:\n print(response_dict)\n print('Error Code: %i.' % (response_dict['code']))\n print(response_dict['more_info'])\n raise Exception('Server responded with status code %s.' \\\n % response.status_code)\n return response_dict", "language": "python", "name": "Define an API call wrapper" } ] } [/block] Then, provide your authentication token, as shown below. 
[block:code]
{
  "codes": [
    {
      "code": "auth_token = 'insert your auth token here'",
      "language": "python",
      "name": "Provide authentication token"
    }
  ]
}
[/block]
Now, we can define a query in JSON for TCGA data based on its [metadata](doc:tcga-metadata-on-the-cgc). We want to find **female**, **Breast Cancer** patients (**cases**) who are **alive** and the associated **files** which are **open-access**, provide **Gene expression**, and came from the **experimental strategy** of **RNA-Seq**. We will assign an exact value to each of these properties.
[block:code]
{
  "codes": [
    {
      "code": "query_body = {\n    \"entity\": \"files\",\n    \"hasAccessLevel\": \"Open\",\n    \"hasDataType\": \"Gene expression\",\n    \"hasExperimentalStrategy\": \"RNA-Seq\",\n    \"hasCase\": {\n        \"hasDiseaseType\": \"Breast Invasive Carcinoma\",\n        \"hasGender\": \"FEMALE\",\n        \"hasVitalStatus\": \"Alive\"\n    }\n}",
      "language": "python",
      "name": "Query body"
    }
  ]
}
[/block]
The call below returns a dictionary containing the total number of records.
[block:code]
{
  "codes": [
    {
      "code": "total = api_call(method='POST', path='query/total',\n                 token=auth_token, data=query_body)",
      "language": "python",
      "name": "Query total"
    }
  ]
}
[/block]
Now, let's create an initial list of all records, 100 at a time. In the example below, this list is named `files_in_query`. Use this initial list to catalogue the data returned by the query.
[block:code]
{
  "codes": [
    {
      "code": "# The __future__ import must be the first statement (it has no effect on Python 3)\nfrom __future__ import division\nfrom math import ceil\n\nfiles_in_query = []\n\nloops = int(ceil(total['total']/100))\n\nfor ii in range(0,loops):\n    files_in_query.append(api_call(method='POST',\n                                   path=(\"query?offset=%i\" % (100*ii)),\n                                   token=auth_token, data=query_body))\n    print(\"%3.1f percent of files added\" % (100*(ii+1)/loops))\n\n# NOTE: each item in files_in_query is a page of up to 100 files from the query. Example below:\nprint('\\n \\n')\nprint(files_in_query[0]['_embedded']['files'][0])\nprint(files_in_query[1]['_embedded']['files'][0])",
      "language": "python",
      "name": "Create a list of all records"
    }
  ]
}
[/block]
We've now successfully compiled a list of file ids!

<div align="right"><a href="#top">top</a></div>

<a name="access-tcga"></a>
###Access TCGA data using the CGC API
In this section, we will use the CGC API to access TCGA data. Since we are using the CGC API (as opposed to the Datasets API in the previous step), we can use the `sevenbridges-python` binding library to simplify our interaction with the API. You should have already installed this library as described under the **Prerequisites** section. You may also wish to take a look at the [library Quickstart guide](http://sevenbridges-python.readthedocs.io/en/latest/quickstart/#authentication-and-configuration) before moving forward.

Before initializing the library, we recommend creating a config file to store your authentication token for use by the CGC API.

<div align="right"><a href="#top">top</a></div>

<a name="authentication-token"></a>
**Set up your authentication token**

Since we're now using the CGC API, we need to provide our authentication credentials. You can authenticate by storing your credentials in a config file, `$HOME/.sbgrc`. Enter your credentials in the config file, as shown below, replacing the placeholder on the last line with your authentication token:
[block:code]
{
  "codes": [
    {
      "code": "[cgc]\napi-url = 'https://cgc-api.sbgenomics.com/v2'\nauth-token = insert auth token here",
      "language": "python",
      "name": "Store your credentials"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

<a name="initialize"></a>
**Initialize the `sevenbridges-python` library**

Import the `Api` class from the official `sevenbridges-python` bindings and initialize an `api` object so the API knows our credentials.
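A minimal sketch of this initialization step, assuming `sevenbridges-python` is installed and that the `[cgc]` profile shown above is saved in `$HOME/.sbgrc`. The wrapper function `init_api` is a name introduced here for illustration only; `Config` and `Api` are the classes described in the library's Quickstart guide.

```python
def init_api(profile='cgc'):
    """Return an authenticated API object for the given config profile.

    Assumption: the profile name matches a section in $HOME/.sbgrc.
    The import is done lazily so that this definition can be loaded
    even where sevenbridges-python is not installed yet.
    """
    import sevenbridges as sbg

    config = sbg.Config(profile=profile)  # read credentials from the config file
    return sbg.Api(config=config)         # client used for all CGC API calls
```

Calling `api = init_api()` then gives you the `api` object that the remaining snippets in this tutorial rely on.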
<div align="right"><a href="#top">top</a></div>

<a name="access-data"></a>
**Access TCGA data**

Loop through the first ten files in the first item of the `files_in_query` list from above using the `id` key. We will now do the following with these ids:

1. Create a list of files on the CGC. From this point, it would be possible to take action on the CGC. For instance, you can [use a bioinformatics workflow or tool on these files and start an analysis](http://docs.cancergenomicscloud.org/docs/datasets-api-overview#section-interact-with-tcga-data).
2. (optional) Generate a list of access links.
3. Access each of the ten files in this list. They will be saved to a `downloads` folder in your current working directory. You could also modify this script to specify an alternative location for the files.
[block:code]
{
  "codes": [
    {
      "code": "# 1) Generate a list of file objects from the first ten file ids\nfile_list = []\nfor f in files_in_query[0]['_embedded']['files'][0:10]:\n    file_list.append(api.files.get(id = f['id']))\n    print(file_list[-1].name)\n\n# (BRANCH-POINT) Do something AWESOME with these files on the CGC\n\n\n# 2) (optional) Generate a list of download links\ndl_list = []\nfor f in file_list:\n    dl_list.append(f.download_info())\n\n\n# 3) Download each of the files in the list to a downloads folder in your local directory.\nimport os\n\ndl_dir = 'downloads'\ntry:\n    os.stat(dl_dir)       # check whether the folder already exists\nexcept OSError:\n    os.mkdir(dl_dir)      # create it if it does not\n\nfor f in file_list:\n    f.download(path = (\"%s/%s\" % (dl_dir, f.name)))",
      "language": "python",
      "name": "Access data"
    }
  ]
}
[/block]
That's it! You've successfully located TCGA data satisfying your query using the Datasets API and accessed it for further analysis using the CGC API. You can learn more about [querying TCGA with the Datasets API](doc:query-tcga-via-the-datasets-api) on our Knowledge Center.

Alternatively, you can [try Method two](#sparql) to query TCGA data from the SPARQL endpoint.
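The access snippet above only uses the first ten files of the first page. As a sketch of how you might gather the ids from every page instead, the helper below flattens a list shaped like `files_in_query` into a single list of ids. The name `collect_file_ids` and the example pages are hypothetical; only the `_embedded`, `files`, and `id` keys are taken from the responses used above.

```python
def collect_file_ids(pages):
    """Flatten paginated Datasets API responses into one list of file ids."""
    ids = []
    for page in pages:
        # Each page holds its records under page['_embedded']['files']
        for f in page.get('_embedded', {}).get('files', []):
            ids.append(f['id'])
    return ids


# Hypothetical pages, shaped like the items stored in files_in_query
example_pages = [
    {'_embedded': {'files': [{'id': 'id-1'}, {'id': 'id-2'}]}},
    {'_embedded': {'files': [{'id': 'id-3'}]}},
]
print(collect_file_ids(example_pages))
```

With real data you would call `collect_file_ids(files_in_query)` and pass any slice of the result to `api.files.get`.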
<div align="right"><a href="#top">top</a></div>

<a name="sparql"></a>
##Method two: the SPARQL console
In this section, we'll query TCGA using a SPARQL query. Then, we'll access the results of our query using the CGC API. This section contains explanations as well as Python snippets. You can also follow along in the [Jupyter notebook](https://github.com/sbg/okAPI/blob/master/Tutorials/CGC/access_TCGA_on_AWS.ipynb) for this method.

Alternatively, you can query TCGA using the [Datasets API](#datasets-api), as demonstrated in Method One above. The same query is issued in either method.

<div align="right"><a href="#top">top</a></div>

<a name="query-via-sparql"></a>
###Query using a SPARQL query
You can query TCGA data using the query language SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language). Seven Bridges has made a public SPARQL endpoint available to which you can send these queries. In addition to returning entities, a SPARQL query can also return properties such as TCGA metadata fields.

Since we'll use a Python script to query TCGA data with a SPARQL query, we'll need to import four modules. The first two, <a href="https://docs.python.org/2/library/urllib.html" target="blank">urllib</a> and <a href="https://rdflib.github.io/sparqlwrapper/" target="blank">SPARQLWrapper</a>, are used to check the OpenSPARQL endpoint and to construct the SPARQL object. We'll also need two other modules: `json` and `requests`. We'll use these to write a wrapper around the CGC API request in the next step.
[block:code]
{
  "codes": [
    {
      "code": "# Needed to query a RDF database of hosted TCGA data\nimport urllib\nimport SPARQLWrapper as spark",
      "language": "python",
      "name": "Import"
    }
  ]
}
[/block]
This tutorial relies on a public endpoint. First, ensure the endpoint is currently operational. Then, initialize the SPARQL object as below.
[block:code]
{
  "codes": [
    {
      "code": "# urlopen lives in different modules on Python 2 and Python 3\ntry:\n    from urllib.request import urlopen  # Python 3\nexcept ImportError:\n    from urllib import urlopen  # Python 2\n\n# Check SPARQL endpoint\ntry:\n    rc = urlopen(\"https://opensparql.sbgenomics.com\").getcode()\nexcept Exception:\n    rc = 0\nif rc != 200:\n    print(\n        \"\"\"script relies on sparql endpoint\n        (https://opensparql.sbgenomics.com/)\n        which is currently not responding.\n        Can not continue, exiting.\"\"\")\n    raise KeyboardInterrupt\nelse:\n    print(\"Endpoint is operational, we are good to go!\")\n\n\n# Initialize SPARQL object\nsparql_endpoint = \"https://opensparql.sbgenomics.com/blazegraph/namespace/tcga_metadata_kb/sparql\"\nsparql = spark.SPARQLWrapper(sparql_endpoint)",
      "language": "python",
      "name": "Check and initialize SPARQL"
    }
  ]
}
[/block]
Now, we can define a query for TCGA data based on its [metadata](doc:tcga-metadata-on-the-cgc). We want to search for **female**, **Breast Cancer** patients (**cases**) who are **alive** and the associated **files** which are **open-access**, provide **Gene expression**, and came from an **experimental strategy** of **RNA-Seq**. We will assign an exact value to each of these properties.

The query language consists of [RDF triple patterns](http://docs.cancergenomicscloud.org/v1.0/docs/query-tcga-metadata-programmatically#section-sparql-semantics) containing a subject, predicate, and object. The query below leaves a few objects unspecified, such as `?days_to_follow`. A specified object, such as `'Alive'` in `?vs rdfs:label 'Alive'`, must match that exact value. An unspecified object, such as `?days_to_follow` in `?case tcga:hasDaysToLastFollowUp ?days_to_follow`, only has to exist; its value is returned by the query and can subsequently be analyzed in Python. For example, we can directly display the distribution of Days to Last Followup (`?days_to_follow`) for all **cases** returned by this query.
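To illustrate that last point, here is a rough sketch of how the follow-up values could be binned once the query has been executed. It assumes the standard SPARQL JSON result layout (`results` → `bindings` → variable → `value`); the function name `days_histogram` and the example values are made up for illustration and are not real TCGA data. Because the query returns one row per file, each case is counted only once.

```python
from collections import Counter

# Hypothetical stand-in for the object returned by sparql.query().convert()
example_results = {
    'results': {
        'bindings': [
            {'case_id': {'value': 'case-A'}, 'days_to_follow': {'value': '120'}},
            {'case_id': {'value': 'case-A'}, 'days_to_follow': {'value': '120'}},
            {'case_id': {'value': 'case-B'}, 'days_to_follow': {'value': '300'}},
            {'case_id': {'value': 'case-C'}, 'days_to_follow': {'value': '455'}},
            {'case_id': {'value': 'case-D'}, 'days_to_follow': {'value': '980'}},
        ]
    }
}


def days_histogram(results, bin_width=365):
    """Count distinct cases per follow-up interval of `bin_width` days."""
    seen = set()
    counts = Counter()
    for row in results['results']['bindings']:
        case_id = row['case_id']['value']
        if case_id in seen:  # a case appears once per file; count it once
            continue
        seen.add(case_id)
        days = int(float(row['days_to_follow']['value']))
        counts[days // bin_width] += 1
    return dict(sorted(counts.items()))


# Print a simple text histogram, one line per interval
for interval, n in days_histogram(example_results).items():
    print("interval %d: %s" % (interval, '#' * n))
```

Binning by 365 days is an arbitrary choice; any plotting library could be used in place of the text output.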
We also include the unspecified object `?id` in `?file tcga:hasFileId ?id` so that this information is returned by the query. We will need it for accessing the file in the next step.

Below, we set the query and execute it. The query results are stored in an object named `results`.
[block:code]
{
  "codes": [
    {
      "code": "# Create the query above as a block-string\nquery = \"\"\"\n    prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n    prefix tcga: <https://www.sbgenomics.com/ontologies/2014/11/tcga#>\n\n    select distinct ?case_id ?file_name ?file ?id ?vital_status ?days_to_follow\n    where\n    {\n        ?case a tcga:Case .\n        ?case rdfs:label ?case_id .\n\n        ?case tcga:hasDiseaseType ?dt .\n        ?dt rdfs:label 'Breast Invasive Carcinoma' .\n\n        ?case tcga:hasGender ?gender .\n        ?gender rdfs:label 'FEMALE' .\n\n        ?case tcga:hasVitalStatus ?vs .\n        ?vs rdfs:label 'Alive' .\n\n        ?case tcga:hasDaysToLastFollowUp ?days_to_follow .\n\n        ?case tcga:hasFile ?file .\n\n        ?file rdfs:label ?file_name .\n        ?file tcga:hasFileId ?id .\n\n        ?file tcga:hasAccessLevel ?ac .\n        ?ac rdfs:label 'Open' .\n\n        ?file tcga:hasExperimentalStrategy ?es .\n        ?es rdfs:label 'RNA-Seq' .\n\n        ?file tcga:hasDataType ?dat .\n        ?dat rdfs:label 'Gene expression'\n    }\n\"\"\"\n\nsparql.setQuery(query)              # Define query on the wrapper\nsparql.setReturnFormat(spark.JSON)  # We want server to return JSON to use\nresults = sparql.query().convert()  # Convert results to Python object",
      "language": "python",
      "name": "Query body"
    }
  ]
}
[/block]
Now, we can find properties which are actionable in Python from the `results` object. Below, we extract two examples of properties, `UUID` and `Days to last followup`, which we can analyze in Python. Using this option, we can conduct further analysis on the data based on its metadata without downloading the files. Next, we pull out the properties needed to access the data, such as the file name and file id.
Finally, we print out summary stats about the query and list the first 10 results.
[block:code]
{
  "codes": [
    {
      "code": "bindings = results['results']['bindings']\n\n# Information (potentially actionable) about the query results\nuuid_list = [result['case_id']['value'] for result in bindings]\nday_to_follow_list = [result['days_to_follow']['value'] for result in bindings]\n\n# Information for accessing files within the query\n# (the query above does not select ?path, so only names and ids are extracted)\nfile_names = [result['file_name']['value'] for result in bindings]\nfile_ids = [result['file']['value'].split('/')[-1] for result in bindings]\n\n# Print some information about the query results\nprint(\"Query returned %i results, printing the first 10:\" % (len(uuid_list)))\nfor ii in range(0,min(10, len(uuid_list))):\n    print(\"Case UUID %s had %s days to last followup \\n\"\n          % (uuid_list[ii], day_to_follow_list[ii]))",
      "language": "python",
      "name": "List results"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

<a name="access-data-2"></a>
###Access queried TCGA data using the CGC API
Since we are now using the CGC API (as opposed to the SPARQL endpoint in the previous step), we can use the `sevenbridges-python` binding library to simplify our interaction with the API. If you have not already done so, set up your authentication token and initialize the library as described under [Method one](#authentication-token) above.

<div align="right"><a href="#top">top</a></div>

<a name="access-tcga-2"></a>
**Access TCGA data**

Now, we loop through the first ten ids in the `file_ids` list from above. We'll do the following with these ids:

1. Create a list of files on the CGC. From this point, it would be possible to take action on the CGC. For instance, you can [use a bioinformatics workflow or tool on these files and start an analysis](http://docs.cancergenomicscloud.org/docs/datasets-api-overview#section-interact-with-tcga-data).
2. (optional) Generate a list of access links.
3. Access each of the ten files in this list. They will be saved to a `downloads` folder in your current working directory. You could also modify this script to specify an alternative location for the files.
[block:code]
{
  "codes": [
    {
      "code": "# 1) Generate a list of file objects from the file_ids list\nfile_list = []\nfor f_id in file_ids[0:10]:\n    file_list.append(api.files.get(id = f_id))\n    print(file_list[-1].name)\n\n# (BRANCH-POINT) Do something AWESOME with these files on the CGC\n\n\n# 2) (optional) Generate a list of download links\ndl_list = []\nfor f in file_list:\n    dl_list.append(f.download_info())\n\n\n# 3) Download each of the files in the list to a downloads folder in your local directory.\nimport os\n\ndl_dir = 'downloads'\ntry:\n    os.stat(dl_dir)       # check whether the folder already exists\nexcept OSError:\n    os.mkdir(dl_dir)      # create it if it does not\n\nfor f in file_list:\n    f.download(path = (\"%s/%s\" % (dl_dir, f.name)))",
      "language": "python",
      "name": "Access data"
    }
  ]
}
[/block]
That's it! You've now successfully filtered TCGA data using a SPARQL query and accessed the data for further analysis. You can find more [examples of SPARQL queries](doc:sample-sparql-queries) on our Knowledge Center.

<div align="right"><a href="#top">top</a></div>

<a name="conclusion"></a>
##Conclusion
Congratulations! You've learned to query TCGA data hosted on AWS using either the Datasets API or a SPARQL query and then access this data using the CGC API.

Now, you have several options. For instance, you can use the file ids you obtained to [interact with the data on the CGC](doc:files). Alternatively, you can take the data you've accessed and use it with an analytical tool that is not available on the CGC. Or, if you used a SPARQL query, you can access metadata parameters (such as `disease`, `days_to_death`, and `radiation_therapy`) and use them in your own script, for example to compute a survival analysis in Python.

The next move is yours: take the data to the analytical tool and environment of your choice.
<div align="right"><a href="#top">top</a></div>