{"_id":"5643695c08894c0d00031ebd","__v":86,"version":{"_id":"55faf11ba62ba1170021a9aa","project":"55faf11ba62ba1170021a9a7","__v":37,"createdAt":"2015-09-17T16:58:03.490Z","releaseDate":"2015-09-17T16:58:03.490Z","categories":["55faf11ca62ba1170021a9ab","55faf8f4d0e22017005b8272","55faf91aa62ba1170021a9b5","55faf929a8a7770d00c2c0bd","55faf932a8a7770d00c2c0bf","55faf94b17b9d00d00969f47","55faf958d0e22017005b8274","55faf95fa8a7770d00c2c0c0","55faf96917b9d00d00969f48","55faf970a8a7770d00c2c0c1","55faf98c825d5f19001fa3a6","55faf99aa62ba1170021a9b8","55faf99fa62ba1170021a9b9","55faf9aa17b9d00d00969f49","55faf9b6a8a7770d00c2c0c3","55faf9bda62ba1170021a9ba","5604570090ee490d00440551","5637e8b2fbe1c50d008cb078","5649bb624fa1460d00780add","5671974d1b6b730d008b4823","5671979d60c8e70d006c9760","568e8eef70ca1f0d0035808e","56d0a2081ecc471500f1795e","56d4a0adde40c70b00823ea3","56d96b03dd90610b00270849","56fbb83d8f21c817002af880","573c811bee2b3b2200422be1","576bc92afb62dd20001cda85","5771811e27a5c20e00030dcd","5785191af3a10c0e009b75b0","57bdf84d5d48411900cd8dc0","57ff5c5dc135231700aed806","5804caf792398f0f00e77521","58458b4fba4f1c0f009692bb","586d3c287c6b5b2300c05055","58ef66d88646742f009a0216","58f5d52d7891630f00fe4e77"],"is_deprecated":false,"is_hidden":false,"is_beta":true,"is_stable":true,"codename":"","version_clean":"1.0.0","version":"1.0"},"category":{"_id":"57bdf84d5d48411900cd8dc0","version":"55faf11ba62ba1170021a9aa","__v":0,"project":"55faf11ba62ba1170021a9a7","sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-08-24T19:41:01.302Z","from_sync":false,"order":26,"slug":"api-hub","title":"API 
Hub"},"project":"55faf11ba62ba1170021a9a7","user":"554290cd6592e60d00027d17","parentDoc":null,"updates":["56d012648877db0b0065cb42","56e1aebdd2f9771900df1cba","56e325906e602e0e00700b50","571fbc74a0acd42000af958d"],"next":{"pages":[],"description":""},"createdAt":"2015-11-11T16:14:20.941Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":1,"body":"This Quickstart Guide leads you through a simple RNA sequencing analysis. It uses the CGC API, but its steps parallel the [Quickstart for the visual interface](http://docs.cancergenomicscloud.org/docs/quickstart). We have written this example in Python, but the concepts can be adapted to your preferred programming language. We encourage you to try this analysis yourself, as an aid to creating a script for your own custom analysis. The [documentation for the CGC API](http://docs.cancergenomicscloud.org/docs/the-cgc-api) is available here.\n\n#Objective:\nWe will use the API to create a project, add files to it, add a workflow, create and run a task, then download the outputs.\n[block:callout]\n{\n  \"type\": \"info\",\n  \"title\": \"Scripts\",\n  \"body\": \"An IPython notebook for the Python script built in this tutorial is available here: https://github.com/sbg/docs/blob/master/cgc/Cancer_Genomics_Cloud_API/CGC_API_quickstart.py\\n\\nYou can also see the Python script used in this Quickstart here: https://github.com/sbg/docs/blob/master/cgc/Cancer_Genomics_Cloud_API/CGC_API_quickstart.py\"\n}\n[/block]\n\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"title\": \"On this page\",\n  \"body\": \"* [Requirements](#section-requirements)\\n* [Preparatory work](#section-preparatory-work)\\n* [Cancer Genomics Cloud API Quickstart](#section-cancer-genomics-cloud-api-quickstart)\\n * 1 [Create a project](#section-1-create-a-project)\\n * 2 [Add files to the 
project](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section-2-add-files-to-the-project)\\n   * (a) [Copy files from an existing project via the API](#section--a-copy-files-from-an-existing-project-via-the-api)\\n   * (b) [Copy files from an existing project using the visual interface](#section--b-copy-files-from-an-existing-project-using-the-visual-interface)\\n   * (c) [Upload local files using the API and the command line uploader](#section--c-upload-local-files-using-the-api-and-the-command-line-uploader)\\n * 3 [Get a copy of the correct public workflow](#section-3-get-a-copy-of-the-correct-public-workflow)\\n * 4 [Build a file processing list for your analysis](#section-4-build-a-file-processing-list-for-your-analysis)\\n * 5 [Format, create, and start your tasks](#section-5-format-create-and-start-your-tasks)\\n * 6 [Check task completion](#section-6-check-task-completion)\\n * 7 [Download Files](#section-7-download-files)\\n   * [Visualize files on the CGC visual interface](#section-visualize-files-on-the-cgc-visual-interface)\\n   * [Download files via the API](#section-download-files-via-the-api)\"\n}\n[/block]\n#Requirements\nTo run the code in this tutorial, you will need:\n  * **Python version 2.7.x** (the code in this example uses Python 2 syntax and modules such as `urllib2`, so it will not run unchanged under Python 3.x)\n  * **The Python requests module**. If you do not already have this, it can be installed via Python's package management system, pip: `$ sudo pip install requests`\n  * **The authentication token associated with your account**, which you can get by going to [the Developer Dashboard](https://cgc.sbgenomics.com/account/#developer) after logging into your account. Remember to keep your authentication token secure!\n  * This project analyzes [TCGA Controlled Data](http://docs.cancergenomicscloud.org/docs/tcga-data-access) that is available on the CGC. To access the particular data file used here, you will need to have been awarded TCGA Controlled Data access through dbGaP. 
\n  * We show three ways of adding the controlled data file to your project. You can choose your preferred method:\n   * Find the file(s) you need with the **Case Explorer** and **Data Browser**. To learn more about this, follow the [QuickStart for the CGC visual interface](quickstart).\n   * Copy the file(s) from another project that you are a member of using the **CGC API.**\n   * Upload your own private data to analyze using the **CGC command line uploader**.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n\n#Preparatory work\nTo interact with the API, we send and receive data as JSON objects. Each JSON object received will represent one of the following:\n  * A requested resource, or resources, listed in the items field in the JSON array, \n  * An error, accompanied by text detailing the error in the message field. \nMost of the `GET`, `POST` and `PATCH` requests will only signal their success or failure by means of an HTTP status code in the response.\nFirst we import the necessary Python libraries and define the names of our new project and the desired workflow.\n[block:callout]\n{\n  \"type\": \"warning\",\n  \"body\": \"In the code below, please replace the `AUTH_TOKEN` string with your authentication token!\"\n}\n[/block]\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# IMPORTS\\nimport time as timer\\nfrom requests import request\\nimport json\\nfrom urllib2 import urlopen\\nimport os\\n\\n\\n# GLOBALS\\nFLAGS = {'targetFound': False,                  # target project exists in CGC project\\n         'taskRunning': False,                  # task is still running\\n         'startTasks': True                     # (False) create, but do NOT start tasks\\n        }\\n# project we will create in CGC (Settings > Project name in GUI)\\nTARGET_PROJECT = 'Quickstart_API'               \\nTARGET_APP = 'RNA-seq Alignment - STAR for TCGA PE tar' # app to use\\nINPUT_EXT = 'tar.gz'\\n\\n# TODO: replace AUTH_TOKEN with yours here\\nAUTH_TOKEN 
= 'AUTH_TOKEN'\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nSince we are going to write the functions that interact with the API in Python, we first prepare a helper that serializes the data we send to JSON and parses the JSON we receive. \n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"#  FUNCTIONS\\ndef api_call(path, method='GET', query=None, data=None, flagFullPath=False):\\n    \\\"\\\"\\\" Translates all the HTTP calls to interface with the CGC\\n\\n    This code adapted from the Seven Bridges platform API v1.1 example\\n    https://docs.sbgenomics.com/display/developerhub/Quickstart\\n    flagFullPath is novel, added to smoothly resolve pagination issues with the CGC API\\\"\\\"\\\"\\n    # only dicts and lists are serialized; anything else is sent without a body\\n    data = json.dumps(data) if isinstance(data, (dict, list)) else None\\n    base_url = 'https://cgc-api.sbgenomics.com/v2/'\\n\\n    headers = {\\n        'X-SBG-Auth-Token': AUTH_TOKEN,\\n        'Accept': 'application/json',\\n        'Content-type': 'application/json',\\n    }\\n\\n    if flagFullPath:\\n        response = request(method, path, params=query, data=data, headers=headers)\\n    else:\\n        response = request(method, base_url + path, params=query, data=data, headers=headers)\\n    response_dict = json.loads(response.content) if response.content else {}\\n\\n    if response.status_code / 100 != 2:\\n        # an error response may arrive without a 'message' field\\n        print response_dict.get('message', '')\\n        raise Exception('Server responded with status code %s.' % response.status_code)\\n    return response_dict\\n\\ndef hello():\\t# for debugging\\n    print(\\\"Is it me you're looking for?\\\")\\n    return True\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nWe will not only create objects but also interact with them, so this demo uses object-oriented programming. We have created a class, `API`, defined below. Generally, the API calls will either return a list of things (e.g. 
`myFiles` is plural) or a very detailed description of one thing (e.g. `myFile` is singular). The appropriate structure is created automatically in the `response_to_fields()` method. \n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"#  CLASSES\\nclass API(object):\\n    # making a class out of the api_call() function, adding other methods\\n    def __init__(self, path, method='GET', query=None, data=None, flagFullPath=False):\\n        self.flag = {'longList': False}\\n        response_dict = api_call(path, method, query, data, flagFullPath)\\n        self.response_to_fields(response_dict)\\n\\n        if self.flag['longList']:\\n            self.long_list(response_dict, path, method, query, data)\\n\\n    def response_to_fields(self,rd):\\n        if 'items' in rd.keys():  \\n            # get * {files, projects, tasks, apps} (object name plural)\\n            if len(rd['items']) > 0:\\n                self.list_read(rd)\\n            else:\\n                self.empty_read(rd)\\n        else:           \\n            # get details about ONE {file, project, task, app}  \\n            #  (object name singular)\\n            self.detail_read(rd)\\n\\n    def list_read(self,rd):\\n        n = len(rd['items'])\\n        keys = rd['items'][0].keys()\\n        m = len(keys)\\n\\n        for jj in range(m):\\n            temp = [None]*n\\n            for ii in range(n):\\n                temp[ii] = rd['items'][ii][keys[jj]]\\n            setattr(self, keys[jj], temp)\\n\\n        # 'and' short-circuits, so rd['links'] is only read when the key exists\\n        if ('links' in rd.keys()) and (len(rd['links']) > 0):\\n            self.flag['longList'] = True\\n\\n    def empty_read(self,rd):  # in case an empty project is queried\\n        self.href = []\\n        self.id = []\\n        self.name = []\\n        self.project = []\\n\\n    def detail_read(self,rd):\\n        keys = rd.keys()\\n        m = len(keys)\\n\\n        for jj in range(m):\\n            setattr(self, keys[jj], rd[keys[jj]])\\n\\n    def long_list(self, rd, path, method, query, 
data):\\n        prior = rd['links'][0]['rel']\\n        # Normally .rel[0] is the next, and .rel[1] is prior. \\n        # If .rel[0] = prior, then you are at END_OF_LIST\\n        keys = rd['items'][0].keys()\\n        m = len(keys)\\n\\n        while prior == 'next':\\n            rd = api_call(rd['links'][0]['href'], method, query, data, flagFullPath=True)\\n            prior = rd['links'][0]['rel']\\n            n = len(rd['items'])\\n            for jj in range(m):\\n                temp = getattr(self, keys[jj])\\n                for ii in range(n):\\n                    temp.append(rd['items'][ii][keys[jj]])\\n                setattr(self, keys[jj], temp)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n#Cancer Genomics Cloud API Quickstart\n\n##1. Create a project\n\nAll work on the CGC is carried out inside a [project](doc:projects-on-the-cgc). For this task, we can either use a project that has already been created, or we can use the API to create one. Here we will create a new project: `TARGET_PROJECT`, which we set in the definitions above to be 'Quickstart_API'. However, since we first want to check that the named project doesn't already exist, we'll also GET a list of all existing projects that you can access.\n\nThe project's name and description are sent in the call that creates the project, and its `billing_group` is set to [your Pilot Funds billing group](http://docs.cancergenomicscloud.org/docs/account-settings#section-payments). 
Note that we set the [project's tags](http://docs.cancergenomicscloud.org/docs/create-a-new-project#section-request-body) to `['tcga']` to indicate that it contains Controlled Data.\n\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"if __name__ == \\\"__main__\\\":\\n    # Did you remember to change the AUTH_TOKEN?\\n    if AUTH_TOKEN == 'AUTH_TOKEN':\\n        print \\\"You need to replace 'AUTH_TOKEN' string with your actual token. Please fix it.\\\"\\n        exit()\\n    # list all billing groups on your account\\n    billingGroups = API('billing/groups')\\n    # Select the first billing group, this is \\\"Pilot_funds(USER_NAME)\\\"\\n    print billingGroups.name[0], \\\\\\n    'will be charged for this computation. Approximate price is $4 for example STAR RNA seq (n=1) \\\\n'\\n \\n    # list all projects you are part of\\n    existingProjects = API(path='projects')     # make sure your project doesn't already exist\\n \\n    # set up the information for your new project\\n    NewProject = {\\n            'billing_group': billingGroups.id[0],\\n            'description': \\\"A project created by the API Quickstart\\\",\\n            'name': TARGET_PROJECT,\\n            'tags': ['tcga']\\n    }\\n \\n    # Check to make sure your project doesn't already exist on the platform\\n    for ii,p_name in enumerate(existingProjects.name):\\n        if TARGET_PROJECT == p_name:\\n            FLAGS['targetFound'] = True\\n            break\\n \\n    # Make a shiny, new project\\n    if FLAGS['targetFound']:\\n        myProject = API(path=('projects/' + existingProjects.id[ii]))    \\n        # GET existing project details (we need them later)\\n    else:\\n        myProject = API(method='POST', data=NewProject, path='projects') \\n        # POST new project\\n        # (re)list all projects, to check that new project posted\\n        existingProjects = API(path='projects')\\n        # GET new project details (we will need them later)\\n        myProject = 
API(path=('projects/' + existingProjects.id[0]))\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##2. Add files to the project\n\nHere we show three different options for adding data to a project:\n\n(a) [Copy files from an existing project using the API](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section--a-copy-files-from-an-existing-project-via-the-api)\n(b) [Copy files from an existing project using the visual interface](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section--b-copy-files-from-an-existing-project-using-the-visual-interface)\n(c) [Add files using the API and command line uploader.](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section--c-upload-local-files-using-the-api-and-the-command-line-uploader)\n\n**Follow one of these methods only.**\n\n###(a) Copy files from an existing project via the API\n\nThis method uses the project that you will have created if you followed the CGC [QuickStart](doc:quickstart). If you haven't yet followed that tutorial, go and do that first. 
Then you will have a project named 'Quickstart' that contains files we can use for our analysis.\n\nThe following code lets us look for the three files from that project and copy them over to our current project, `Quickstart_API`.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"if __name__ == \\\"__main__\\\":\\n    for ii,p_id in enumerate(existingProjects.id):\\n        # the name must match your source project exactly (case-sensitive)\\n        if existingProjects.name[ii] == 'Quickstart':\\n            filesToCopy = API(('files?limit=100&project=' + p_id))\\n            break\\n \\n    # Don't make extra copies of files \\n    # (loop through all files because we don't know what we want)\\n    # files currently in project\\n    myFiles = API(('files?limit=100&project=' + myProject.id))  \\n    \\n    for jj,f_name in enumerate(filesToCopy.name):\\n        # Conditional is HARDCODED for RNA Seq STAR workflow\\n        if f_name.endswith((INPUT_EXT, 'sta', 'gtf')):\\n            if f_name not in myFiles.name:\\n                # file currently not in project\\n                api_call(path=(filesToCopy.href[jj] + '/actions/copy'), method='POST',\\n                         data={'project': myProject.id, 'name': f_name}, flagFullPath=True)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###(b) Copy files from an existing project using the visual interface\n\nAgain, this method takes advantage of the project that you will have created if you followed the [CGC Quickstart](http://docs.cancergenomicscloud.org/docs/quickstart). So, if you haven't yet followed that tutorial, go and do that first. Then you will have a project named 'Quickstart' that contains files we can use for our analysis.\n\nTo copy those files into your project 'Quickstart_API' using the CGC visual interface:\n\n1. Select the project 'Quickstart_API' that you have just created.\n2. 
Go to the **Files** tab, and click **Add Files**.\n3. On the left-hand side, you will see a list of locations that you can add files from. Under projects you will see **'Quickstart'**. Click that project's name.\n4. Select the checkboxes next to the files in that project. Then click **Add to project**.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###(c) Upload local files using the API and the command line uploader\n\nTo use this option you need to have the CGC **command line uploader** installed already. Details of the uploader [are available here](http://docs.cancergenomicscloud.org/docs/upload-via-the-command-line). If you are using this script to call the uploader, make sure to set up your `$AUTH_TOKEN`.\n\nYou first need to find the IDs of your projects with:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"bin/cgc-uploader.sh --list-projects\",\n      \"language\": \"text\",\n      \"name\": \"Bash\"\n    }\n  ]\n}\n[/block]\nwhich will print:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"9e710b4e-148e-414f-99b0-26cfbc316719    Quickstart_API\\ne56092a9-482d-44fc-a98d-825a3c90c5d2    Quickstart\\n431d4397-8b7e-4d35-bb74-47865750aead    Open Data Project\",\n      \"language\": \"text\",\n      \"name\": \"Bash\"\n    }\n  ]\n}\n[/block]\nWe will copy the first string since it matches our project name. 
Then, add the following to the Python script:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"print \\\"You need to install the command line uploader before proceeding\\\"\\nToUpload = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz','ucsc.hg19.fasta','human_hg19_genes_2014.gtf']\\nfor ii in range(len(ToUpload)):\\n    # -p takes the ID printed for Quickstart_API by --list-projects; replace it with your own\\n    cmds = \\\"cd ~/cgc-uploader; bin/cgc-uploader.sh -p 9e710b4e-148e-414f-99b0-26cfbc316719 \\\" + \\\\\\n        \\\"/Users/digi/PycharmProjects/cgc_API/toUpload/\\\" + ToUpload[ii]   \\n    os.system(cmds)\\ndel cmds\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n\n[block:callout]\n{\n  \"type\": \"success\",\n  \"title\": \"File directory\",\n  \"body\": \"In the example code above, `/Users/digi/PycharmProjects/cgc_API/toUpload/` is the path of the directory containing files to upload. You should change this to the appropriate path on your own computer.\"\n}\n[/block]\nNow that your files are uploaded, it may be useful to set their metadata. For more information about metadata, please refer to the file metadata documentation page. We can use the [API call to set the file metadata](doc:modify-a-files-metadata). For this, we need the ID of the file we just uploaded, which identifies the file in the API. We can obtain the file ID by running the [API call to list project files](http://docs.cancergenomicscloud.org/docs/list-files-in-a-project), which returns the names and IDs for all the files in the project.\n[block:callout]\n{\n  \"type\": \"success\",\n  \"body\": \"See the [API overview](http://docs.cancergenomicscloud.org/v1.0/docs/the-cgc-api#section-identifying-projects-users-apps-files-tasks-and-inputs) for more information on referring to files, projects and other objects on the CGC.\"\n}\n[/block]\nOnce we have the file's ID, we can move on to setting its metadata. 
This is done via the request **PATCH /files/:file_id/metadata**, replacing :file_id with the file's ID, which is the request the code below sends. We include the metadata we want to set in the body of the request, in the form of a JSON dictionary. Below is an example of how this is done (replace with appropriate metadata for your own files):\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"singleFile = api_call(path=myFiles.href[1], flagFullPath=True) \\n# here we modify file #1, adapt appropriately\\n \\nmetadata = {           \\n     # this is made up metadata, adapt appropriately\\n    \\\"name\\\": singleFile['name'],\\n    \\\"library\\\":\\\"TEST\\\",\\n    \\\"file_type\\\": \\\"fastq\\\",\\n    \\\"sample\\\": \\\"example_human_Illumina\\\",\\n    \\\"seq_tech\\\": \\\"Illumina\\\",\\n    \\\"paired_end\\\": \\\"1\\\",\\n    'gender': \\\"female\\\",\\n    \\\"data_format\\\": \\\"awesome\\\"\\n}\\n  \\napi_call(path=(singleFile['href'] + '/metadata'), method='PATCH', \\\\\\n         data = metadata, flagFullPath=True)\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##3. Get a copy of the correct public workflow\n\nThere are more than 150 public apps available on the CGC. 
Here we query all of them, then copy the target workflow, `TARGET_APP`, which we set earlier to be 'RNA-seq Alignment - STAR for TCGA PE tar'.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"if __name__ == \\\"__main__\\\":    \\n    myFiles = API(('files?limit=100&project=' + myProject.id))   \\n    # GET files LIST, regardless of upload method\\n \\n    # Add a workflow (copy it from another project or the public apps, \\n    # not looping through all apps, we know exactly what we want)\\n    allApps = API(path='apps?limit=100&visibility=public')   \\n    # long function call, currently 183\\n    myApps = API(path=('apps?limit=100&project=' + myProject.id))\\n    if TARGET_APP not in allApps.name:\\n        print(\\\"Target app (%s) does not exist in the public repository. Please check the spelling\\\" \\\\\\n              % (TARGET_APP))\\n    else:\\n        ii = allApps.name.index(TARGET_APP)\\n        if TARGET_APP not in myApps.name:         \\n            # app not already in project\\n            temp_name = allApps.href[ii].split('/')[-2] # copy app from public repository\\n            api_call(path=('apps/' + allApps.project[ii] + '/' + temp_name + '/actions/copy'), \\\\\\n                     method='POST', data={'project': myProject.id, 'name': TARGET_APP})\\n            myApps = API(path=('apps?limit=100&project=' + myProject.id))   # update project app list\\n    del allApps\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n## 4. Build a file processing list for your analysis\n\nIt's likely that you'll only have one input file and two reference files in your project. However, if multiple input files were imported, the following code will create a batch of tasks -- one for each file. 
This code builds the list of files:\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"if __name__ == \\\"__main__\\\": \\n    # Build .fileProcessing (inputs) and .fileIndex (references) lists [for workflow]\\n    FileProcList = ['Files to Process']\\n    Ind_GtfFile = None\\n    Ind_FastaFile = None\\n \\n    for ii,f_name in enumerate(myFiles.name):\\n        # this conditional is for 'RNA seq STAR alignment' in   \\t\\n        # Quickstart_API. _Adapt_ appropriately for other workflows\\n        if f_name[-len(INPUT_EXT):] == INPUT_EXT:           # input file\\n            FileProcList.append(ii)\\n        elif f_name[-len('gtf'):] == 'gtf':\\n            Ind_GtfFile = ii\\n        elif f_name[-len('sta'):] == 'sta':\\n            Ind_FastaFile = ii\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##5. Format, create, and start your tasks\n\nNext we will iterate through the File Processing List `FileProcList` to generate one task for each input file.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"if __name__ == \\\"__main__\\\":\\n    myTaskList = [None]\\n    for ii,f_ind in enumerate(FileProcList[1:]):   \\n        # Start at 1 because FileProcList[0] is a header\\n        NewTask = {'description': 'APIs are awesome',\\n            'name': ('batch_task_' +  str(ii)),\\n            'app': (myApps.id[0]),    # ASSUMES only single workflow in project\\n            'project': myProject.id,\\n            'inputs': {\\n               'genomeFastaFiles': {   # .fasta reference file\\n                    'class': 'File',\\n                    'path': myFiles.id[Ind_FastaFile],\\n        \\n                              \\n               'name': myFiles.name[Ind_FastaFile]\\n                },\\n                'input_archive_file': {  # File Processing List\\n                    'class': 'File',\\n                    'path': myFiles.id[f_ind],\\n                    'name': 
myFiles.name[f_ind]\\n                },\\n                \\n              # .gtf reference file, !NOTE: this workflow expects a _list_ for this input\\n                'sjdbGTFfile': [\\n                   {\\n                    'class': 'File',\\n                    'path': myFiles.id[Ind_GtfFile],\\n                    'name': myFiles.name[Ind_GtfFile]\\n                   }\\n                ]\\n            }\\n        }\\n        # Create the tasks, run if FLAGS['startTasks']\\n        if FLAGS['startTasks']:\\n            myTask = api_call(method='POST', data=NewTask, path='tasks/', query={'action': 'run'})        # task created and run\\n            myTaskList.append(myTask['href'])\\n        else:\\n            myTask = api_call(method='POST', data=NewTask, path='tasks/')    # task created but NOT started\\n    myTaskList.pop(0)\\n \\n    print(\\\"%i tasks have been created. \\\\n\\\" % (ii+1))\\n    print(\\\"Enjoy a break, come back to us once you've got an email that tasks are done\\\")\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##6. Check task completion\n\nThese tasks may take a long time to complete. Here are two ways to check in on them:\n\n(a) Wait for email confirmation\n\nNo additional code is needed. 
An email will be sent to you with the status of your task when it completes.\n\n(b) Poll task status\nThe following script will poll each task every 10 minutes and move on once it has completed.\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"if __name__ == \\\"__main__\\\": \\n    # if tasks were started, check if they've finished\\n    for href in myTaskList:\\n        # check on one task at a time, if any running, can not continue (no sense to query others)\\n        print(\\\"Pinging CGC for task completion, will download files once all tasks completed.\\\")\\n        FLAGS['taskRunning'] = True\\n        while FLAGS['taskRunning']:\\n            task = api_call(path=href, flagFullPath=True)\\n            if task['status'] == 'COMPLETED':\\n                FLAGS['taskRunning'] = False\\n            elif task['status'] == 'FAILED':  # NOTE: leave loop on ANY failure\\n                print \\\"Task failed, can not continue\\\"\\n                exit()\\n            else:\\n                timer.sleep(600)  # still running: wait 10 minutes before polling again\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n##7. Download Files\n\nIt may be useful to quickly download some summary files to visualize the results.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###Visualize files on the CGC visual interface\n\nTo visualize the files produced by your task:\n1. Log in to the CGC, and go to the **Quickstart_API** project\n2. Click on the **Files** tab and select the files produced by the task. Clicking on any file will bring up its metadata and an option to visualize it. 
There is also an option to download the file.\n\n<div align=\"right\"><a href=\"#top\">top</a></div>\n\n###Download files via the API\n\nYou can do this by iterating through your `myFiles` list\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"from urllib2 import urlopen\\nimport os\\n  \\ndef download_files(fileList):\\n    # download a list of files from URLs\\n    dl_dir = 'downloads/'\\n    try:                    # make sure we have the download directory\\n        os.stat(dl_dir)\\n    except:\\n        os.mkdir(dl_dir)\\n  \\n    for ii in range(1, len(fileList)):  # skip first [0] entry, it is a text header\\n        url = fileList[ii]\\n        file_name = url.split('/')[-1]\\n        file_name = file_name.split('?')[0]\\n        file_name = file_name.split('%2B')[1]\\n        u = urlopen(url)\\n        f = open((dl_dir + file_name), 'wb')\\n        meta = u.info()\\n        file_size = int(meta.getheaders(\\\"Content-Length\\\")[0])\\n        print \\\"Downloading: %s Bytes: %s\\\" % (file_name, file_size)\\n  \\n        file_size_dl = 0\\n        block_sz = 1024*1024\\n        prior_percent = 0\\n        while True:\\n            buffer = u.read(block_sz)\\n            if not buffer:\\n                break\\n            file_size_dl += len(buffer)\\n            f.write(buffer)\\n            status = r\\\"%10d  [%3.2f%%]\\\" % (file_size_dl, file_size_dl * 100. / file_size)\\n            status = status + chr(8)*(len(status)+1)\\n            if (file_size_dl * 100. / file_size) > (prior_percent+20):\\n                print status + '\\\\n'\\n                prior_percent = (file_size_dl * 100. 
/ file_size)\\n        f.close()\\n  \\n# Check which files have been generated (only taking small files to avoid long times)\\nmyNewFiles = API(('files?project=' + myProject.id))  # calling again to see what was generated\\ndlList = [\\\"links to file downloads\\\"]\\n \\nfor ii, f_name in enumerate(myNewFiles.name):\\n    # downloading only the summary files. Adapt for whichever files you need\\n    if (f_name[-4:] == '.out'):\\n        dlList.append(api_call(path=('files/' + myNewFiles.id[ii] + '/download_info'))['url'])\\nT0 = timer.time()\\ndownload_files(dlList)\\nprint timer.time() - T0, \\\"seconds download time\\\"\",\n      \"language\": \"python\"\n    }\n  ]\n}\n[/block]\nGood luck and have fun!\n\n<div align=\"right\"><a href=\"#top\">top</a></div>","excerpt":"<a name=\"top\"></a>","slug":"api-quickstart","type":"basic","title":"API Quickstart"}
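
The `long_list` method in the tutorial body above keeps requesting pages while the first entry in `links` has `rel` equal to `next`. A minimal sketch of that pagination loop, separated from the network layer, might look like the following. It is written for Python 3 (unlike the Python 2.7 listings above), and `collect_items` and `fetch_page` are hypothetical names introduced here for illustration; they are not part of the CGC API or the tutorial script. The page dictionaries are made-up stand-ins that only mimic the `items` and `links` fields described above.

```python
def collect_items(first_page, fetch_page):
    """Gather 'items' from a paginated response by following 'next' links.

    first_page: dict with an 'items' list and an optional 'links' list
    fetch_page: hypothetical callable that takes an href and returns the
                next page as a dict of the same shape
    """
    items = list(first_page.get('items', []))
    page = first_page
    while True:
        # Look for a link whose 'rel' is 'next'; stop when there is none
        next_href = None
        for link in page.get('links', []):
            if link.get('rel') == 'next':
                next_href = link['href']
                break
        if next_href is None:
            return items
        page = fetch_page(next_href)
        items.extend(page.get('items', []))


# In-memory stand-in for the API, so the sketch runs without network access
pages = {
    'page2': {'items': [{'name': 'c'}],
              'links': [{'rel': 'prev', 'href': 'page1'}]},
}
first = {'items': [{'name': 'a'}, {'name': 'b'}],
         'links': [{'rel': 'next', 'href': 'page2'}]}
names = [item['name'] for item in collect_items(first, pages.get)]
print(names)
```

Searching every link for `rel == 'next'`, rather than reading only `links[0]`, is a design choice of this sketch: it also tolerates a page with no `links` field or an empty list.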

API Quickstart

<a name="top"></a>

This Quickstart Guide leads you through a simple RNA sequencing analysis. It uses the CGC API, but its steps parallel the [Quickstart for the visual interface](http://docs.cancergenomicscloud.org/docs/quickstart). We have written this example in Python, but the concepts can be adapted to your preferred programming language. We encourage you to try this analysis yourself, as an aid to creating a script for your own custom analysis. The full [documentation for the CGC API](http://docs.cancergenomicscloud.org/docs/the-cgc-api) is also available.

#Objective:
We will use the API to create a project, add files to it, add a workflow, create and run a task, then download the outputs.
[block:callout]
{
  "type": "info",
  "title": "Scripts",
  "body": "An IPython notebook for the Python script built in this tutorial is available here: https://github.com/sbg/docs/blob/master/cgc/Cancer_Genomics_Cloud_API/CGC_API_quickstart.py\n\nYou can also see the Python script used in this Quickstart here: https://github.com/sbg/docs/blob/master/cgc/Cancer_Genomics_Cloud_API/CGC_API_quickstart.py"
}
[/block]

[block:callout]
{
  "type": "warning",
  "title": "On this page",
  "body": "* [Requirements](#section-requirements)\n* [Preparatory work](#section-preparatory-work)\n* [Cancer Genomics Cloud API Quickstart](#section-cancer-genomics-cloud-api-quickstart)\n * 1 [Create a project](#section-1-create-a-project)\n * 2 [Add files to the project](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section-2-add-files-to-the-project)\n * (a) [Copy files from an existing project via the API](#section--a-copy-files-from-an-existing-project-via-the-api)\n * (b) [Copy files from an existing project using the visual interface](#section--b-copy-files-from-an-existing-project-using-the-visual-interface)\n * (c) [Upload local files using the API and the command line uploader](#section--c-upload-local-files-using-the-api-and-the-command-line-uploader)\n * 3 [Get a copy of the correct public workflow](#section-3-get-a-copy-of-the-correct-public-workflow)\n * 4 [Build a file processing list for your analysis](#section-4-build-a-file-processing-list-for-your-analysis)\n * 5 [Format, create, and start your tasks](#section-5-format-create-and-start-your-tasks)\n * 6 [Check task completion](#section-6-check-task-completion)\n * 7 [Download Files](#section-7-download-files)\n * [Visualize files on the CGC visual interface](#section-visualize-files-on-the-cgc-visual-interface)\n * [Download files via the API](#section-download-files-via-the-api)"
}
[/block]

#Requirements
To run the code in this tutorial, you will need:

* **Python version 2.7.x** (Python 3.x is not fully compatible with the code used in this example)
* **The Python requests module**. If you do not already have it, you can install it with pip, Python's package manager: `$ sudo pip install requests`
* **The authentication token associated with your account**, which you can get from [the Developer Dashboard](https://cgc.sbgenomics.com/account/#developer) after logging into your account. Remember to keep your authentication token secure!
* This project analyzes [TCGA Controlled Data](http://docs.cancergenomicscloud.org/docs/tcga-data-access) that is available on the CGC. To access the particular data file used here, you will need to have been awarded TCGA Controlled Data access through dbGaP.
* We show three ways of adding the controlled data file to your project. You can choose your preferred method:
 * Find the file(s) you need with the **Case Explorer** and **Data Browser**. To learn more about this, follow the [QuickStart for the CGC visual interface](quickstart).
 * Copy the file(s) from another project that you are a member of using the **CGC API.**
 * Upload your own private data to analyze using the **CGC command line uploader**.

<div align="right"><a href="#top">top</a></div>

#Preparatory work
To interact with the API, we send and receive data as JSON objects.
Each JSON object received will represent one of the following:

* A requested resource, or resources, listed in the `items` field of the JSON response,
* An error, accompanied by text detailing the error in the `message` field.

Most of the `GET`, `POST` and `PATCH` requests will only signal their success or failure by means of an HTTP status code in the response.

First we import the necessary Python libraries and define the names of our new project and the desired workflow.
[block:callout]
{
  "type": "warning",
  "body": "In the code below, please replace the `AUTH_TOKEN` string with your authentication token!"
}
[/block]

[block:code]
{
  "codes": [
    {
      "code": "# IMPORTS\nimport time as timer\nfrom requests import request\nimport json\nfrom urllib2 import urlopen\nimport os\n\n\n# GLOBALS\nFLAGS = {'targetFound': False,   # target project exists in CGC project\n         'taskRunning': False,   # task is still running\n         'startTasks': True      # (False) create, but do NOT start tasks\n        }\n# project we will create in CGC (Settings > Project name in GUI)\nTARGET_PROJECT = 'Quickstart_API'\nTARGET_APP = 'RNA-seq Alignment - STAR for TCGA PE tar'  # app to use\nINPUT_EXT = 'tar.gz'\n\n# TODO: replace AUTH_TOKEN with yours here\nAUTH_TOKEN = 'AUTH_TOKEN'",
      "language": "python"
    }
  ]
}
[/block]
Since we are going to write the functions that interact with the API in Python, we'll prepare a function that converts the data we send into JSON and decodes the JSON we receive.
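
To make the two response shapes described above concrete, here is a minimal sketch that runs without a network connection, on Python 2.7 or 3. The helper name `parse_response` and the sample response bodies are invented for illustration; they are not part of the tutorial script or of the CGC API.

```python
import json


def parse_response(body, status_code):
    """Decode a JSON response body and sort it into one of three shapes.

    Returns a (kind, payload) pair where kind is 'error', 'list' or 'detail'.
    Hypothetical helper, not part of the tutorial script.
    """
    response_dict = json.loads(body) if body else {}
    if status_code // 100 != 2:
        # Errors carry their explanation in the 'message' field
        return 'error', response_dict.get('message', '')
    if 'items' in response_dict:
        # Several resources are listed under 'items'
        return 'list', response_dict['items']
    # Anything else is treated here as a single resource
    return 'detail', response_dict


# What we send is serialized with json.dumps ...
request_body = json.dumps({'name': 'Quickstart_API'})
# ... and what comes back is decoded with json.loads (made-up sample body).
kind, payload = parse_response('{"items": [{"id": "a"}, {"id": "b"}]}', 200)
```

The sketch uses `//` so that the status check is an integer division on both Python 2 and Python 3. The `api_call` function used throughout the rest of this tutorial, which performs the actual HTTP requests, follows.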
[block:code]
{
  "codes": [
    {
      "code": "# FUNCTIONS\ndef api_call(path, method='GET', query=None, data=None, flagFullPath=False):\n    \"\"\" Translates all the HTTP calls to interface with the CGC\n\n    This code adapted from the Seven Bridges platform API v1.1 example\n    https://docs.sbgenomics.com/display/developerhub/Quickstart\n    flagFullPath is novel, added to smoothly resolve pagination issues with the CGC API\"\"\"\n    data = json.dumps(data) if isinstance(data, dict) or isinstance(data,list) else None\n    base_url = 'https://cgc-api.sbgenomics.com/v2/'\n\n    headers = {\n        'X-SBG-Auth-Token': AUTH_TOKEN,\n        'Accept': 'application/json',\n        'Content-type': 'application/json',\n    }\n\n    if flagFullPath:\n        response = request(method, path, params=query, data=data, headers=headers)\n    else:\n        response = request(method, base_url + path, params=query, data=data, headers=headers)\n    response_dict = json.loads(response.content) if response.content else {}\n\n    if response.status_code / 100 != 2:\n        print response_dict['message']\n        raise Exception('Server responded with status code %s.' % response.status_code)\n    return response_dict\n\ndef hello():  # for debugging\n    print(\"Is it me you're looking for?\")\n    return True",
      "language": "python"
    }
  ]
}
[/block]
We will not only create objects but also interact with them, so this demo also uses object-oriented programming. We have created a class, `API`, defined below. Generally, the API calls will either return a list of things (e.g. `myFiles` is plural) or a very detailed description of one thing (e.g. `myFile` is singular). The appropriate structure is created automatically in the `response_to_fields()` method.
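
The list case is the less obvious of the two. A list of records is turned into one attribute per field, each holding a list with one entry per record, so that `myFiles.name[ii]` and `myFiles.id[ii]` describe the same file. The sketch below shows that idea in isolation, using made-up records. `items_to_columns` and `Listing` are hypothetical names, and unlike the tutorial's class this sketch makes no HTTP calls and ignores pagination.

```python
def items_to_columns(items):
    """Turn a list of records into a dict of parallel lists, one per field."""
    if not items:
        return {}
    # list() so that the keys can be reused on both Python 2 and Python 3
    keys = list(items[0].keys())
    return {key: [item[key] for item in items] for key in keys}


class Listing(object):
    """Hypothetical stand-in for the tutorial's API class, without any HTTP."""

    def __init__(self, response_dict):
        if 'items' in response_dict:
            # plural: one list-valued attribute per field
            source = items_to_columns(response_dict['items'])
        else:
            # singular: one attribute per field of the single object
            source = response_dict
        for key, value in source.items():
            setattr(self, key, value)


# Made-up responses, shaped like a file listing and a single file
files = Listing({'items': [{'id': 'f1', 'name': 'a.gtf'},
                           {'id': 'f2', 'name': 'b.tar.gz'}]})
one_file = Listing({'id': 'f1', 'name': 'a.gtf'})
```

Here `files.name` is a list while `one_file.name` is a single string, which is the plural/singular distinction described above. The tutorial's full `API` class follows.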
[block:code]
{
  "codes": [
    {
      "code": "# CLASSES\nclass API(object):\n    # making a class out of the api() function, adding other methods\n    def __init__(self, path, method='GET', query=None, data=None, flagFullPath=False):\n        self.flag = {'longList': False}\n        response_dict = api_call(path, method, query, data, flagFullPath)\n        self.response_to_fields(response_dict)\n\n        if self.flag['longList']:\n            self.long_list(response_dict, path, method, query, data)\n\n    def response_to_fields(self,rd):\n        if 'items' in rd.keys():\n            # get * {files, projects, tasks, apps} (object name plural)\n            if len(rd['items']) > 0:\n                self.list_read(rd)\n            else:\n                self.empty_read(rd)\n        else:\n            # get details about ONE {file, project, task, app}\n            # (object name singular)\n            self.detail_read(rd)\n\n    def list_read(self,rd):\n        n = len(rd['items'])\n        keys = rd['items'][0].keys()\n        m = len(keys)\n\n        for jj in range(m):\n            temp = [None]*n\n            for ii in range(n):\n                temp[ii] = rd['items'][ii][keys[jj]]\n            setattr(self, keys[jj], temp)\n\n        # 'and' short-circuits, so rd['links'] is only read when it exists\n        if ('links' in rd.keys()) and (len(rd['links']) > 0):\n            self.flag['longList'] = True\n\n    def empty_read(self,rd):  # in case an empty project is queried\n        self.href = []\n        self.id = []\n        self.name = []\n        self.project = []\n\n    def detail_read(self,rd):\n        keys = rd.keys()\n        m = len(keys)\n\n        for jj in range(m):\n            setattr(self, keys[jj], rd[keys[jj]])\n\n    def long_list(self, rd, path, method, query, data):\n        prior = rd['links'][0]['rel']\n        # Normally .rel[0] is the next, and .rel[1] is prior.\n        # If .rel[0] = prior, then you are at END_OF_LIST\n        keys = rd['items'][0].keys()\n        m = len(keys)\n\n        while prior == 'next':\n            rd = api_call(rd['links'][0]['href'], method, query, data, flagFullPath=True)\n            prior = rd['links'][0]['rel']\n            n = len(rd['items'])\n            for jj in range(m):\n                temp = getattr(self, keys[jj])\n                for ii in range(n):\n                    temp.append(rd['items'][ii][keys[jj]])\n                setattr(self, keys[jj], temp)",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

#Cancer Genomics Cloud API Quickstart

##1. Create a project
All work on the CGC is carried out inside a [project](doc:projects-on-the-cgc). For this task, we can either use a project that has already been created, or we can use the API to create one. Here we will create a new project, `TARGET_PROJECT`, which we set in the definitions above to be 'Quickstart_API'. However, since we first want to check that the named project doesn't already exist, we'll also GET a list of all the projects you can access.

The project's name and description are sent in the call that creates the project, and its `billing_group` is set to [your Pilot Funds billing group](http://docs.cancergenomicscloud.org/docs/account-settings#section-payments). Note that we set the [project's tags](http://docs.cancergenomicscloud.org/docs/create-a-new-project#section-request-body) to `['tcga']` to indicate that it contains Controlled Data.
[block:code]
{
  "codes": [
    {
      "code": "if __name__ == \"__main__\":\n    # Did you remember to change the AUTH_TOKEN?\n    if AUTH_TOKEN == 'AUTH_TOKEN':\n        print \"You need to replace 'AUTH_TOKEN' string with your actual token. Please fix it.\"\n        exit()\n    # list all billing groups on your account\n    billingGroups = API('billing/groups')\n    # Select the first billing group, this is \"Pilot_funds(USER_NAME)\"\n    print billingGroups.name[0], \\\n        'will be charged for this computation. Approximate price is $4 for example STAR RNA seq (n=1) \\n'\n\n    # list all projects you are part of\n    existingProjects = API(path='projects')  # make sure your project doesn't already exist\n\n    # set up the information for your new project\n    NewProject = {\n        'billing_group': billingGroups.id[0],\n        'description': \"A project created by the API Quickstart\",\n        'name': TARGET_PROJECT,\n        'tags': ['tcga']\n    }\n\n    # Check to make sure your project doesn't already exist on the platform\n    for ii,p_name in enumerate(existingProjects.name):\n        if TARGET_PROJECT == p_name:\n            FLAGS['targetFound'] = True\n            break\n\n    # Make a shiny, new project\n    if FLAGS['targetFound']:\n        # GET existing project details (we need them later)\n        myProject = API(path=('projects/' + existingProjects.id[ii]))\n    else:\n        # POST new project\n        myProject = API(method='POST', data=NewProject, path='projects')\n        # (re)list all projects, to check that new project posted\n        existingProjects = API(path='projects')\n        # GET new project details (we need them later)\n        myProject = API(path=('projects/' + existingProjects.id[0]))",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

##2. Add files to the project
Here we show three different options for adding data to a project:

(a) [Copy files from an existing project using the API](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section--a-copy-files-from-an-existing-project-via-the-api)

(b) [Copy files from an existing project using the visual interface](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section--b-copy-files-from-an-existing-project-using-the-visual-interface)

(c) [Add files using the API and command line uploader.](http://docs.cancergenomicscloud.org/v1.0/docs/api-quickstart#section--c-upload-local-files-using-the-api-and-the-command-line-uploader)

**Follow one of these methods only.**

###(a) Copy files from an existing project via the API
Here we take advantage of the project that you will have created if you followed the CGC [QuickStart](doc:quickstart). If you haven't yet followed that tutorial, go and do that first. You will then have a project named 'Quickstart' that contains files we can use for our analysis. The following code looks for the three files we need in that project and copies them over to our current project, `Quickstart_API`.
[block:code]
{
  "codes": [
    {
      "code": "if __name__ == \"__main__\":\n    for ii,p_id in enumerate(existingProjects.id):\n        # the name must match your existing project exactly (case-sensitive)\n        if existingProjects.name[ii] == 'Quickstart':\n            filesToCopy = API(('files?limit=100&project=' + p_id))\n            break\n\n    # Don't make extra copies of files\n    # (loop through all files because we don't know what we want)\n    # files currently in project\n    myFiles = API(('files?limit=100&project=' + myProject.id))\n\n    for jj,f_name in enumerate(filesToCopy.name):\n        # Conditional is HARDCODED for RNA Seq STAR workflow\n        if f_name[-len(INPUT_EXT):] == INPUT_EXT or f_name[-len('sta'):] \\\n                == 'sta' or f_name[-len('gtf'):] == 'gtf':\n            if f_name not in myFiles.name:\n                # file currently not in project\n                api_call(path=(filesToCopy.href[jj] + '/actions/copy'), method='POST', \\\n                    data={'project': myProject.id, 'name':f_name} ,flagFullPath=True)",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

###(b) Copy files from an existing project using the visual interface
Again, this method takes advantage of the project that you will have created if you followed the [CGC Quickstart](http://docs.cancergenomicscloud.org/docs/quickstart). If you haven't yet followed that tutorial, go and do that first. You will then have a project named 'Quickstart' that contains files we can use for our analysis.

To copy those files into your project 'Quickstart_API' using the CGC visual interface:

1. Select the project 'Quickstart_API' that you have just created.
2. Go to the **Files** tab, and click **Add Files**.
3. On the left hand side, you will see a list of locations that you can add files from. Under projects you will see **'Quickstart'**. Click that project's name.
4. Select the checkboxes next to the files in that project. Then click **Add to project**.
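
Whichever of the two copy methods you used, a script can confirm that the project now holds the three kinds of file the workflow needs before going further. The following is a minimal sketch of such a check, assuming the same three file-name endings that the code in (a) matches; `missing_inputs` is a hypothetical helper, not part of the tutorial script, and it works on a plain list of names so it can be tried without calling the API.

```python
# Same endings the tutorial code matches: input archive, FASTA, GTF
REQUIRED_SUFFIXES = ('tar.gz', 'sta', 'gtf')


def missing_inputs(file_names, required=REQUIRED_SUFFIXES):
    """Return the required suffixes that no file name in the list ends with."""
    return [suffix for suffix in required
            if not any(name.endswith(suffix) for name in file_names)]


# Example with two of the three file names used later in this tutorial
project_files = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz',
                 'human_hg19_genes_2014.gtf']
still_needed = missing_inputs(project_files)
```

In the tutorial script you could pass the `name` list of a fresh file listing for the project; an empty result means all three inputs are present.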
<div align="right"><a href="#top">top</a></div>

###(c) Upload local files using the API and the command line uploader
To use this option you need to have the CGC **command line uploader** installed already. Details of the uploader [are available here](http://docs.cancergenomicscloud.org/docs/upload-via-the-command-line). If you are using this script to call the uploader, make sure to set up your `$AUTH_TOKEN`.

You first need to find the IDs of your projects with:
[block:code]
{
  "codes": [
    {
      "code": "bin/cgc-uploader.sh --list-projects",
      "language": "text",
      "name": "Bash"
    }
  ]
}
[/block]
which will print:
[block:code]
{
  "codes": [
    {
      "code": "9e710b4e-148e-414f-99b0-26cfbc316719 Quickstart_API\ne56092a9-482d-44fc-a98d-825a3c90c5d2 Quickstart\n431d4397-8b7e-4d35-bb74-47865750aead Open Data Project",
      "language": "text",
      "name": "Bash"
    }
  ]
}
[/block]
We will copy the first string, since it is the ID listed next to our project name. Then, add the following to the Python script:
[block:code]
{
  "codes": [
    {
      "code": "print \"You need to install the command line uploader before proceeding\"\nToUpload = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz','ucsc.hg19.fasta','human_hg19_genes_2014.gtf']\nfor ii in range(len(ToUpload)):\n    # -p takes the project ID copied from the listing above; use your own\n    cmds = \"cd ~/cgc-uploader; bin/cgc-uploader.sh -p 9e710b4e-148e-414f-99b0-26cfbc316719 \" + \\\n        \"/Users/digi/PycharmProjects/cgc_API/toUpload/\" + ToUpload[ii]\n    os.system(cmds)\ndel cmds",
      "language": "python"
    }
  ]
}
[/block]

[block:callout]
{
  "type": "success",
  "title": "File directory",
  "body": "In the example code above, `/Users/digi/PycharmProjects/cgc_API/toUpload/` is the path of the directory containing files to upload. You should change this to the appropriate path on your own computer."
}
[/block]
Now that your files are uploaded, it may be useful to set their metadata. For more information about metadata, please refer to the file metadata documentation page.
Once the file is uploaded, we can use the [API call to set the file metadata](doc:modify-a-files-metadata). For this, we need to know the ID of the file we just uploaded; this is the identifier used to refer to the file in the API. We can obtain the file ID by running the [API call to list project files](http://docs.cancergenomicscloud.org/docs/list-files-in-a-project), which returns the names and IDs of all the files in the project.
[block:callout]
{
  "type": "success",
  "body": "See the [API overview](http://docs.cancergenomicscloud.org/v1.0/docs/the-cgc-api#section-identifying-projects-users-apps-files-tasks-and-inputs) for more information on referring to files, projects and other objects on the CGC."
}
[/block]
Once we have the file's ID, we can move on to setting its metadata. As in the code below, this is done via the request **PATCH /files/:file_id/metadata**, replacing :file_id with the file's ID. We include the metadata we want to set in the body of the request, in the form of a JSON dictionary. Below is an example of how this is done (replace with appropriate metadata for your own files):
[block:code]
{
  "codes": [
    {
      "code": "singleFile = api_call(path=myFiles.href[1], flagFullPath=True)\n# here we modify file #1, adapt appropriately\n\nmetadata = {\n    # this is made up metadata, adapt appropriately\n    \"name\": singleFile['name'],\n    \"library\":\"TEST\",\n    \"file_type\": \"fastq\",\n    \"sample\": \"example_human_Illumina\",\n    \"seq_tech\": \"Illumina\",\n    \"paired_end\": \"1\",\n    'gender': \"female\",\n    \"data_format\": \"awesome\"\n}\n\napi_call(path=(singleFile['href'] + '/metadata'), method='PATCH', \\\n    data = metadata, flagFullPath=True)",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

##3. Get a copy of the correct public workflow
There are more than 150 public apps available on the CGC.
Here we query all of them, then copy the target workflow, `TARGET_APP`, which we set earlier to be 'RNA-seq Alignment - STAR for TCGA PE tar'.
[block:code]
{
  "codes": [
    {
      "code": "if __name__ == \"__main__\":\n    myFiles = API(('files?limit=100&project=' + myProject.id))\n    # GET files LIST, regardless of upload method\n\n    # Add a workflow (copy it from another project or the public apps,\n    # not looping through all apps, we know exactly what we want)\n    allApps = API(path='apps?limit=100&visibility=public')\n    # long function call, currently 183\n    myApps = API(path=('apps?limit=100&project=' + myProject.id))\n    if TARGET_APP not in allApps.name:\n        print(\"Target app (%s) does not exist in the public repository. Please check the spelling\" \\\n            % (TARGET_APP))\n    else:\n        ii = allApps.name.index(TARGET_APP)\n        if TARGET_APP not in myApps.name:\n            # app not already in project\n            temp_name = allApps.href[ii].split('/')[-2]  # copy app from public repository\n            api_call(path=('apps/' + allApps.project[ii] + '/' + temp_name + '/actions/copy'), \\\n                method='POST', data={'project': myProject.id, 'name': TARGET_APP})\n            myApps = API(path=('apps?limit=100&project=' + myProject.id))  # update project app list\n    del allApps",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

##4. Build a file processing list for your analysis
It's likely that you'll only have one input file and two reference files in your project. However, if multiple input files were imported, the following code will create a batch of tasks, one for each input file. This code builds the list of files:
[block:code]
{
  "codes": [
    {
      "code": "if __name__ == \"__main__\":\n    # Build .fileProcessing (inputs) and .fileIndex (references) lists [for workflow]\n    FileProcList = ['Files to Process']\n    Ind_GtfFile = None\n    Ind_FastaFile = None\n\n    for ii,f_name in enumerate(myFiles.name):\n        # this conditional is for 'RNA seq STAR alignment' in\n        # Quickstart_API. _Adapt_ appropriately for other workflows\n        if f_name[-len(INPUT_EXT):] == INPUT_EXT:  # input file\n            FileProcList.append(ii)\n        elif f_name[-len('gtf'):] == 'gtf':\n            Ind_GtfFile = ii\n        elif f_name[-len('sta'):] == 'sta':\n            Ind_FastaFile = ii",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

##5. Format, create, and start your tasks
Next we will iterate through the File Processing List `FileProcList` to generate one task for each input file.
[block:code]
{
  "codes": [
    {
      "code": "if __name__ == \"__main__\":\n    myTaskList = [None]\n    for ii,f_ind in enumerate(FileProcList[1:]):\n        # Start at 1 because FileProcList[0] is a header\n        NewTask = {'description': 'APIs are awesome',\n            'name': ('batch_task_' + str(ii)),\n            'app': (myApps.id[0]),  # ASSUMES only single workflow in project\n            'project': myProject.id,\n            'inputs': {\n                'genomeFastaFiles': {  # .fasta reference file\n                    'class': 'File',\n                    'path': myFiles.id[Ind_FastaFile],\n                    'name': myFiles.name[Ind_FastaFile]\n                },\n                'input_archive_file': {  # File Processing List\n                    'class': 'File',\n                    'path': myFiles.id[f_ind],\n                    'name': myFiles.name[f_ind]\n                },\n                # .gtf reference file, !NOTE: this workflow expects a _list_ for this input\n                'sjdbGTFfile': [\n                    {\n                        'class': 'File',\n                        'path': myFiles.id[Ind_GtfFile],\n                        'name': myFiles.name[Ind_GtfFile]\n                    }\n                ]\n            }\n        }\n        # Create the tasks, run if FLAGS['startTasks']\n        if FLAGS['startTasks']:\n            myTask = api_call(method='POST', data=NewTask, path='tasks/', query={'action': 'run'})  # task created and run\n            myTaskList.append(myTask['href'])\n        else:\n            myTask = api_call(method='POST', data=NewTask, path='tasks/')  # task created but not started\n    myTaskList.pop(0)  # drop the None placeholder\n\n    print(\"%i tasks have been created. \\n\" % (ii+1))\n    print(\"Enjoy a break, come back to us once you've got an email that tasks are done\")",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

##6. Check task completion
These tasks may take a long time to complete.
Here are two ways to check in on them:

(a) Wait for email confirmation

No additional code is needed. An email with the status of your task will be sent to you when it completes.

(b) Poll task status

The following script polls each task every 10 minutes and moves on once it has completed.
[block:code]
{
  "codes": [
    {
      "code": "if __name__ == \"__main__\":\n    # if tasks were started, check if they've finished\n    for href in myTaskList:\n        # check on one task at a time, if any running, can not continue (no sense to query others)\n        print(\"Pinging CGC for task completion, will download files once all tasks completed.\")\n        FLAGS['taskRunning'] = True\n        while FLAGS['taskRunning']:\n            task = api_call(path=href, flagFullPath=True)\n            if task['status'] == 'COMPLETED':\n                FLAGS['taskRunning'] = False\n            elif task['status'] == 'FAILED':  # NOTE: leave loop on ANY failure\n                print \"Task failed, can not continue\"\n                exit()\n            else:\n                timer.sleep(600)  # still running, wait 10 minutes before asking again",
      "language": "python"
    }
  ]
}
[/block]

<div align="right"><a href="#top">top</a></div>

##7. Download Files
It may be useful to quickly download some summary files to visualize the results.

<div align="right"><a href="#top">top</a></div>

###Visualize files on the CGC visual interface
To visualize the files produced by your task:

1. Log in to the CGC, and go to the **Quickstart_API** project.
2. Click on the **Files** tab and select the files produced by the task. Clicking on any file will bring up its metadata and an option to visualize it. There is also an option to download the file.
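
The output files only appear in the project once the tasks from step 6 have finished. If you would rather have a script wait for that than watch the interface, the polling loop from step 6 can be sketched as a small reusable function. This is a sketch under assumptions, not the tutorial's own code: `wait_for_tasks`, its `get_status` and `sleep` parameters, and the fake status source are all hypothetical, and any status other than `COMPLETED` or `FAILED` is simply treated as "still running".

```python
import time


def wait_for_tasks(task_hrefs, get_status, interval=600, sleep=time.sleep):
    """Block until every task reports COMPLETED; return False once one FAILED.

    get_status is any callable that maps a task href to its status string.
    """
    for href in task_hrefs:
        while True:
            status = get_status(href)
            if status == 'COMPLETED':
                break
            if status == 'FAILED':
                return False
            sleep(interval)  # wait before asking again
    return True


# Stand-in status source: each task reports RUNNING once, then COMPLETED
_calls = {}


def fake_status(href):
    _calls[href] = _calls.get(href, 0) + 1
    return 'RUNNING' if _calls[href] == 1 else 'COMPLETED'


naps = []  # records the requested waits instead of actually sleeping
all_done = wait_for_tasks(['task/1', 'task/2'], fake_status,
                          interval=600, sleep=naps.append)
```

In the tutorial script, `get_status` could be a function that calls `api_call(path=href, flagFullPath=True)` and returns the `status` field of the result, with `myTaskList` as the list of task hrefs. Passing the status source in as a parameter keeps the waiting logic testable without a network connection.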
<div align="right"><a href="#top">top</a></div>

###Download files via the API
You can do this by listing the project's files again (as `myNewFiles`), requesting a download URL for each file you want, and passing the list of URLs to a download function:
[block:code]
{
  "codes": [
    {
      "code": "from urllib2 import urlopen\nimport os\n\ndef download_files(fileList):\n    # download a list of files from URLs\n    dl_dir = 'downloads/'\n    try:                    # make sure we have the download directory\n        os.stat(dl_dir)\n    except OSError:\n        os.mkdir(dl_dir)\n\n    for ii in range(1, len(fileList)):  # skip first [0] entry, it is a text header\n        url = fileList[ii]\n        file_name = url.split('/')[-1]\n        file_name = file_name.split('?')[0]\n        file_name = file_name.split('%2B')[1]\n        u = urlopen(url)\n        f = open((dl_dir + file_name), 'wb')\n        meta = u.info()\n        file_size = int(meta.getheaders(\"Content-Length\")[0])\n        print \"Downloading: %s Bytes: %s\" % (file_name, file_size)\n\n        file_size_dl = 0\n        block_sz = 1024*1024\n        prior_percent = 0\n        while True:\n            buffer = u.read(block_sz)\n            if not buffer:\n                break\n            file_size_dl += len(buffer)\n            f.write(buffer)\n            status = r\"%10d  [%3.2f%%]\" % (file_size_dl, file_size_dl * 100. / file_size)\n            status = status + chr(8)*(len(status)+1)\n            if (file_size_dl * 100. / file_size) > (prior_percent+20):\n                print status + '\\n'\n                prior_percent = (file_size_dl * 100. / file_size)\n        f.close()\n\n# Check which files have been generated (only taking small files to avoid long times)\nmyNewFiles = API(('files?project=' + myProject.id))  # calling again to see what was generated\ndlList = [\"links to file downloads\"]\n\nfor ii, f_name in enumerate(myNewFiles.name):\n    # downloading only the summary files. Adapt for whichever files you need\n    if (f_name[-4:] == '.out'):\n        dlList.append(api_call(path=('files/' + myNewFiles.id[ii] + '/download_info'))['url'])\nT0 = timer.time()\ndownload_files(dlList)\nprint timer.time() - T0, \"seconds download time\"",
      "language": "python"
    }
  ]
}
[/block]
Good luck and have fun!

<div align="right"><a href="#top">top</a></div>