API Quickstart

This Quickstart Guide leads you through a simple RNA sequencing analysis. It uses the CGC API, but its steps parallel the Quickstart for the visual interface. We have written this example in Python, but the concepts can be adapted to your preferred programming language. We encourage you to try this analysis yourself, as an aid to creating a script for your own custom analysis. The documentation for the CGC API is available here.

Objective:

We will use the API to create a project, add files to it, add a workflow, create and run a task, then download the outputs.

📘

Scripts

An IPython notebook for the Python script built in this tutorial is available here: https://github.com/sbg/docs/blob/master/cgc/Cancer_Genomics_Cloud_API/CGC_API_quickstart.py

You can also see the Python script used in this Quickstart here: https://github.com/sbg/docs/blob/master/cgc/Cancer_Genomics_Cloud_API/CGC_API_quickstart.py

Requirements

To run the code in this tutorial, you will need:

  • Python version 2.7.x (Python 3.x is not 100% compatible with all the code used in this example)
  • The Python requests module. If you do not already have this, it can be installed via Python's package management system, pip: $ sudo pip install requests
  • The authentication token associated with your account, which you can get by going to the Developer Dashboard after logging into your account. Remember to keep your authentication token secure!
  • This project analyzes TCGA Controlled Data that is available on the CGC. To access the particular data file used here, you will need to have been awarded TCGA Controlled Data access through dbGaP.
  • We show three ways of adding the controlled data file to your project. You can choose your preferred method:
      • Find the file(s) you need with the Case Explorer and Data Browser. To learn more about this, follow the QuickStart for the CGC visual interface.
      • Copy the file(s) from another project that you are a member of using the CGC API.
      • Upload your own private data to analyze using the CGC command line uploader.

Preparatory work

To interact with the API, we send and receive data as JSON objects. Each JSON object received will represent one of the following:

  • A requested resource, or a list of resources held in the items field of the JSON object.
  • An error, accompanied by text detailing the error in the message field.

Most of the GET, POST and PATCH requests will only signal their success or failure by means of an HTTP status code in the response.
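
As a rough illustration of the response shapes described above, the sketch below sorts a decoded response into a list, an error, or a single resource. The helper name describe_response and the sample payloads are made up for this example; only the items and message fields come from the description above. It should run under both Python 2.7 and Python 3.

```python
import json

def describe_response(response_dict):
    """Return a short label for the kind of JSON object received."""
    if 'items' in response_dict:
        # a list of resources (files, projects, tasks, apps)
        return 'list of %d resource(s)' % len(response_dict['items'])
    if 'message' in response_dict:
        # an error, with the details in the message field
        return 'error: ' + response_dict['message']
    # anything else is treated as the description of one resource
    return 'single resource'

# hypothetical payloads, standing in for real API responses
listing = json.loads('{"items": [{"id": "a"}, {"id": "b"}]}')
error = json.loads('{"message": "Unauthorized"}')

print(describe_response(listing))
print(describe_response(error))
```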
First, we import the necessary Python libraries and define the names of our new project and the desired workflow.

🚧

In the code below, please replace the AUTH_TOKEN string with your authentication token!

# IMPORTS
import time as timer
from requests import request
import json
from urllib2 import urlopen
import os


# GLOBALS
FLAGS = {'targetFound': False,                  # target project exists in CGC project
         'taskRunning': False,                  # task is still running
         'startTasks': True                     # (False) create, but do NOT start tasks
        }
# project we will create in CGC (Settings > Project name in GUI)
TARGET_PROJECT = 'Quickstart_API'               
TARGET_APP = 'RNA-seq Alignment - STAR for TCGA PE tar' # app to use
INPUT_EXT = 'tar.gz'

# TODO: replace AUTH_TOKEN with yours here
AUTH_TOKEN = 'AUTH_TOKEN'

Since we are going to write the functions that interact with the API in Python, we'll prepare a function that converts the information we send into JSON and decodes the JSON we receive.

#  FUNCTIONS
def api_call(path, method='GET', query=None, data=None, flagFullPath=False):
    """ Translates all the HTTP calls to interface with the CGC

    This code adapted from the Seven Bridges platform API v1.1 example
    https://docs.sbgenomics.com/display/developerhub/Quickstart
    flagFullPath is novel, added to smoothly resolve pagination issues with the CGC API"""
    data = json.dumps(data) if isinstance(data, (dict, list)) else None
    base_url = 'https://cgc-api.sbgenomics.com/v2/'

    headers = {
        'X-SBG-Auth-Token': AUTH_TOKEN,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }

    if flagFullPath:
        response = request(method, path, params=query, data=data, headers=headers)
    else:
        response = request(method, base_url + path, params=query, data=data, headers=headers)
    response_dict = json.loads(response.content) if response.content else {}

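    if response.status_code // 100 != 2:
        # error responses normally carry the details in the 'message' field
        print response_dict.get('message', 'No error message returned')
        raise Exception('Server responded with status code %s.' % response.status_code)
    return response_dict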

def hello():	# for debugging
    print("Is it me you're looking for?")
    return True
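
To see what api_call puts on the wire without contacting the server, the following sketch builds the URL, headers, and JSON body in the same way and checks a status code for success. The names prepare_call and is_success, and the placeholder token, are assumptions made for this illustration; they are not part of the Quickstart script.

```python
import json

BASE_URL = 'https://cgc-api.sbgenomics.com/v2/'

def prepare_call(path, token, data=None, full_path=False):
    """Return (url, headers, body) as api_call would send them, without sending."""
    # only dicts and lists are serialized; anything else means "no body"
    body = json.dumps(data) if isinstance(data, (dict, list)) else None
    # a full path (for example a pagination link) is used as-is
    url = path if full_path else BASE_URL + path
    headers = {
        'X-SBG-Auth-Token': token,
        'Accept': 'application/json',
        'Content-type': 'application/json',
    }
    return url, headers, body

def is_success(status_code):
    """True for any 2xx HTTP status code."""
    return status_code // 100 == 2

# placeholder token, not a real credential
url, headers, body = prepare_call('projects', 'PLACEHOLDER_TOKEN',
                                  data={'name': 'Quickstart_API'})
print(url)
print(body)
```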

We will not only create objects but also need to interact with them, so in this demo we use object-oriented programming. We have created a class, API, defined below. Generally, the API calls will either return a list of things (e.g. myFiles is plural) or a very detailed description of one thing (e.g. myFile is singular). The appropriate structure is created automatically in the response_to_fields() method.

#  CLASSES
class API(object):
    # making a class out of the api() function, adding other methods
    def __init__(self, path, method='GET', query=None, data=None, flagFullPath=False):
        self.flag = {'longList': False}
        response_dict = api_call(path, method, query, data, flagFullPath)
        self.response_to_fields(response_dict)

        if self.flag['longList']:
            self.long_list(response_dict, path, method, query, data)

    def response_to_fields(self,rd):
        if 'items' in rd.keys():  
            # get * {files, projects, tasks, apps} (object name plural)
            if len(rd['items']) > 0:
                self.list_read(rd)
            else:
                self.empty_read(rd)
        else:           
            # get details about ONE {file, project, task, app}  
            #  (object name singular)
            self.detail_read(rd)

    def list_read(self,rd):
        n = len(rd['items'])
        keys = rd['items'][0].keys()
        m = len(keys)

        for jj in range(m):
            temp = [None]*n
            for ii in range(n):
                temp[ii] = rd['items'][ii][keys[jj]]
            setattr(self, keys[jj], temp)

        # 'and' short-circuits, so rd['links'] is only read when the key exists
        if 'links' in rd and len(rd['links']) > 0:
            self.flag['longList'] = True

    def empty_read(self,rd):  # in case an empty project is queried
        self.href = []
        self.id = []
        self.name = []
        self.project = []

    def detail_read(self,rd):
        keys = rd.keys()
        m = len(keys)

        for jj in range(m):
            setattr(self, keys[jj], rd[keys[jj]])

    def long_list(self, rd, path, method, query, data):
        prior = rd['links'][0]['rel']
        # Normally .rel[0] is the next, and .rel[1] is prior. 
        # If .rel[0] = prior, then you are at END_OF_LIST
        keys = rd['items'][0].keys()
        m = len(keys)

        while prior == 'next':
            rd = api_call(rd['links'][0]['href'], method, query, data, flagFullPath=True)
            prior = rd['links'][0]['rel']
            n = len(rd['items'])
            for jj in range(m):
                temp = getattr(self, keys[jj])
                for ii in range(n):
                    temp.append(rd['items'][ii][keys[jj]])
                setattr(self, keys[jj], temp)
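
The pagination idea behind long_list() can be tried out without network access. The sketch below is a simplified variation, not the class above: it looks for a link whose rel is 'next' anywhere in links (rather than only in position 0) and takes the page-fetching function as a parameter so that a stub can stand in for api_call. The function name collect_pages and the two sample pages are hypothetical.

```python
def collect_pages(first_page, fetch):
    """Follow 'next' links and merge every page's items into per-field lists.

    fetch is any callable that takes an href and returns the decoded page.
    """
    columns = {}
    page = first_page
    while True:
        for item in page.get('items', []):
            for key, value in item.items():
                columns.setdefault(key, []).append(value)
        next_links = [link['href'] for link in page.get('links', [])
                      if link.get('rel') == 'next']
        if not next_links:
            return columns          # no 'next' link: this was the last page
        page = fetch(next_links[0])

# two hypothetical pages standing in for real API responses
PAGES = {
    'page2': {'items': [{'id': 'f3', 'name': 'c.gtf'}],
              'links': [{'rel': 'prev', 'href': 'page1'}]},
}
page1 = {'items': [{'id': 'f1', 'name': 'a.tar.gz'},
                   {'id': 'f2', 'name': 'b.fasta'}],
         'links': [{'rel': 'next', 'href': 'page2'}]}

files = collect_pages(page1, PAGES.get)
print(files['name'])
```

Passing the fetch function in keeps the merging logic separate from the HTTP call, which makes it easy to check against canned pages before pointing it at the real API.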

Cancer Genomics Cloud API Quickstart

1. Create a project

All work on the CGC is carried out inside a project. For this task, we can either use a project that has already been created, or we can use the API to create one. Here we will create a new project, TARGET_PROJECT, which we set in the definitions above to be 'Quickstart_API'. However, since we first want to check that the named project doesn't already exist, we'll also GET a list of all the projects that you can access.

The project's name and description will also be sent in the call to create the project, and its billing_group will be set to your Pilot Funds billing group. Note that we set the project's tags to ['tcga'] to indicate that it contains Controlled Data.

if __name__ == "__main__":
    # Did you remember to change the AUTH_TOKEN?
    if AUTH_TOKEN == 'AUTH_TOKEN':
        print "You need to replace 'AUTH_TOKEN' string with your actual token. Please fix it."
        exit()
    # list all billing groups on your account
    billingGroups = API('billing/groups')
    # Select the first billing group, this is "Pilot_funds(USER_NAME)"
    print billingGroups.name[0], \
    'will be charged for this computation. Approximate price is $4 for example STAR RNA seq (n=1) \n'
 
    # list all projects you are part of
    existingProjects = API(path='projects')     # make sure your project doesn't already exist
 
    # set up the information for your new project
    NewProject = {
            'billing_group': billingGroups.id[0],
            'description': "A project created by the API Quickstart",
            'name': TARGET_PROJECT,
            'tags': ['tcga']
    }
 
    # Check to make sure your project doesn't already exist on the platform
    for ii,p_name in enumerate(existingProjects.name):
        if TARGET_PROJECT == p_name:
            FLAGS['targetFound'] = True
            break
 
    # Use the existing project if it was found, otherwise make a shiny, new one
    if FLAGS['targetFound']:
        # GET existing project details (we need them later)
        myProject = API(path=('projects/' + existingProjects.id[ii]))
    else:
        # POST new project; the response describes the project just created
        myProject = API(method='POST', data=NewProject, path='projects')
        # (re)list all projects, to check that new project posted
        existingProjects = API(path='projects')
        # GET new project details by its own id (we need them later),
        # rather than assuming the new project is first in the list
        myProject = API(path=('projects/' + myProject.id))
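
The find-or-create decision above can be separated from the API calls and checked on its own. In this sketch the helper names find_project_id and new_project_payload, and the sample names and IDs, are hypothetical; the payload fields are the ones used in this Quickstart.

```python
def find_project_id(names, ids, target):
    """Return the id of the project called target, or None if it is absent."""
    for name, project_id in zip(names, ids):
        if name == target:
            return project_id
    return None

def new_project_payload(name, billing_group_id):
    """Body for the POST that creates a project."""
    return {
        'billing_group': billing_group_id,
        'description': 'A project created by the API Quickstart',
        'name': name,
        'tags': ['tcga'],
    }

# hypothetical project listing
names = ['Open Data Project', 'Quickstart_API']
ids = ['user/open-data-project', 'user/quickstart-api']

existing_id = find_project_id(names, ids, 'Quickstart_API')
if existing_id is None:
    print('would POST: %s' % new_project_payload('Quickstart_API', 'billing-group-id'))
else:
    print('would GET projects/%s' % existing_id)
```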

2. Add files to the project

Here we have shown three different options for adding data to a project:

(a) Copy files from an existing project using the API
(b) Copy files from an existing project using the visual interface
(c) Add files using the API and command line uploader.

Follow one of these methods only.

(a) Copy files from an existing project via the API

Here we will take advantage of the project that you will have created if you followed the CGC Quickstart. If you haven't yet followed that tutorial, go and do that first. Then you will have a project named 'Quickstart' that contains files we can use for our analysis.

The following code lets us look for the three files from that project and copy them over to our current project, Quickstart_API.

if __name__ == "__main__":
    for ii,p_id in enumerate(existingProjects.id):
        # the comparison is case-sensitive; use the exact name of your source project
        if existingProjects.name[ii] == 'Quickstart':
            filesToCopy = API(('files?limit=100&project=' + p_id))
            break
 
    # Don't make extra copies of files
    # (loop through all files because we don't know what we want)
    # files currently in project
    myFiles = API(('files?limit=100&project=' + myProject.id))

    for jj,f_name in enumerate(filesToCopy.name):
        # Conditional is HARDCODED for RNA Seq STAR workflow
        if f_name.endswith((INPUT_EXT, 'sta', 'gtf')):
            if f_name not in myFiles.name:
                # file currently not in project
                api_call(path=(filesToCopy.href[jj] + '/actions/copy'), method='POST',
                         data={'project': myProject.id, 'name': f_name}, flagFullPath=True)
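
The selection rule used above (match one of the expected suffixes, skip anything already in the target project) can be written as a small pure function and tested before any copy request is sent. The function name files_to_copy is an assumption for this sketch, and notes.txt is a made-up file added to show a non-match.

```python
def files_to_copy(source_names, existing_names, suffixes):
    """Indices of source files that match a suffix and are not yet in the target."""
    existing = set(existing_names)
    return [index for index, name in enumerate(source_names)
            if name.endswith(tuple(suffixes)) and name not in existing]

source = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz',
          'ucsc.hg19.fasta',
          'human_hg19_genes_2014.gtf',
          'notes.txt']                      # hypothetical extra file
already_there = ['ucsc.hg19.fasta']

picked = files_to_copy(source, already_there, ('tar.gz', 'sta', 'gtf'))
print([source[index] for index in picked])
```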

(b) Copy files from an existing project using the visual interface

Again, this method takes advantage of the project that you will have created if you followed the CGC Quickstart. So, if you haven't yet followed that tutorial, go and do that first. Then you will have a project named 'Quickstart' that contains files we can use for our analysis.

To copy those files into your project 'Quickstart_API' using the CGC visual interface:

  1. Select the project 'Quickstart_API' that you have just created.
  2. Go to the Files tab, and click Add Files.
  3. On the left hand side, you will see a list of locations that you can add files from. Under projects you will see 'Quickstart'. Click that project's name.
  4. Select the checkboxes next to the files in that project. Then click Add to project.

(c) Upload local files using the API and the command line uploader

To use this option you need to have the CGC command line uploader installed already. Details of the uploader are available here. If you are using this script to call the uploader, make sure to set up your $AUTH_TOKEN.

You first need to find the IDs of your projects with:

bin/cgc-uploader.sh --list-projects

which will print:

9e710b4e-148e-414f-99b0-26cfbc316719    Quickstart_API
e56092a9-482d-44fc-a98d-825a3c90c5d2    Quickstart
431d4397-8b7e-4d35-bb74-47865750aead    Open Data Project

We will copy the first string since it matches our project name. Then, add the following to the Python script:

print "You need to install the command line uploader before proceeding"
ToUpload = ['G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz','ucsc.hg19.fasta','human_hg19_genes_2014.gtf']
for f_name in ToUpload:
    # -p takes the ID of the target project, as printed by --list-projects
    cmds = "cd ~/cgc-uploader; bin/cgc-uploader.sh -p 9e710b4e-148e-414f-99b0-26cfbc316719 " + \
        "/Users/digi/PycharmProjects/cgc_API/toUpload/" + f_name
    os.system(cmds)
del cmds

👍

File directory

In the example code above, /Users/digi/PycharmProjects/cgc_API/toUpload/ is the path of the directory containing files to upload. You should change this to the appropriate path on your own computer.
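
If you prefer not to assemble a shell string, the same uploader call can be expressed as an argument list. This is only a sketch: the helper name uploader_command and the directory /path/to/toUpload are placeholders, and the command is built but not executed here. The -p option and the bin/cgc-uploader.sh script are the ones shown above.

```python
import os

def uploader_command(project_id, directory, file_name):
    """Argument list for one uploader call.

    Pass the result to subprocess.call(cmd, cwd=<uploader install directory>).
    """
    return ['bin/cgc-uploader.sh', '-p', project_id,
            os.path.join(directory, file_name)]

# project ID taken from the --list-projects output; directory is a placeholder
cmd = uploader_command('9e710b4e-148e-414f-99b0-26cfbc316719',
                       '/path/to/toUpload', 'ucsc.hg19.fasta')
print(' '.join(cmd))
```

An argument list avoids quoting problems when a file name contains spaces or shell metacharacters.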

Now that your files are uploaded, it may be useful to set their metadata. For more information about metadata, please refer to the file metadata documentation page. Once the file is uploaded, we can use the API call to set the file metadata. For this, we need to know the ID number of the file we just uploaded; this is the number used to identify the file with the API. We can obtain the file ID by running the API call to list project files, which returns the names and IDs for all the files in the project.

👍

See the API overview for more information on referring to files, projects and other objects on the CGC.
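
Looking up a file's ID from the listing amounts to matching a name against the parallel name and ID lists. A minimal sketch, in which the helper name file_id_by_name and the IDs are invented for the example:

```python
def file_id_by_name(names, ids, wanted):
    """Return the ID listed alongside the file called wanted."""
    if wanted not in names:
        raise KeyError('No file named %s in this project' % wanted)
    return ids[names.index(wanted)]

# hypothetical listing, shaped like the name and id fields of a file list
names = ['ucsc.hg19.fasta', 'human_hg19_genes_2014.gtf']
ids = ['file-id-001', 'file-id-002']

print(file_id_by_name(names, ids, 'human_hg19_genes_2014.gtf'))
```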

Once we have the file's ID, we can move on to setting its metadata. This is done via the request PATCH /files/:file_id/metadata, replacing :file_id with the file's ID. We include the metadata we want to set in the body of the request, in the form of a JSON dictionary. Below is an example of how this is done (replace with appropriate metadata for your own files):

singleFile = api_call(path=myFiles.href[1], flagFullPath=True) 
# here we modify file #1, adapt appropriately
 
metadata = {           
     # this is made up metadata, adapt appropriately
    "name": singleFile['name'],
    "library":"TEST",
    "file_type": "fastq",
    "sample": "example_human_Illumina",
    "seq_tech": "Illumina",
    "paired_end": "1",
    'gender': "female",
    "data_format": "awesome"
}
  
api_call(path=(singleFile['href'] + '/metadata'), method='PATCH', \
         data = metadata, flagFullPath=True)

3. Get a copy of the correct public workflow

There are more than 150 public apps available on the CGC. Here we query all of them, then copy the target workflow, TARGET_APP, which we set earlier to be 'RNA-seq Alignment - STAR for TCGA PE tar'.

if __name__ == "__main__":    
    myFiles = API(('files?limit=100&project=' + myProject.id))   
    # GET files LIST, regardless of upload method
 
    # Add a workflow (copy it from another project or the public apps, 
    # not looping through all apps, we know exactly what we want)
    allApps = API(path='apps?limit=100&visibility=public')   
    # long function call, currently 183
    myApps = API(path=('apps?limit=100&project=' + myProject.id))
    if TARGET_APP not in allApps.name:
        print("Target app (%s) does not exist in the public repository. Please check the spelling" \
              % (TARGET_APP))
    else:
        ii = allApps.name.index(TARGET_APP)
        if TARGET_APP not in myApps.name:         
            # app not already in project
            temp_name = allApps.href[ii].split('/')[-2] # copy app from public repository
            api_call(path=('apps/' + allApps.project[ii] + '/' + temp_name + '/actions/copy'), \
                     method='POST', data={'project': myProject.id, 'name': TARGET_APP})
            myApps = API(path=('apps?limit=100&project=' + myProject.id))   # update project app list
    del allApps
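
The copy request above derives the app's short name from the second-to-last segment of its href. The sketch below isolates that step. The function name app_copy_path is an assumption, and the href and project are hypothetical values shaped so that the final segment is a revision number; check the href values returned for your own apps before relying on this.

```python
def app_copy_path(app_href, app_project):
    """Build the relative path used to copy an app, from its href and project."""
    short_name = app_href.split('/')[-2]    # second-to-last segment of the href
    return 'apps/' + app_project + '/' + short_name + '/actions/copy'

# hypothetical href and project for a public app
href = 'https://cgc-api.sbgenomics.com/v2/apps/admin/sbg-public-data/rna-seq-alignment-star/5'
print(app_copy_path(href, 'admin/sbg-public-data'))
```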

4. Build a file processing list for your analysis

It's likely that you'll only have one input file and two reference files in your project. However, if multiple input files were imported, the following code will create a batch of tasks -- one for each file. This code builds the list of files:

if __name__ == "__main__": 
    # Build .fileProcessing (inputs) and .fileIndex (references) lists [for workflow]
    FileProcList = ['Files to Process']
    Ind_GtfFile = None
    Ind_FastaFile = None
 
    for ii,f_name in enumerate(myFiles.name):
        # this conditional is for 'RNA seq STAR alignment' in   	
        # Quickstart_API. _Adapt_ appropriately for other workflows
        if f_name[-len(INPUT_EXT):] == INPUT_EXT:           # input file
            FileProcList.append(ii)
        elif f_name[-len('gtf'):] == 'gtf':
            Ind_GtfFile = ii
        elif f_name[-len('sta'):] == 'sta':
            Ind_FastaFile = ii
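
The same classification can be wrapped in a function that returns the input indices and the two reference indices, which makes it easy to try against a list of names. The function name build_processing_lists is hypothetical; the suffix rules mirror the loop above, and the file names are the ones used in this Quickstart.

```python
def build_processing_lists(file_names, input_ext='tar.gz'):
    """Split file indices into inputs and the .gtf / .fasta references."""
    inputs, gtf_index, fasta_index = [], None, None
    for index, name in enumerate(file_names):
        if name.endswith(input_ext):        # input archive
            inputs.append(index)
        elif name.endswith('gtf'):          # gene annotation reference
            gtf_index = index
        elif name.endswith('sta'):          # matches names ending in 'fasta'
            fasta_index = index
    return inputs, gtf_index, fasta_index

names = ['ucsc.hg19.fasta',
         'G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz',
         'human_hg19_genes_2014.gtf']
inputs, gtf_index, fasta_index = build_processing_lists(names)
print('%d input file(s)' % len(inputs))
```

Returning None for a missing reference lets the caller stop early instead of failing later when the task is built.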

5. Format, create, and start your tasks

Next we will iterate through the File Processing List FileProcList to generate one task for each input file.

if __name__ == "__main__":
    myTaskList = [None]
    for ii,f_ind in enumerate(FileProcList[1:]):
        # Start at 1 because FileProcList[0] is a header
        NewTask = {'description': 'APIs are awesome',
            'name': ('batch_task_' +  str(ii)),
            'app': (myApps.id[0]),    # ASSUMES only single workflow in project
            'project': myProject.id,
            'inputs': {
                'genomeFastaFiles': {   # .fasta reference file
                    'class': 'File',
                    'path': myFiles.id[Ind_FastaFile],
                    'name': myFiles.name[Ind_FastaFile]
                },
                'input_archive_file': {  # File Processing List
                    'class': 'File',
                    'path': myFiles.id[f_ind],
                    'name': myFiles.name[f_ind]
                },
                # .gtf reference file, !NOTE: this workflow expects a _list_ for this input
                'sjdbGTFfile': [
                    {
                        'class': 'File',
                        'path': myFiles.id[Ind_GtfFile],
                        'name': myFiles.name[Ind_GtfFile]
                    }
                ]
            }
        }
        # Create the tasks, run if FLAGS['startTasks']
        if FLAGS['startTasks']:
            # task created and run
            myTask = api_call(method='POST', data=NewTask, path='tasks/', query={'action': 'run'})
            myTaskList.append(myTask['href'])
        else:
            # task created but NOT run
            myTask = api_call(method='POST', data=NewTask, path='tasks/')
    myTaskList.pop(0)

    # count from the list itself; ii would be stale if there were no input files
    print("%i tasks have been created. \n" % (len(FileProcList) - 1))
    print("Enjoy a break, come back to us once you've got an email that tasks are done")
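
To inspect the task body before posting it, the dictionary can be produced by a small builder and printed as JSON. The helper names file_input and build_task and the sample IDs are assumptions for this sketch; the input names (genomeFastaFiles, input_archive_file, sjdbGTFfile) are the ones used in the task above.

```python
import json

def file_input(file_id, file_name):
    """One file reference in the form the task inputs use."""
    return {'class': 'File', 'path': file_id, 'name': file_name}

def build_task(index, app_id, project_id, fasta, archive, gtf):
    """Build one task body; fasta, archive and gtf are (file_id, file_name) pairs."""
    return {
        'description': 'APIs are awesome',
        'name': 'batch_task_' + str(index),
        'app': app_id,
        'project': project_id,
        'inputs': {
            'genomeFastaFiles': file_input(*fasta),
            'input_archive_file': file_input(*archive),
            'sjdbGTFfile': [file_input(*gtf)],   # this input is a list
        },
    }

# hypothetical IDs; real ones come from the file, app and project listings
task = build_task(0, 'user/quickstart-api/star', 'user/quickstart-api',
                  fasta=('id-fasta', 'ucsc.hg19.fasta'),
                  archive=('id-tar', 'G17498.TCGA-02-2483-01A-01R-1849-01.2.tar.gz'),
                  gtf=('id-gtf', 'human_hg19_genes_2014.gtf'))
print(json.dumps(task, indent=2, sort_keys=True))
```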

6. Check task completion

These tasks may take a long time to complete. Here are two ways to check in on them:

(a) Wait for email confirmation

No additional code is needed. An email will be sent to you with the status of your task when it completes.

(b) Poll task status

The following script will poll each task every 10 minutes until it has completed.

if __name__ == "__main__":
    # if tasks were started, check if they've finished
    for href in myTaskList:
        # check on one task at a time, if any running, can not continue (no sense to query others)
        print("Pinging CGC for task completion, will download files once all tasks completed.")
        FLAGS['taskRunning'] = True
        while FLAGS['taskRunning']:
            task = api_call(path=href, flagFullPath=True)
            if task['status'] == 'COMPLETED':
                FLAGS['taskRunning'] = False
            elif task['status'] == 'FAILED':  # NOTE: leave loop on ANY failure
                print "Task failed, can not continue"
                exit()
            else:
                # still running: wait 10 minutes before asking again
                timer.sleep(600)
7. Download files

It may be useful to quickly download some summary files to visualize the results.

Visualize files on the CGC visual interface

To visualize the files produced by your task:

  1. Log in to the CGC, and go to the Quickstart_API project
  2. Click on the Files tab and select the files produced by the task. Clicking on any file will bring up its metadata and an option to visualize it. There is also an option to download the file.

Download files via the API

You can do this by listing the files in your project again and iterating through that list to collect a download link for each file you want.

from urllib2 import urlopen
import os
  
def download_files(fileList):
    # download a list of files from URLs
    dl_dir = 'downloads/'
    if not os.path.isdir(dl_dir):   # make sure we have the download directory
        os.mkdir(dl_dir)
  
    for ii in range(1, len(fileList)):  # skip first [0] entry, it is a text header
        url = fileList[ii]
        file_name = url.split('/')[-1]
        file_name = file_name.split('?')[0]
        file_name = file_name.split('%2B')[1]
        u = urlopen(url)
        f = open((dl_dir + file_name), 'wb')
        meta = u.info()
        file_size = int(meta.getheaders("Content-Length")[0])
        print "Downloading: %s Bytes: %s" % (file_name, file_size)
  
        file_size_dl = 0
        block_sz = 1024*1024
        prior_percent = 0
        while True:
            buffer = u.read(block_sz)
            if not buffer:
                break
            file_size_dl += len(buffer)
            f.write(buffer)
            status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            status = status + chr(8)*(len(status)+1)
            if (file_size_dl * 100. / file_size) > (prior_percent+20):
                print status + '\n'
                prior_percent = (file_size_dl * 100. / file_size)
        f.close()
  
# Check which files have been generated (only taking small files to avoid long times)
myNewFiles = API(('files?project=' + myProject.id))  # calling again to see what was generated
dlList = ["links to file downloads"]
 
for ii, f_name in enumerate(myNewFiles.name):
    # downloading only the summary files. Adapt for whichever files you need
    if (f_name[-4:] == '.out'):
        dlList.append(api_call(path=('files/' + myNewFiles.id[ii] + '/download_info'))['url'])
T0 = timer.time()
download_files(dlList)
print timer.time() - T0, "seconds download time"
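
The file name handling in download_files() assumes that each download URL contains '%2B' (a URL-encoded '+') before the actual file name. The sketch below performs the same extraction but falls back to the whole last path segment when '%2B' is absent, and it removes the query string first so that a '/' inside the query cannot confuse the split. The function name and both URLs are hypothetical.

```python
def file_name_from_url(url):
    """Derive a local file name from a download URL."""
    path = url.split('?')[0]            # drop the query string first
    name = path.split('/')[-1]          # last path segment
    parts = name.split('%2B')           # '%2B' is a URL-encoded '+'
    return parts[1] if len(parts) > 1 else name

# hypothetical download URLs
with_prefix = 'https://example.org/bucket/abc123%2Bsample.Log.final.out?Expires=1&Signature=x/y'
without_prefix = 'https://example.org/bucket/sample.Log.out?Expires=1'

print(file_name_from_url(with_prefix))
print(file_name_from_url(without_prefix))
```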

Good luck and have fun!