Cancer Genomics Cloud

Comprehensive tips for reliable and efficient analysis set-up

Objective

We have prepared this guide to help you get up and running with your first set of projects on the CGC. Each section has specific examples and instructions to demonstrate how to accomplish each step. We also highlighted potential stumbling blocks so you can avoid them as you get set up. If you need more information on a particular subject, our Knowledge Center has additional information on all of the features of the CGC. Additionally, our support team is available 24/7 to help!

Helpful terms to know

Tool / App (interchangeably used) – refers to a bioinformatics tool or its Common Workflow Language (CWL) wrapper that is created or already available on the platform.

Workflow / Pipeline (interchangeably used) – denotes a number of apps connected together in order to perform multiple analysis steps in one run.

Task – represents an execution of a particular app or workflow on the platform. Depending on what is being executed (app or workflow), a single task can consist of only one tool execution (app case) or multiple executions (one or more per each app in the workflow).

Job – this term essentially corresponds to the "execution" in the "Task" definition above. It represents a single run of a single app within a workflow. If you come from a computer science background, you'll notice that this is close to the common understanding of the term "job" (wikipedia), except that here a job is a component of a larger unit of work called a task, rather than the other way around, as may be the case in other contexts. To illustrate what a job means on the platform, we can use the View stats & logs panel (button in the upper right corner of the task page), where jobs can be visually inspected after the task has been executed:

Example run of RNA-Seq Quantification
Figure 1. The jobs for an example run of RNA-Seq Quantification (HISAT2, StringTie) public workflow.

The green bars under the gray ones (apps) represent the jobs (Figure 1). As you can see, some apps (e.g. HISAT2_Build) consist of only one job, whereas others (e.g. HISAT2) contain multiple jobs that are executed simultaneously.

User Accounts & Billing Groups

Setting up a Cancer Genomics Cloud (CGC) account is free of charge - we encourage researchers to register here and try out one of the most advanced genomics analysis environments available.

Upon registration, each user receives $300 in credits in a "Pilot Fund" billing group to get started. Once those credits have been consumed, any user can create a paid billing group that is associated with a credit card or purchase order.

The work on CGC is divided into separate workspaces called projects (Figure 2). Each user can create projects and invite other registered users to be members of those projects. In addition, a variety of permission levels (e.g. admin, write, read-only) can be assigned to each project member.

Project organization structure
Figure 2. Project organization structure.

The user who creates the project sets the billing group to which the project's storage and compute costs are billed; this billing group can be changed at any time.
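
If you prefer to script this setup, the same steps can be performed through the API (covered in the Appendix). The following is a minimal sketch using the sevenbridges-python library, assuming you have already generated an authentication token; the project name, collaborator username, and permission values are placeholders for illustration.

import sevenbridges as sbg

# Authenticate against the CGC API (token generation is described in the Appendix)
api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token='<INSERT_TOKEN>')

# Pick the billing group that will cover the project's storage and compute costs
billing_group = list(api.billing_groups.query().all())[0]

# Create a new project tied to that billing group
project = api.projects.create(name='My first CGC project',
                              billing_group=billing_group.id)

# Invite a registered user and assign project permissions (username is a placeholder)
project.add_member(user='collaborator_username',
                   permissions={'write': True, 'execute': True, 'admin': False})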

Further reading

Sign up for the CGC:
https://docs.cancergenomicscloud.org/docs/sign-up-for-the-cgc

Before you start:
https://docs.cancergenomicscloud.org/docs

Access via the visual interface:
https://docs.cancergenomicscloud.org/docs/access-via-the-visual-interface

Projects on the CGC:
https://docs.cancergenomicscloud.org/docs/projects-on-the-cgc

Add a collaborator to a project:
https://docs.cancergenomicscloud.org/docs/add-a-collaborator-to-a-project

Change the billing group for a project:
https://docs.cancergenomicscloud.org/docs/modify-project-settings

Account settings and billing:
https://docs.cancergenomicscloud.org/docs/account-settings-and-billing

Tips for Running Apps/Workflows

In this section you will find some of the most essential information on how to handle apps and workflows on the platform. Creating CWL wrappers is out of scope here; instead, we focus on the platform features that are fundamental to making your analysis efficient and scalable. The content starts with information helpful to researchers who want to stick to the public apps and workflows, and then gradually advances toward features that are more useful when adjusting existing tools or developing new ones.

Start with the descriptions

Before executing a public app or workflow (copied to your project from the CGC Public Apps Gallery), the most important step is to thoroughly read the description. Even if you are familiar with the tools you are using, or regularly run similar tools on your local machine, the description is where you will find useful information on the analysis design, expected duration, expected cost, common issues, and more. The Common Issues and Important Notes section of the description deserves the closest study, as it describes known pitfalls you may come across when running the tool; it is especially important when planning a large scale analysis.

Although we constantly work to improve the public apps and prevent users from using them incorrectly, it is impossible to cover every case that can cause a failure. Here is an example of how overlooking the description notes caused unexpected outcomes for a researcher in the past:

Bioinformatics tools are usually designed to fail if certain inputs are configured incorrectly, and therefore terminate the misconfigured execution as soon as possible. In the case of the VarDict variant caller, however, failing to provide the required input index files won't always result in a failure, but can sometimes cause the task to run indefinitely.

This exact outcome occurred when a user tried to process a number of samples using the VarDict Single Sample Calling workflow. The user had previously analyzed a whole cohort of samples with this workflow without any issues, and then tried to apply the same strategy to another dataset in another project. When setting up the workflow in the new project, the user copied the reference files over and initiated the workflow with the new batch of files. Unfortunately, the reference genome index file (FAI file) was left out during the copying step, which produced several tasks that ran indefinitely. Since the index files are automatically loaded from the project rather than referenced explicitly as separate inputs, this issue was even more likely to go unnoticed.

The situation was additionally compounded by having a batch task (see the Batch Analysis section for more details) which contained hundreds of child tasks. Because the user wasn't familiar with the related notes from the workflow description, they were not aware that the running tasks might in fact be hanging. Had they known, they would have been able to terminate the run earlier, saving time and money.

In this case, the majority of the tasks failed early in the execution, producing minimal charges, but a few frozen tasks doubled the costs expected for the entire batch. We have since addressed the issues in this particular tool by introducing additional checks and input validations in the wrapper. Even though this is an extreme example, it demonstrates how descriptions can be critical to your tool working properly and can help you make informed decisions.

Test the workflow

Another way to build a foundation for successful large scale analysis is to first test the workflow on a small number (1-5) of runs. By testing your tools on a small scale, you can quickly make sure that everything will work as expected, while keeping costs to a minimum. Additionally, testing the pipeline against the worst case scenario is one of the best ways to gain insight into potential edge cases and know what to expect. This can often be accomplished by running the analysis on the largest files in your dataset. In other situations, the highest computational load will correlate with the complexity of the sample content, which is more difficult to estimate. Familiarity with the experimental design behind your sample files helps you create proper test cases, for both size and complexity, which can keep potential errors to a minimum.

Specify computational resources

The term "computational instance", or simply "instance", refers to a virtual machine whose CPU and memory capacities can be chosen to suit a particular application. Instances can be selected from the predefined list of Amazon EC2 or Google Cloud instance types.

For public apps, the instance resources have been pre-tested and defined by the Seven Bridges team. In most cases, the pre-defined instance types and default parameters will work for the majority of workflows, but you may need to optimize for certain input sizes or complexities, especially when scaling up your workflow.

Small-scale testing can help inform whether or not you need to tweak the instance type(s) or specific tool parameters you’ll use throughout the analysis. There are two means by which instance selection is controlled:

  1. Resource parameters
  2. Instance selection

For the public apps and workflows, resource parameters are usually adjustable through the Memory per job and CPU per job parameters. Setting these values will cause the scheduling component to allocate the optimal instance type for them. The following is an example from the task page:

An example of a task containing the two resource parameters
Figure 3. An example of a task containing the two resource parameters. The value for the memory is in megabytes.

In this particular case, a c5.4xlarge instance (32.0 GiB, 16 vCPUs) was automatically allocated based on the two highlighted resource parameters (Figure 3). If we examine the other instance types that could fit this requirement, we may note that c4.4xlarge comes with 30.0 GiB and 16 vCPUs, values closer to those requested. However, the platform's scheduling component also takes price into account, and since c5.4xlarge is slightly cheaper than c4.4xlarge, it was given precedence.

Another way to control the computational resources is to explicitly choose an instance type from the Execution Settings panel on the task page:

Selecting an instance type on the task page
Figure 4. Selecting an instance type from the drop-down menu on the task page.
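
The same choice can also be made when creating tasks through the API (see the Appendix for setup). The snippet below is a sketch, assuming the task-creation call accepts an execution_settings argument with an instance_type key; the project slug, app ID, file name, and input ID are placeholders.

import sevenbridges as sbg

api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token='<INSERT_TOKEN>')

project_id = 'username/project_name'                 # placeholder project slug
app_id = project_id + '/app_name/revision_number'    # placeholder app ID (copy from the app URL)

# Find the input file by name (placeholder file name)
input_file = [f for f in api.files.query(project=project_id, limit=100).all()
              if f.name == 'sample_1.fastq'][0]

task = api.tasks.create(
    name='Run on an explicitly chosen instance',
    project=project_id,
    app=app_id,
    inputs={'in_reads': [input_file]},               # input ID depends on the app's wrapper
    # Request the instance type directly instead of relying on the scheduler
    execution_settings={'instance_type': 'c5.4xlarge'},
    run=False                                        # keep it as a draft so settings can be reviewed
)
print(task.status)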

Learn about Instance Profiles

To get the most out of your computational resources, it is essential to know how your tools behave in different scenarios, e.g. with different input parameters or different data file sizes. One of the most efficient ways to catch these patterns and use them to optimize future analyses is to monitor resource profiles.

The following is an example of the figures accessible from the Instance metrics panel on the stats & logs page (the View stats & logs button is located in the upper right corner of the task page):

Instance metrics
Figure 5. An example of instance profiles when running STAR 2.5.4b on Amazon c4.8xlarge (36 CPUs, 60GB of memory).

To learn more about Instance metrics, visit the related documentation page in our Knowledge Center.

Scale up with Batch Analysis

Batch analysis is an important feature when analyzing TCGA or other large cohorts. We use batch analysis to create separate, analytically identical tasks for each given item (either a file or a group of files, depending on the batching method).

Two batching modes are available:

  1. Batch by File
  2. Batch by File metadata

In the Batch by File scenario, a task is created for each file selected on the batched input. In the Batch by File metadata scenario, a task is created for each group of files that share a specified metadata value (Figure 6).

Batch by metadata
Figure 6. When File metadata is selected from the dropdown menu on the task page, a pop-up window is opened where users can select the metadata they want to batch by. Here, variant calls from two different trio families are used and by selecting Family ID metadata (a field added for demonstration purposes), two tasks are created automatically.

Batch analysis is quite robust to errors or failures, in that a failure in one task will not affect the other tasks within the same batch. In other words, if we process multiple samples using batch analysis, each sample gets its own task; if one task fails, the other tasks continue to run, and the analysis of most samples can still complete. Alternatively, we can process multiple samples using another feature called Scatter (learn more on the following page) and have all the samples processed within a single task. In contrast to batch analysis, a failure of any part of a scatter task (i.e. the job for an individual sample) will cause the entire task, and thus the analysis for all samples, to fail. In summary, batch analysis greatly simplifies the creation of a large number of identical tasks while allowing independent, simultaneous executions.
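
Batch tasks can also be created through the API (see the Appendix for setup). The sketch below assumes the batch_input and batch_by arguments of the task-creation call; the project slug, app ID, input IDs, and the family_id metadata field (the demonstration field from Figure 6) are placeholders.

import sevenbridges as sbg

api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token='<INSERT_TOKEN>')

project_id = 'username/project_name'                 # placeholder project slug
app_id = project_id + '/app_name/revision_number'    # placeholder app ID

# Collect the files to batch over (placeholder extension)
vcf_files = [f for f in api.files.query(project=project_id, limit=100).all()
             if f.name.endswith('.vcf')]

# Batch by File: one child task per file on the batched input.
# For Batch by File metadata, use e.g.
#   batch_by={'type': 'CRITERIA', 'criteria': ['metadata.family_id']}
# to create one child task per group of files sharing a metadata value.
batch_task = api.tasks.create(
    name='Batch over VCF files',
    project=project_id,
    app=app_id,
    inputs={'input_variants': vcf_files},            # input ID depends on the app's wrapper
    batch_input='input_variants',
    batch_by={'type': 'ITEM'},
    run=False                                        # review the draft before running
)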

Important: Before starting large scale executions with Batch Analysis, please refer to the “Computational Limits” section below.

Parallelize with Scatter

Many common analyses will benefit from highly parallel executions, such as analyzing files from multiple TCGA disease types. The first factor that will determine the parallelization strategy is whether you want to execute the same tool multiple times within one computational instance, or use multiple, independent instances to run the whole analysis in parallel for different input files (i.e. batch analysis). The feature that enables the former is called Scatter.

How scatter affects resource utilization
Figure 7. Simplified schematic explaining how scatter affects resource utilization. Green bars represent CPUs in use. If we run only one job that requires one CPU, the other CPUs will remain unused (left), whereas if we run 8 such jobs, all CPUs will be utilized (right). The cost in both scenarios will be the same.

The following is a simple example to accompany Figure 7.

SBG Decompressor is a simple app used for extracting files from compressed bundles. It requires only one CPU and 1000 MB of memory to run. By default, the platform chooses the Amazon c4.2xlarge instance since the tool's CPU and memory requirements fit within its capacities. This means that if you have only one file to decompress, most of the resources on the c4.2xlarge instance will remain unused. However, if you have multiple files to process, the best way to do so is to use the Scatter feature and thereby utilize the additional instance resources. By design, scattering creates a separate job (green bar) for each file (or set of files, depending on the input type) provided on the scattered input.

Another common application of the scatter feature is using the HaplotypeCaller tool to call SNPs and indels across large genomic regions (e.g. the entire chromosome set). Instead of searching through the whole genome, the HaplotypeCaller app has an option that allows specific genomic regions (i.e. intervals) to be defined, within which variants will be called. With this in mind, it is possible to set up the processing so that each chromosome is run independently of the others. This is a great example of a task that benefits from scattering, as you can run the tool in parallel across all chromosomes. By choosing the "include_intervals" input in the Scatter box (Figure 8), the tool will scatter over the genomic intervals input, resulting in an independent job for each provided interval file (usually a BED or TXT file containing the interval name along with the corresponding start and end genomic coordinates).

Setting a scattered input
Figure 8. To set scattered input, double click on the app of interest and select Step in the right-hand panel. The input you want to scatter can be selected under the “Scatter” section. Additionally, if that particular input is connected directly to the input node (there are no preceding apps in between), the input node needs to be modified (also by double clicking on the node) to accept an array of objects originally defined (e.g. array of files instead of one file). Once the input node is adjusted, the connection will turn orange. Here, GATK4 HaplotypeCaller input include_intervals is scattered, meaning that the independent job will be executed for each file provided on this input.

There are situations where we want to create one job per group of input files, and for those we need a way to bin the input files into grouped items. A good example is when an input sample has paired-end reads and we want to create a job per sample, not per read file. In this case, the SBG Pair by Metadata app (available as a public app) can be used to create the group for each pair; this app should be inserted upstream of the tool on which you apply scatter in your workflow.

Recalling the comparison between Batch Analysis and Scatter from the previous section, these features may appear to be alternatives to one another. Even though they serve somewhat similar purposes, Scatter and Batch Analysis are mostly used as complementary features – they can be applied to the same analysis, as in the HaplotypeCaller example we have just discussed. The workflow already contains the HaplotypeCaller app scattered across intervals; if we run the workflow and batch by the input_bam_cram input, we create an individual task for each BAM/CRAM file provided on that input.

Configuring default computational resources

In a previous section, we reviewed controlling computational resources from the task page, where you can choose an instance for the overall analysis on the fly. This may be exactly what you need in many cases; however, customizing the workflow for a specific scenario or optimizing it with Scatter will probably require setting default instance types. This can be configured from the app/workflow editor by using the instance hint feature. This feature allows you to 1) select the appropriate resources for individual steps (i.e. apps) in a workflow, and 2) define the default instance for the entire analysis (either a single app or a workflow).

The creator of the analysis can choose to set the default instance if they are confident that their analysis will work well on a particular instance and they don’t want to rely on the user settings or the platform scheduler. However, this is not mandatory – if the default instance is not set, the user can select the instance for the analysis from the task page (as described above), or the user can leave it to the platform scheduler to pick the right instance based on the resource requirements for the individual app(s).

If you are looking to set an instance hint (i.e. choose a specific instance type) for a particular app in the workflow, you can do so by entering editing mode, double-clicking on the node, and selecting the Set Hints button under the Step tab as follows:

Setting instance hints
Figure 9. Setting instance hints for a particular app in the workflow.

After that, a configuration pop-up window will appear where you can choose instance types. The same goes for managing instance hints for the whole workflow; the only difference is that the pop-up window is reached through the drop-down menu in the upper right corner of the editor, as shown in Figure 10:

Setting instance hints for the whole workflow
Figure 10. Setting instance hints for the whole workflow.

Finally, to set the instance hints for an individual app, edit the app and scroll down to the HINTS section shown in Figure 11:

Setting instance hints within apps
Figure 11. Setting instance hints within apps.

Now, recalling the resource parameters from earlier, one may ask: "What happens if resource parameters and an instance hint are both set within the same app?" In such cases, the scheduler prioritizes the instance hint and allocates the instance based on that information. However, the resource parameters needed for one job are not completely ignored. If the workflow is set up so that multiple jobs can be executed in parallel on the same instance (i.e. the Scatter feature is used), resource parameters are also taken into consideration: they are not used for instance allocation in that case, but they determine how many jobs can be "packed" together for a simultaneous run on the given instance. The scatter feature that enables this is explained in detail in a previous section.

Further analysis and interpretation of your Results

On the CGC, you can further analyze your data interactively in Python notebooks or R by using the Data Cruncher feature. With the Data Cruncher, you can easily access the files in your project from an R- or Python-based environment, without downloading them to your local machine, and with the added flexibility of choosing computational resources.

Getting started

To access the Data Cruncher:

  1. Navigate to the Interactive Analysis tab in the upper right corner (Figure 12).
  2. Click Open on the Data Cruncher card (Figure 12).
  3. After that, you will land on a page where you can click the Create your first analysis button, which opens the analysis setup dialog box (Figure 13).
Navigating to the Data Cruncher
Figure 12. Navigating to the Data Cruncher.

There are two computing environments available (Figure 13):

  • JupyterLab
  • RStudio

In this manual we focus only on the JupyterLab environment. To learn more about RStudio, check out the Data Cruncher documentation.

Regarding resources, there are several Amazon Web Services (AWS) or Google Cloud instance types to choose from (which provider depends on the Location preference set during project creation). In addition, you can configure the Suspend time as a safety mechanism that automatically terminates the instance after a period of inactivity. This feature can also be turned off completely if you plan a longer session and do not want to risk termination while you focus on something else waiting for the results. We recommend that you carefully check the Location and Suspend time settings before you start the analysis, especially if you are running an existing one which already has its own configuration (Figure 14).

After you set up the preferences, you can click on the Start the analysis button to spin up the instance. It may take several minutes for your instance to initialize.

Setting up an analysis in Data Cruncher
Figure 13. Setting up the interactive analysis in Data Cruncher.
Analysis settings panel
Figure 14. Analysis settings panel. The computational requirements can be edited when the analysis is inactive.

JupyterLab environment

Once the instance is ready, access the editor by clicking the Open in editor button on the analysis page (Figure 15).

Launching the editor
Figure 15. Launching the editor.

From there, you can select the environment of your choice and move forward with your analysis (Figure 16):

JupyterLab landing page
Figure 16. JupyterLab landing page. The list of analysis files on the left is still empty.

Accessing the files

The files you can analyze within the notebook are:

  • files present in the analysis – either files uploaded directly in the analysis workspace (your home folder for your interactive analysis) or files produced by the interactive analysis itself, and
  • files present in the project

The list of files available in the analysis is displayed in the left-hand panel under the Files tab. This is a list of items in the /sbgenomics/workspace directory, which is the default directory for any work that you do during your session. To control the content in this directory (create new folder, upload files etc.), you can use the workspace toolbar located above this panel (Figure 16).

However, you are likely interested in using an interactive analysis to access data from your project. There are two ways to get the path of a file found within your project:

  1. From the GUI – by clicking on the file you want to analyze in the Project Files tab (this action copies the path to the clipboard) and pasting the path into the notebook (Figure 17)
  2. By listing all the files in the /sbgenomics/project-files directory and choosing the one you are interested in (see cell [4] in Figure 17).

This path is the same across different projects. Therefore, if you copy the analysis to another project and have referenced your file this way, it will work the same way in the new project.
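
For example, a Python 3 notebook cell can list the project files and read one of them directly from the read-only project directory; the file name below is a placeholder.

import os
import gzip

PROJECT_FILES = '/sbgenomics/project-files'   # read-only files copied to the project
WORKSPACE = '/sbgenomics/workspace'           # default working directory of the analysis

# List everything available from the project
print(os.listdir(PROJECT_FILES))

# Read a file directly, without downloading it first (placeholder file name)
counts_path = os.path.join(PROJECT_FILES, 'sample.htseq.counts.gz')
with gzip.open(counts_path, 'rt') as handle:
    for i, line in enumerate(handle):
        print(line.rstrip())
        if i == 4:                            # show only the first few lines
            break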

Python 3 Notebook example
Figure 17. Python 3 Notebook example.

As you may have noticed in the previous example, we used "!" in two notebook cells to switch from the Python interpreter to the shell, denoting that these cells should be executed as shell commands. If you need to use the shell more intensively, e.g. for installations, the notebook environment can become impractical. Fortunately, there is an option to mitigate this: you can open a terminal from the launcher page (Figure 16).

Saving the created files

Finally, when you are done with the analysis and want to save the results to your project, go to the Files tab, right-click on the file(s) and select Save To Project. Alternatively, files can be saved to the project by copying them to the `/sbgenomics/output-files` directory from the terminal. It is important to note that only smaller files (e.g. .ipynb files) will continue to live in the analysis after it has stopped, so all other needed files must be saved before you terminate the session.
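
The same copy can be done from a notebook cell as well; a minimal sketch (the result file name is a placeholder):

import shutil

# Files left in the workspace are discarded when the analysis stops;
# copy the ones you want to keep to the output directory so they are saved to the project
shutil.copy('/sbgenomics/workspace/results.csv', '/sbgenomics/output-files/')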

Further reading

Data Cruncher documentation:
https://docs.cancergenomicscloud.org/docs/about-data-cruncher

About publicly available interactive analyses:
https://www.sevenbridges.com/data-cruncher-public-interactive-analyses/

YouTube demo:
https://www.youtube.com/watch?v=LP53HNyG9gs&t=6m35s

Data Browser Essentials

If you are coming to the CGC as a collaborator, it is likely that you are interested in examining some of the most popular public cancer datasets, such as The Cancer Genome Atlas (TCGA). To deal with huge numbers of samples and files which are available within these datasets (TCGA has over 314,000 files), the Seven Bridges team has developed the Data Browser feature. The Data Browser enables users to visually explore, query and filter data by its metadata attributes.

Quickstart

To get started with the Data Browser, go to the top navigation bar and choose Data Browser. Next, you will be prompted to select the dataset of interest, or multiple datasets if you want to look across several at once.

Accessing Data Browser
Figure 18. Accessing Data Browser and choosing dataset(s) to explore.

From here, you can proceed to query the dataset(s), explore previously saved queries, or check out the example queries (Figure 19). By typing in some of the known keywords in the search bar (e.g. Lung Adenocarcinoma, miRNA-seq, etc.) you can quickly filter through the data and get your first query building blocks.

Query panel page in Data Browser
Figure 19. Landing on the query panel page.

In the following section, we will focus on understanding how queries can be built from scratch and how we can get exactly what we need by using the Data Browser.

Data Browser Querying Logic

While the Data Browser is an effective tool for getting the files you are interested in, a few aspects can cause confusion initially. Different arrangements of the same query blocks may produce different results, which can be unexpected. The querying logic may seem confusing and overwhelming at first, but we have prepared a few examples to clarify the most essential functions. All examples use the TCGA GRCh38 dataset and demonstrate querying from the GUI. So, let's dive in!

We will start with a simple, intuitive example:

Data Browser example 1 - Serial connection
Figure 20. Example 1 – Serial connection.

Now, everyone will likely agree on what this particular query (Figure 20) represents. Let's translate this "diagram language" into words. The result of this query would be:

“Find all the cases such that each case has at least one primary tumor sample which has at least one of EITHER RNA-Seq OR WXS experimental strategy file, then find the corresponding samples and the corresponding files.”

Sounds good for now, right?

Let’s examine a slightly different example:

Data Browser example 2 – Parallel File blocks
Figure 21. Example 2 – Parallel File blocks.

If we try to come up with the statement for this query, it would be something like this:

“Find all the cases such that each case has at least one primary tumor sample which has at least one RNA-Seq AND at least one WXS experimental strategy file, then find the corresponding samples and the corresponding files.”

It is a bit unusual for this parallel connection to represent the "AND" operation; for contrast, keep in mind that the "OR" operation is represented by the serial connection in the previous example. Correspondingly, the number of files returned is greater in the first example, which uses the "OR" operation, than in the second, which uses the "AND" operation.

Let’s examine a third option for a similar query:

Data Browser example 3 – Parallel Sample blocks
Figure 22. Example 3 – Parallel Sample blocks.

This query represents the following statement:

“Find all the cases such that each case has at least one primary tumor sample which has at least one RNA-Seq experimental strategy file AND at least one primary tumor sample which has at least one WXS experimental strategy file, then find the corresponding samples and the corresponding files.”

This query is less intuitive, but if we follow the same logic as in the previous simpler queries, we can still reach the correct interpretation. Supporting evidence comes from the returned counts below the graphs (Figures 21 and 22), which are very similar. The reason is that the majority of these samples do contain at least one file from each given category. To be precise, there is only one case among all cases returned by the third query that doesn't have both WXS and RNA-Seq experimental strategies in the same primary tumor sample.

Now that we understand Data Browser querying in detail, we may use this logic to create a query which is a very common starting point for many bioinformatics analyses: finding matched tumor and normal samples.

The following is an example using TCGA Breast Invasive Carcinoma cohort (TCGA BRCA):

Querying matched tumor/normal TCGA BRCA cohort
Figure 23. Querying matched tumor/normal TCGA BRCA cohort.

This query represents the following statement:

“Find all the cases such that each case has Breast Invasive Carcinoma (TCGA-BRCA) AND at least one primary tumor sample which has at least one WXS experimental strategy file in BAM format AND at least one solid tissue normal sample which has at least one WXS experimental strategy file in BAM format, then find the corresponding samples and the corresponding files.”

Simply put: each case needs to fulfill all three conditions in order to be pulled out (hence three branches).

It is important to mention that when copying files into your project, BAI index files are copied together with the BAM files automatically, so you do not need to worry about finding the associated index files. Also, as explained in subsequent sections, you will not be billed for keeping these public files in your project; however, storage for any derived files (e.g. FASTQ files derived from BAM files) will be charged.

Computational Limits

Before you decide to run a large number of simultaneous tasks, it is good to take a moment to learn more about platform limitations.

Because the Amazon account that we use for the Cancer Genomics Cloud comes with a limit on the number of instances that can be allocated at the same time, we impose a restriction on the maximum number of running tasks per user. This secures the availability of computational resources on the platform at any given time and ensures that users running highly parallelized analyses do not block all other users. The limit is currently set to 80 parallel tasks. In other words, if you create a project and trigger a batch analysis containing 100 tasks, 80 tasks will run while the remaining 20 stay in the "QUEUED" status until some of the 80 tasks finish.

If you need to run at a higher throughput, please reach out to our team so we can work together to find the best solution.

Storage Pricing

When you copy files from the CGC-hosted datasets, such as TCGA or TARGET, to your project, you will not be charged for storing them. The files from these publicly available datasets are located in a cloud bucket maintained by the NCI's Genomic Data Commons (GDC). These files are only referenced in your project, while the physical files remain in the GDC.

A similar logic applies to all files: if you copy a file from one project to another, the file is only referenced in the second project, without creating a physical copy and without additional storage expenses. This also means that if you create multiple copies and then delete the original file, the copies will remain and the file will continue to accrue storage costs until ALL copies are deleted.

In addition to computation, you are charged for storing derived files and any files you upload, for as long as you keep those files in your projects. To minimize storage costs, it is recommended to review your files after each project phase is completed and discard any that are not needed for future analysis. Note that intermediate files can always be recreated by running the analysis again, and the cost of recreating them can sometimes be less than the cost of storing them. As noted in the previous paragraph, multiple copies of a file are charged only once; consequently, when deleting files that have copies in other projects, you will continue to accrue storage expenses until all copies are removed.

If you have questions about moving your files to manage storage costs, please contact us at [email protected] to learn more about streamlining the process. For more information about storage and computational pricing, visit our Knowledge Center.

Appendix

Using the API

There are numerous use cases in which you might want to have more control over your analysis flow, or simply want to automate custom, repetitive tasks. Even though there are a lot of features enabled through the platform GUI, certain types of actions will benefit from using the API.

For the purpose of this manual, we provide a handful of Python code snippets that have proven useful for mitigating the most frequent obstacles when running large scale analyses on the Cancer Genomics Cloud.

Before you proceed, note that all examples are written in Python 3.

Install and configure

To use the Seven Bridges API, first install the sevenbridges-python package:

pip install sevenbridges-python

After that, you can play around with some basic commands to get a sense of how the Seven Bridges API works. To initialize the sevenbridges library, you'll need to generate your authentication token and plug it into the following code:

import sevenbridges as sbg
api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2', token='')

Managing tasks

Here is a simple snippet for fetching all tasks from a given project and printing status metrics:

import collections

# First you need to provide project slug from its URL:
test_project = 'username/project_name'

# Query all the tasks in the project
# Query commands may be time sensitive if there is a huge number of expected items
# See explanation below
tasks = api.tasks.query(project=test_project).all()
statuses = [t.status for t in list(tasks)]

# Counter tallies how many tasks are in each status
counter = collections.Counter(statuses)
for key in counter:
    print(key, counter[key])

The previous code should result in something like this:

ABORTED 1
COMPLETED 8
FAILED 1
DRAFT 1

This is a good place to pause for a moment and explain API rate limits.

The maximum number of API requests, i.e. calls performed on the api object, is 1,000 requests per 5 minutes (300 seconds). An example of such a call is the following line from the previous code snippet:

tasks = api.tasks.query(project=test_project).all()

By default, 50 tasks will be fetched in one request. However, the maximum number is 100 tasks and it can be configured as follows:

tasks = api.tasks.query(project=test_project, limit=100).all()

In other words, even at 100 tasks per request you would exhaust the limit and have to wait a few minutes for it to reset if the project contained more than 100,000 tasks. This scenario almost never happens in practice, since it is unusual to have this many tasks in one project. However, the number can easily be reached for project files, to which the same rule applies. The next few code examples illustrate handling files.
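
If you expect to issue a large number of requests, you can attach the library's error handlers when initializing the API object (they are also used in the final example of this appendix). With rate_limit_sleeper, the client waits for the limit to reset and retries instead of raising an error:

import sevenbridges as sbg
from sevenbridges.http.error_handlers import rate_limit_sleeper, maintenance_sleeper

# Wait and retry automatically on rate limiting or platform maintenance
api = sbg.Api(url='https://cgc-api.sbgenomics.com/v2',
              token='<INSERT_TOKEN>',
              error_handlers=[rate_limit_sleeper, maintenance_sleeper])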

Managing files

Let’s see how we can perform some simple actions on the project files. The following code will print out the number of files in the project:

files = api.files.query(project=test_project, limit=100).all()
print(len(list(files)))

If we now want to read the metadata of a specific file we can run the following:

files = api.files.query(project=test_project, limit=100).all()
for f in list(files):
    if f.name == '95622ade-3084-4ddb-93a1-e6baac7769f7.htseq.counts.gz':
        my_meta = f.metadata
        for key in my_meta:
            print(key,':', my_meta[key])

The output will look like this:

sample_uuid : fed73c89-75f1-45c9-b14b-7840c52c305f
gender : female
vital_status : dead
case_uuid : d63c186d-add6-492e-8250-e83f46d39d00
disease_type : Bladder Urothelial Carcinoma
aliquot_uuid : 024bc00e-0419-4dc6-8df4-ad3af476314d
aliquot_id : TCGA-DK-A3WX-01A-22R-A22U-07
age_at_diagnosis : 67
sample_id : TCGA-DK-A3WX-01A
race : white
primary_site : Bladder
platform :
sample_type : Primary Tumor
experimental_strategy : RNA-Seq
days_to_death : 321
reference_genome : GRCh38.d1.vd1
case_id : TCGA-DK-A3WX
investigation : Bladder Urothelial Carcinoma
ethnicity : not hispanic or latino

As mentioned earlier, the majority of datasets hosted on the platform have immutable metadata. However, if you have your own files and want to add or modify metadata, you can do so by using the following template:

files = api.files.query(project=test_project, limit=100).all()
for f in list(files):
    if f.name == 'example.bam':
        f.metadata['foo'] = 'bar'
        f.save()
        my_meta = f.metadata
        for key in my_meta:
            print(key, ':', my_meta[key])

Bulk operation: reducing the number of requests

As you can see, f.save() is called each time we want to update a file with new information. Each call generates a new API request, and recalling the limit of 1,000 requests per 5 minutes, we can easily hit this ceiling. To avoid this, we can use bulk methods:

all_files = api.files.query(project=test_project, limit=100).all()
all_fastqs = [f for f in all_files if f.name.endswith('.fastq')]
changed_files = []

for f in all_fastqs:
    if f is not None:
        f.metadata['foo'] = 'bar'
        changed_files.append(f)

if changed_files:
    changed_files_chunks = [changed_files[i:i + 100] for i in range(0, len(changed_files), 100)]
    for cf in changed_files_chunks:
        api.files.bulk_update(cf)

In this example, we chose to set metadata only for the FASTQ files found within the project. We do this with only one API request per 100 files – we first split the list of files into chunks of at most 100 and then run the api.files.bulk_update() method on each chunk.

To learn about other useful bulk operations, check out this section in the API Python documentation.

Managing batch tasks

Another situation you may need to handle is batch analysis. If the project we used earlier for printing task status information had contained batch tasks, the reported metrics would not have included the statuses of the child tasks. To query child tasks, we need a couple of additions to our API code.

First, let's see how to check which tasks are batch tasks:

test_project = 'username/project_name'
tasks = api.tasks.query(project=test_project).all()
for task in tasks:
    if task.batch:
        print(task.name)

A common need is to automatically rerun the failed tasks within a batch analysis. Here is how to filter out the failed tasks of a given batch analysis and rerun them with an updated app/workflow:

test_project = 'username/project_name'
my_batch_id = 'BATCH_ID'
my_batch_task = api.tasks.get(my_batch_id)
batch_failed = list(api.tasks.query(project=test_project,
                                        parent=my_batch_task.id,
                                        status='FAILED',
                                        limit=100).all())
print('Number of failed tasks: ', len(batch_failed))

for task in batch_failed:
    old_task = api.tasks.get(task.id)
    api.tasks.create(name='RERUN - ' + old_task.name,
                     project=old_task.project,
                     # App example: cgcuser/demo-project/samtools-depth/8
                     # You can copy this string from the app's URL
                     app=api.apps.get('username/project_name/app_name/revision_number'),
                     inputs=old_task.inputs,
                     run=True)
    print('Running: RERUN - ' + old_task.name)

As you can see, there is no need to specify inputs for each task separately – we can automatically pass the inputs from the old tasks. However, this only applies to re-running the tasks in the same project. If you want to re-run the tasks in a different project, you will need a couple of additional lines to copy the files and apps, as sketched below.
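
A rough sketch of those additional lines is shown here, continuing from the previous snippet (it reuses api and batch_failed). It assumes the destination project already exists; all project and app identifiers are placeholders.

import sevenbridges as sbg

destination = 'username/destination_project'     # placeholder project slug

# Copy the app into the destination project once, before re-creating tasks
app_copy = api.apps.get('username/project_name/app_name/revision_number').copy(
    project=destination)

for task in batch_failed:
    old_task = api.tasks.get(task.id)

    # Copy every file input so the new task references files in the destination project;
    # plain (non-file) parameters can be reused as-is
    new_inputs = {}
    for input_id, value in old_task.inputs.items():
        if isinstance(value, sbg.File):
            new_inputs[input_id] = value.copy(project=destination)
        elif isinstance(value, list) and value and isinstance(value[0], sbg.File):
            new_inputs[input_id] = [f.copy(project=destination) for f in value]
        else:
            new_inputs[input_id] = value

    api.tasks.create(name='RERUN - ' + old_task.name,
                     project=destination,
                     app=app_copy,
                     inputs=new_inputs,
                     run=False)                   # review the drafts before running them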

Collecting specific outputs

Finally, when you are happy with your analysis and want to fetch and further examine specific outputs, you can refer to the next example. Here we share code for copying all the files belonging to a particular output node from all successful runs to another project. The code can easily be modified to fit your purpose, e.g. renaming the files, deleting them to reduce storage, providing them as inputs to another app, etc. The code may appear more complex because it includes error handling and logging, but iterating through the tasks and collecting the files of interest is fairly simple.

The resulting code will look like this:

import sevenbridges as sbg
import time
from sevenbridges.http.error_handlers import (rate_limit_sleeper, maintenance_sleeper, general_error_sleeper)
import logging

logging.basicConfig(level=logging.INFO)
time_start = time.time()
my_token = '<INSERT_TOKEN>'
# Copy project slug from the project URL
my_project = 'username/project_name'
# Get output ID from the "Outputs" tab within "Ports" section on the app page
my_output_id = 'output_ID'
api = sbg.Api('https://cgc-api.sbgenomics.com/v2', token=my_token, advance_access=True,
              error_handlers=[rate_limit_sleeper, maintenance_sleeper, general_error_sleeper])

tasks_queried = list(api.tasks.query(project=my_project, limit=100).all())
task_files_list = []
print('Tasks fetched: ', time.time() - time_start)
for task in tasks_queried:
    if task.batch:
        ts = time.time()
        children = list(api.tasks.query(project=my_project,
                                        parent=task.id,
                                        status='COMPLETED',
                                        limit=100).all())
        print('Query children tasks', time.time() - ts)
        for t in children:
            task_files_list.append(t.outputs[my_output_id])
    elif task.status == 'COMPLETED':
        task_files_list.append(task.outputs[my_output_id])

print('Outputs from all tasks collected: ', time.time() - time_start, '\n')
fts = time.time()

for f in task_files_list:
    print('Copying: ', f.name)
    # Uncomment the following line if everything works as expected after inserting your values
    # f.copy(project='username/destination_project')

print('\nAll files copied :', time.time() - fts)
print('All finished: ', time.time() - time_start)

There are two major cases – either we encounter a batch task, in which case we iterate through all of its completed child tasks, or we encounter an individual task and only need to check whether its status is "COMPLETED". In both cases we collect the output of interest while looping through the tasks, and afterwards loop through the collected files to copy them.

If you have any questions or need help using the API, feel free to contact our support team.