Cancer Genomics Cloud

Comprehensive tips for reliable and efficient analysis set-up

Objective

This guide is designed to help you with your first set of projects on the CGC. Each section provides specific examples and instructions that demonstrate how to accomplish each step, and lists common mistakes to avoid while setting up a project. If you need more detail on a particular subject, our Knowledge Center covers all of the features of the CGC, and our Support Team is available 24/7 to help.

Helpful terms to know

Tool / App (used interchangeably) – refers to a stand-alone bioinformatics tool or its Common Workflow Language (CWL) wrapper that is created or already available on the platform.

Workflow / Pipeline (used interchangeably) – denotes a number of apps connected together in order to perform multiple analysis steps in one run.

Task – represents an execution of a particular app or workflow on the platform. Depending on what is being executed (app or workflow), a single task can consist of only one tool execution (app case) or multiple executions (one or more for each app in the workflow).

Job – refers to the “execution” part of the “Task” definition above: a single run of a single tool found within a workflow. If you come from a computer science background, you will notice that this definition is quite similar to the common understanding of the term “job” (Wikipedia), except that here a job is a component of a bigger unit of work called a task, and not the other way around, as may be the case in some other areas. To further illustrate what a job means on the platform, you can visually inspect the jobs after a task has been executed using the View stats & logs panel (button in the upper right corner of the task page):

Figure 1. The jobs for an example run of the RNA-Seq Quantification (HISAT2, StringTie) public workflow.

The green bars under the gray ones (apps) represent the jobs (Figure 1). As you can see, some apps (e.g. HISAT2_Build) consist of only one job, whereas others (e.g. HISAT2) contain multiple jobs that are executed simultaneously.
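To make the relationship between tasks, apps and jobs more concrete, here is a minimal Python sketch that models it with plain data classes. It does not call the CGC API. The class names (`Job`, `AppRun`, `Task`), the helper `make_app_run` and the job counts are all hypothetical, chosen only to mirror the shape of Figure 1: one job for HISAT2_Build and several for HISAT2.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """A single run of a single tool (one green bar in Figure 1)."""
    name: str


@dataclass
class AppRun:
    """One app within a task (one gray bar); it holds one or more jobs."""
    app_name: str
    jobs: List[Job] = field(default_factory=list)


@dataclass
class Task:
    """One execution of an app or workflow; the unit that contains jobs."""
    name: str
    app_runs: List[AppRun] = field(default_factory=list)

    def job_count(self) -> int:
        # Total number of jobs across every app in the task.
        return sum(len(run.jobs) for run in self.app_runs)


def make_app_run(app_name: str, n_jobs: int) -> AppRun:
    # Hypothetical helper: build an app run with n_jobs numbered jobs.
    return AppRun(app_name, [Job(f"{app_name}_{i}") for i in range(1, n_jobs + 1)])


# Illustrative numbers only; a real task reports its own job counts.
task = Task(
    name="RNA-Seq Quantification (HISAT2, StringTie)",
    app_runs=[make_app_run("HISAT2_Build", 1), make_app_run("HISAT2", 3)],
)

for run in task.app_runs:
    print(f"{run.app_name}: {len(run.jobs)} job(s)")
print(f"Total jobs in task: {task.job_count()}")
```

In this model a task that runs a single app would simply contain one `AppRun`, which matches the "app case" in the Task definition above.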
