AWS Cloud storage tutorial

Overview

The Volumes API contains two types of calls: one to connect and manage cloud storage, and the other to import and export data to and from a connected cloud account.

Before you can start working with your cloud storage via the CGC, you need to authorize the CGC to access and query objects on that cloud storage on your behalf. This is done by creating a "volume". A volume enables you to treat the cloud repository associated with it as external storage for the CGC. You can 'import' files from the volume to the CGC to use them as inputs for computation. Similarly, you can write files from the CGC to your cloud storage by 'exporting' them to your volume. Learn more about working with volumes.

The CGC uses Amazon Web Services as a cloud infrastructure provider. This affects the cloud storage you can access and associate with your CGC account. For instance, you have full read-write access to your data stored in Amazon Web Services' S3 and read-only access to data stored in Google Cloud Storage.

Procedure

This short tutorial will guide you through setting up a volume. You'll connect your Amazon S3 bucket as a volume, make an object from the bucket available on the CGC, then move a file from the CGC to the bucket.

Once a volume is created, you can issue import and export operations to make data appear on the CGC or to move your CGC files to the underlying cloud storage provider.

📘
In this tutorial we assume you want to connect to an Amazon S3 bucket. The procedure will be slightly different for other cloud storage providers, such as a Google Cloud Storage bucket. For more information, please refer to our list of supported cloud storage providers.

Prerequisites

To complete this tutorial, you will need:

An Amazon Web Services (AWS) account
One or more buckets on this AWS account
One or more objects (files) in your target bucket
An authentication token for the CGC. Learn more about getting your authentication token.

Step 1: Add an S3 bucket as a volume

To set up a volume, first register an AWS S3 bucket with the CGC. Volumes mediate access between the CGC and your buckets, which are the units of storage in AWS S3.

👍
(Optional) You can also provide your KMS ID if you opt to use KMS for encryption.

1a: Create an IAM (Identity and Access Management) user

Follow AWS documentation for directions on creating an IAM user.

1b: Create access keys for the IAM user

In the list of IAM users, locate the IAM user you created above. Click the username to configure your options.

Click the Security credentials tab.
In the Access keys section, click Create access key. You receive two values: an Access key ID and a Secret access key.

Copy the credentials for later use. Be sure to keep your credentials somewhere safe. You can also click Download .csv file to obtain them in a file named accessKeys.csv.

1c: Attach your volume to the CGC

From the main menu bar on the CGC, select Data > Volumes.
Click Attach volume. If you already have attached volumes, in the top-right corner click Connect Storage.
Select Amazon Web Services.
Enter the Access key ID and Secret access key you obtained in section 1b above.
Click Next.
In the Bucket name field enter the name of the S3 bucket you wish to connect. Volume name is the display name of the volume on the CGC and will be generated automatically.
(Optional) Enter volume description.
Set access privileges for the volume. Available options are:
- Read only (RO) - You will be able to read files, but won't be able to add them to the volume.
- Read and Write (RW) - You will be able to read files and also add files to the volume.
Click Next. You are now taken to the generated policy.

Copy the content of the box.
In the list of IAM users in AWS Management Console, locate the IAM user you created in section 1a. Click the username to configure your options.
On the Permissions tab, click Add inline policy.

Select the JSON tab and replace the existing content by pasting the policy you copied from the CGC wizard.

Click Review policy.
Enter a descriptive policy name, e.g. sb-access-policy. Note that you can only use alphanumerics and the following characters: +=,.@-_ .
Click Create policy.
Then, go back to the CGC and click Next in the wizard.
In the Endpoint field enter s3.amazonaws.com. Leave default values for other settings.
Click Next. You can now review your volume connection settings.
Finally, click Connect. Your volume should now be connected to the CGC and visible in the list of volumes.

Step 2: Make an object from the bucket available on the CGC

Now that we have a volume, we can make data objects from the bucket associated with the volume available as "aliases" on the CGC. Aliases point to files stored on your cloud storage bucket and can be copied, executed, and organized like normal files on the CGC. We call this operation "importing". Learn more about working with aliases.

To import a data object from your volume as an alias on the CGC, follow the steps below.

2a: Launch an import job

To import a file, make the API request to start an import job as shown below. In the body of the request, include the key-value pairs in the table below.

Key	Description of value
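`source` required	This object should describe the source from which the file should be imported.
`volume` required	ID of the volume from which to import the file. This consists of your username followed by the volume's name, such as `rfranklin/sb-volume-demo`.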
`location` required	Volume-specific location pointing to the file to import. This location should be recognizable to the underlying cloud service as a valid key or path to the file. Please note that if this volume was configured with a `prefix` parameter when it was created, the `prefix` will be prepended to location before attempting to locate the file on the volume.
`destination` required	This object should describe the CGC destination for the imported file.
`project` required	The project in which to create the alias. This consists of your username followed by your project's short name, such as `rfranklin/my-project`.
`name`	The name of the alias to create. This name should be unique to the project. If the name is already in use in the project, set the `overwrite` parameter in this call to `true` to force any file with that name to be deleted before the alias is created. If `name` is omitted, the alias name will default to the last segment of the complete location (including the `prefix`) on the volume. Segments are considered to be separated with forward slashes ('/').
`overwrite`	Specify as `true` to overwrite the file if the file with the same name already exists in the destination.

POST /v2/storage/imports HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

{ 
   "source":{ 
      "volume":"rfranklin/sb-volume-demo",
      "location":"example_human_Illumina.pe_1.fastq"
   },
   "destination":{ 
      "project":"rfranklin/my-project",
      "name":"my_imported_example_human_Illumina.pe_1.fastq"
   },
   "overwrite": true
}

The returned response details the status of your import, as shown below.

{
  "href": "https://cgc-api.sbgenomics.com/v2/storage/imports/5rand0mXYcDQ3xtSHrKuK2jXNDtJhMBN",
  "id": "5rand0mXYcDQ3xtSHrKuK2jXNDtJhMBN",
  "state": "PENDING",
  "overwrite": true,
  "source": {
    "volume": "rfranklin/sb-volume-demo",
    "location": "example_human_Illumina.pe_1.fastq"
  },
  "destination": {
    "project": "rfranklin/my-project",
    "name": "my_imported_example_human_Illumina.pe_1.fastq"
  }
}

Locate the id property in the response and copy this value to your clipboard. This id is an identifier for the import job, and we will need it in the following step.
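If you prefer to script this step, the request above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names `build_import_request` and `start_import` are hypothetical, and the HTTPS base URL is assumed from the host and `href` values shown in this tutorial.

```python
import json
import urllib.request

# Assumption: base URL derived from the Host header and href values in this tutorial.
API_BASE = "https://cgc-api.sbgenomics.com/v2"


def build_import_request(token, volume, location, project, name=None, overwrite=False):
    """Build (but do not send) the POST request that starts an import job."""
    destination = {"project": project}
    if name is not None:
        # When name is left out, the CGC picks a default alias name.
        destination["name"] = name
    body = {
        "source": {"volume": volume, "location": location},
        "destination": destination,
        "overwrite": overwrite,
    }
    return urllib.request.Request(
        f"{API_BASE}/storage/imports",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-SBG-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )


def start_import(request):
    """Send the request and return the import job id from the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

Building the request separately from sending it lets you inspect the body before any call is made. To run it, pass the result of `build_import_request(...)` with your own token, volume, and project to `start_import`, which returns the `id` value described above.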

2b: Check if the import job has completed

To check if the import job has completed, make the API request to get details of an import job, as shown below. Simply append the import job id obtained in the step above to the path.

GET /v2/storage/imports/5rand0mXYcDQ3xtSHrKuK2jXNDtJhMBN HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

The returned response details the state of your import. If the state is COMPLETED, your import has successfully finished. If the state is PENDING, wait a few seconds and repeat this step.
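The "wait a few seconds and repeat" loop can be sketched as a small polling helper. This is an illustration under stated assumptions: `wait_for_job` is a hypothetical name, and `fetch_state` stands for any function you write that performs the GET request above and returns the `state` value from the response. The interval and attempt limit are arbitrary defaults, not values prescribed by the CGC.

```python
import time


def wait_for_job(fetch_state, interval_seconds=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch_state() until it stops returning "PENDING", then return that state."""
    for attempt in range(max_attempts):
        state = fetch_state()
        if state != "PENDING":
            # "COMPLETED" means success; any other value is handed back to the caller.
            return state
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job still PENDING after {max_attempts} checks")
```

The helper only knows about the two states named in this tutorial, so check the returned value yourself before assuming the alias exists. The same pattern would apply to any job that reports a `state` in this way.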

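You should now have a freshly created alias in your project. To verify that the file has been imported, visit this project in your browser and look for a file with the name you set in the request (my_imported_example_human_Illumina.pe_1.fastq in the example above). If you omitted name, the alias takes the last segment of the object's key in your bucket.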
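If you are unsure which name to look for when name was left out of the import request, the defaulting rule from step 2a can be sketched as follows. The function `default_alias_name` is hypothetical, and the sketch assumes the volume's `prefix` is joined to `location` by plain concatenation, since the text only says it is "prepended".

```python
def default_alias_name(location, prefix=""):
    """Return the alias name used when name is omitted from the import request."""
    # Assumption: the prefix is prepended as-is, without adding a separator.
    complete_location = prefix + location
    # The alias name is the last '/'-separated segment of the complete location.
    return complete_location.rsplit("/", 1)[-1]
```

For example, `default_alias_name("example_human_Illumina.pe_1.fastq")` returns the location unchanged, while `default_alias_name("reads/sample.fastq", prefix="data/")` returns `"sample.fastq"`.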

Step 3: Move a file from the CGC to the bucket

You've successfully created an alias on the CGC for a file in your S3 bucket. You can also move files from the CGC into your connected S3 bucket. This operation is known as 'exporting' to the volume associated with the bucket. Please note that export to a volume is available only via the API (including API client libraries), and through the Seven Bridges CLI. Also, keep in mind that public files, files belonging to CGC-hosted datasets, archived files, and aliases cannot be exported. For more information, please see working with aliases.

Follow the steps below to move a file from CGC to an object in your bucket.

3a: Upload a file to a project

Before you can export a file from the CGC, you must upload a file to a project. To upload a file, follow the steps below:

Upload a file to your project using the command line uploader, the visual interface, an FTP or HTTP(S) server, or the API.
Locate and copy the file ID. From the various upload mechanisms, you can find the file ID as follows:

Command line uploader - In the output of the command line uploader, note the first column in the line that corresponds to the uploaded file. This is the uploaded file's ID.
Upload via the visual interface - Once the file has uploaded, locate the file in the Files tab of the relevant project. Click on the file's name. A new page with details about your file should open. Locate the last segment of this page's URL, following /files/. This is the uploaded file's ID. For example, the file ID of https://cgc.sbgenomics.com/u/rfranklin/volumes-api-project/files/577d4c35e4b05e75806f2853/ is 577d4c35e4b05e75806f2853.
FTP or HTTP(S) server - Locate the file's ID in the same way as for files uploaded using the visual interface.
API - Issue the API request to List all files within a project. The ID of each file is listed next to the key id in the response body.
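For the API route, picking the ID out of the listing can be sketched as below. This sketch works on the already parsed JSON response and makes two assumptions that this tutorial does not confirm: that the files are returned in a list under the key `items`, and that each entry carries a `name` key alongside `id`. The function name `find_file_ids` and the file names in the sample are hypothetical.

```python
def find_file_ids(listing, file_name):
    """Return the ids of all entries in a parsed file listing whose name matches."""
    # Assumption: entries live under "items" and each has "id" and "name" keys.
    return [
        item["id"]
        for item in listing.get("items", [])
        if item.get("name") == file_name
    ]


# Hypothetical parsed response, reusing the file ID from the example URL above.
sample_listing = {
    "items": [
        {"id": "577d4c35e4b05e75806f2853", "name": "output.vcf"},
        {"id": "567890abc4f3066bc3750174", "name": "other.vcf"},
    ]
}
matching_ids = find_file_ids(sample_listing, "output.vcf")
```

The function returns a list rather than a single value so that you notice when no file, or more than one file, matches the name.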

3b: Move a file from your project on the CGC to the bucket

When you export a file from the CGC to your volume, you are writing to your S3 bucket.

Make the API request to start an export job to move a file from the CGC to your bucket, as shown below. In the body of your request, include the key-value pairs from the table below.

Key	Value
`source` required	This object should describe the source from which the file should be exported.
`file` required	The CGC-assigned ID of the file for export.
`destination` required	This object should describe the destination to which the file will be exported.
`volume` required	The ID of the volume to which the file will be exported.
`location` required	Volume-specific location to which the file will be exported. This location should be recognizable to the underlying cloud service as a valid key or path to a new file. Please note that if this volume has been configured with a `prefix` parameter, the value of `prefix` will be prepended to `location` before attempting to create the file on the volume.
`properties`	Service-specific properties of the export. These values override the defaults from the volume.
`sse_algorithm` default: AES256	S3 server-side encryption to use when exporting to this bucket. Supported values: `AES256` (SSE-S3 encryption), `aws:kms` (SSE-KMS encryption), `null` (no server-side encryption).
`sse_aws_kms_key_id`	Provide your AWS KMS ID here if you specify `aws:kms` as your `sse_algorithm`. Learn more about AWS KMS.

POST /v2/storage/exports HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

{
  "source": {
    "file": "567890abc4f3066bc3750174"
  },
  "destination": {
    "volume": "rfranklin/sb-volume-demo",
    "location": "output.vcf"
  },
  "properties": {
    "sse_algorithm": "AES256"
  }
}

The returned response details the status of your export, as shown below.

{
  "href": "https://cgc-api.sbgenomics.com/v2/storage/exports/trand0mgPWZbeXxtADETWtFkrE87JBSd",
  "id": "trand0mgPWZbeXxtADETWtFkrE87JBSd",
  "state": "PENDING",
  "source": {
    "file": "567890abc4f3066bc3750174"
  },
  "destination": {
    "volume": "rfranklin/sb-volume-demo",
    "location": "output.vcf"
  },
  "started_on": "2016-06-15T19:17:39Z",
  "properties": {
    "sse_algorithm": "AES256",
    "aws_storage_class": "STANDARD",
    "aws_canned_acl": "public-read"
  },
  "overwrite": false
}

Locate the id property in the response and copy this to your clipboard. This id is the identifier for the export job, and we will use it in the next step to verify that the job has completed.
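When scripting exports, it can help to assemble and check the request body before sending it. The sketch below builds the body described in the table above and rejects encryption settings outside the listed values. The name `build_export_body` is hypothetical, and refusing a KMS key ID when `sse_algorithm` is not `aws:kms` is a choice made for this sketch, not documented API behavior.

```python
import json

# Supported values taken from the table above; None is serialized as JSON null.
SUPPORTED_SSE = {"AES256", "aws:kms", None}


def build_export_body(file_id, volume, location, sse_algorithm="AES256", sse_aws_kms_key_id=None):
    """Return the JSON-serializable body for a request that starts an export job."""
    if sse_algorithm not in SUPPORTED_SSE:
        raise ValueError(f"unsupported sse_algorithm: {sse_algorithm!r}")
    properties = {"sse_algorithm": sse_algorithm}
    if sse_aws_kms_key_id is not None:
        # Sketch-level check: a KMS key ID only makes sense with aws:kms.
        if sse_algorithm != "aws:kms":
            raise ValueError("sse_aws_kms_key_id requires sse_algorithm 'aws:kms'")
        properties["sse_aws_kms_key_id"] = sse_aws_kms_key_id
    return {
        "source": {"file": file_id},
        "destination": {"volume": volume, "location": location},
        "properties": properties,
    }


# Body matching the example request in this step.
example_body = build_export_body(
    "567890abc4f3066bc3750174", "rfranklin/sb-volume-demo", "output.vcf"
)
example_json = json.dumps(example_body, indent=2)
```

The resulting `example_json` string can be sent as the body of the POST request to /v2/storage/exports with the same headers as in the request above.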

3c: Check if the export job has completed

To check the status of your export job, make the API request to get details of an export job, as shown below. Append the export job id you obtained in the step above to the path.

GET /v2/storage/exports/trand0mgPWZbeXxtADETWtFkrE87JBSd HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

The returned response details the state of your export. If the state is COMPLETED, your export has successfully finished. If the state is PENDING, wait a few seconds and repeat this step.

Your bucket now contains the file that you uploaded to the CGC in step 3a. To verify that the file has been exported, visit your project on the CGC and locate the file you originally uploaded. It should be marked as an alias. This means that the content of the file has been moved from storage on the CGC to your S3 bucket, and that the CGC file record has been updated accordingly.

Congratulations! You've now registered an S3 bucket as a volume, imported a file from the volume to the CGC, and exported a file from the CGC to the volume. Learn more about connecting your cloud storage from our Knowledge Center.