Google Cloud Storage tutorial


##Overview

The Volumes API contains two types of calls: one to connect and manage cloud storage, and the other to import data from a connected cloud account.

Before you can start working with your cloud storage via the CGC, you need to authorize the CGC to access and query objects on that cloud storage on your behalf. This is done by creating a "volume". A volume enables you to treat the cloud repository associated with it as external storage for the CGC. You can import files from the volume to the CGC to use them as inputs for computation.


##Procedure

This short tutorial will guide you through setting up a volume for a GCS bucket. You'll register your GCS bucket as a volume and make an object from the GCS bucket available on the CGC.

Once a volume has been created, you can issue import operations to make data appear on the CGC.


In this tutorial we assume you want to connect to a Google Cloud Storage bucket. The procedure is slightly different for other cloud storage providers, such as Amazon S3. For more information, please refer to our list of supported cloud storage providers.


##Prerequisites

To complete this tutorial, you will need:

  1. A Google (GCP) account.
  2. One or more buckets on this GCP account via Google Cloud Storage (GCS).
  3. One or more objects (files) in your target bucket.
  4. An authentication token for the CGC. Learn more about getting your authentication token.


##Step 1: Register a GCS bucket as a volume

To set up a volume, you first have to register a GCS bucket as a volume. Volumes mediate access between the CGC and your buckets, which are local units of storage in GCS.

You can register a GCS bucket as a volume by following the steps below.


###1a: Create an IAM (Identity and Access Management) user

  1. Log into the Google Cloud Platform console.
  2. From the menu on the left select IAM & Admin > Service accounts.
  3. Click + Create service account below the search bar.
  4. Fill in account details:
    • Service account name - Descriptive name to label the account.
    • Service account ID - Generated automatically based on the entered service account name. Can be modified if necessary.
    • Service account description - More elaborate description of the account’s purpose.
  5. Click Create. The Service account permissions screen opens.
  6. In the Select a role dropdown, select Storage > Storage Object Viewer.
  7. Click Continue. The final screen of the wizard opens.
  8. In the Create key section, click + Create key. Key options are displayed on the right.
  9. In the Key type list select JSON.
  10. Click Create. Your browser will download a JSON file containing the credentials for this user. Keep this file safe.
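
The downloaded key file holds the two values that are needed later when the bucket is registered as a volume: client_email and private_key. As a minimal sketch, assuming the key file is saved locally, they could be read as follows. The function name load_service_account_fields and the file path are hypothetical and used only for illustration.

```python
import json


def load_service_account_fields(key_path):
    """Return the client_email and private_key stored in a service account key file.

    key_path is a hypothetical local path to the JSON file downloaded in step 10.
    """
    with open(key_path, encoding="utf-8") as handle:
        key_data = json.load(handle)
    # Only these two fields are needed when registering the volume.
    return {
        "client_email": key_data["client_email"],
        "private_key": key_data["private_key"],
    }
```

Reading the values programmatically avoids copy-paste mistakes in the long private key, which contains escaped newline characters.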


###1b: Authorize this IAM user to access your bucket

  1. On the Google Cloud Platform console, click the navigation menu in the top-left corner and navigate to the Storage section.
  2. Select Storage > Browser.
  3. Locate your bucket and click the three vertical dots to the far right of your bucket's name.
  4. Click Edit bucket permissions.
  5. Click Add members.
  6. In the New members field enter the service account client's email. This email is located in the JSON file downloaded in the previous section.
  7. From the Select a role drop-down menu, select Storage Legacy > Storage Legacy Bucket Reader.
  8. Click Save. You have now authorized the newly-created IAM user to access the storage bucket.


###1c: Register a bucket

At this point, you can associate the bucket with your CGC account by registering it as a volume.

To register your bucket as a volume, make the API request to Create a volume, as shown in the HTTP request below. Be sure to paste in your authentication token for the X-SBG-Auth-Token key.

This request also requires a request body. Provide a name for your new volume, an optional description, and an object (service) containing the information in the table below. Specify the access_mode as RO for read-only permissions. Be sure to supply a bucket name and substitute in your own credentials.

| Key | Description of value |
| --- | --- |
| type (required) | This must be set to gcs. |
| bucket (required) | The name of your GCS bucket. |
| prefix (default: empty string) | If provided, the value of this parameter will be used to modify any object key before an operation is performed on the bucket. Even though Google Cloud Platform is not truly a folder-based store and allows for almost arbitrarily named keys, the prefix is treated as a folder name. This means that after applying the prefix to the name of the object, the resulting key will be normalized to conform to the standard path-based naming schema for files. For example, if you set the prefix for a volume to a10, and import a file with location set to test.fastq from the volume to the CGC, then the object that will be referred to by the newly-created alias will be a10/test.fastq. |
| client_email | The client email address for the Google Cloud service account to use for operations on this bucket. This can be found in the JSON containing your service account credentials. |
| private_key | The private key for the Google Cloud service account to use for operations on this bucket. This can be found in the JSON containing your service account credentials. |

POST /v2/storage/volumes HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

{
  "name": "tutorial_volume",
  "description": "New GCS volume",
  "service": {
    "type": "gcs",
    "bucket": "test-bucket",
    "prefix": "",
    "credentials": {
      "client_email": "[email protected]",
      "private_key": "-----BEGIN PRIVATE KEY-----\nrand0mIBAAKCAQEAsXj4E7swo97szcOrAcraSbsGnNuTU1b/4llyspDa0lltZIKL\nfl5s3QoqbjUWqAZXkJexei85g49ULD8BGKH2r4EF+XyKcpoon4uIFcbmYcmsUXM\nJ3ujgyL5DbWnQZ6GrqgFNRFVVz/PuvTZOd6KFCrjbbtCxfKoXQrmCwFC/4NlFR3v\n1kavU81w201Mied3e+pxjfiQKAJOoy5I7kfuH20xfzHXWR2YHdQGbzOUZyPgmzZ6\nH6Ry39b7bgLVbyk3++e13KrsTEf58rRzUHLzlcUDcGyf8iTO2vA2qzcbrbovwqJr\n7H4ZfFllDMYQ/ISj4cmi+sz/hR43LUK86emrXwIDAQABAoIBADBr2fvAMbINsZm+\njjTh/ObrAWXgvvSZIx3F2/Z+cUW9Ioyu1ZJ3/uncMTF6iKD1ggSwbqVQIq7zKaWP\ndGNZ4sk62PEQSx8924iiNsGaIqyj5FmvuoD3SeiorR0hd+3+a67RpwIQpaE1ht7y\nmSYh4riX7w9sbU6G44rnQ1azVG1UHvk5ieOD4OPvJopuc6D6ow1oJOnHE0k8v3HY\n1FpLdWCL6nSERqXOI5w+tllG4NMUmTZ2jhaBSEM4PIJVO+24TM3XFCcvhZ7ipPMF\nP5B8hV4hDA4Av1Ei7iuRZlJsH4sRrtHJE3/FZLgqHRRvt/7w4c1xnwirNghtTNMb\nXVoaS/ECgYEA15vL3l22mIoePlcCxIgVCAxhKm6TVQZsAE2EaeVsJKDl0AgCtn/1\nThMIPPGkO8jmjqHGgA+FhjoUQuCCdIuON00mUpmUxZlwI5+uknuK597/zAjd6W8s\n7p9apvBUDfod0hwF9Jfw+aUtZm6EAUNR1Odbb+bpXp1luwfcesHe4QcCgYEA0rg8\nZBBwh2DetU6wWh2JIejBH5SfRUqtEwo5WiEZhrEQLazcpX4w5uvESnT+xd7qx3yC\n/vyzqmy+YwP92Ql0vZApdQoyKGHVntY/o3HYxZD3x+7BKThUs747WjdSo8SwBkSr\nxEzLBgTqqcho6UXvYTTEAg11F5yNYzbvVf4vROkCgYEAh6XtTamIB9Bd1rrHcv5q\nvPWM7DVFXGj96fLbLAS7VRAlhgyEKG2417YBqNYejb6Hz5TYXhll2F0SAkFd0hU7\nFG/lfHJDt04hz0fXfTFc4yTZqnSpqQPZMQfw8LajK2gA+v/Gf2xYn7fcKGW/h0vj\nYB9u16hfirdcGZ+Ih3MR1mECgYEAnq1b1KJIirlYm8FYrVOGe4FxRF2/ngdA05Ck\nZYl9Vl8pZqvAL+MZ4hpyYvs9CzX1KClL38XdaZ2ftKJB2tjzDZYl9Vl8pZqvAL+MZ4hpyYvs9CzX1KClL38XdaZ2ftKJB2tjzDZYl9Vl8pZqvALJlQZYl9Vl8pZqvAL+MZ4hpyYvs9CzX1KClL38XdaZ2ftKJB2tjzDZYl9Vl8pZqvAL+CxZYl9Vl8pZqvAL+MZ4hpyYvs9CzX1KClL38XdaZ2ftKJB2tjzDZYl9Vl8pZqvAL+MjZYl9Vl8pZqvAL+MZ4hpyYvs9CzX1KClL38XdaZ2ftKJB2tjzDZYl9Vl8pZqvALSi0sVSXpA=\n-----END PRIVATE KEY-----"
    }
  },
  "access_mode": "RO"
}
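
The same request can be assembled in a script. The following is a minimal sketch using only the Python standard library. The function name build_create_volume_request and the constant API_BASE are hypothetical, and the base URL is assumed from the Host header and path in the request above. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed from the Host header and request path shown in this tutorial.
API_BASE = "https://cgc-api.sbgenomics.com/v2"


def build_create_volume_request(token, name, bucket, client_email, private_key,
                                prefix="", description="", access_mode="RO"):
    """Build (but do not send) a POST request that registers a GCS bucket as a volume."""
    body = {
        "name": name,
        "description": description,
        "service": {
            "type": "gcs",
            "bucket": bucket,
            "prefix": prefix,
            "credentials": {
                "client_email": client_email,
                "private_key": private_key,
            },
        },
        "access_mode": access_mode,
    }
    return urllib.request.Request(
        API_BASE + "/storage/volumes",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-SBG-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen and parse the response body with json.load. Keeping the building and sending steps separate makes it possible to inspect the body before any credentials leave your machine.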

You'll see a response providing the details for your newly created volume, as shown below.

{
  "href": "https://cgc-api.sbgenomics.com/v2/storage/volumes/rfranklin/tutorial_volume",
  "id": "rfranklin/tutorial_volume",
  "name": "tutorial_volume",
  "description": "New GCS volume",
  "access_mode": "RO",
  "service": {
    "type": "gcs",
    "bucket": "test-bucket",
    "prefix": "",
    "credentials": {
      "client_email": "[email protected]"
    }
  },
  "created_on": "2019-05-26T16:44:20Z",
  "modified_on": "2019-05-26T16:44:20Z",
  "active": true
}
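
The effect of the prefix parameter described in the table above can be illustrated with a short sketch. The actual normalization is performed by the CGC, so the function below (resolve_object_key, a hypothetical name) is only an approximation of the described behavior: it treats the prefix as a folder name and normalizes the result as a path.

```python
import posixpath


def resolve_object_key(prefix, location):
    """Approximate how a volume prefix and an import location combine into an object key."""
    # Drop leading slashes so that the location is always treated as relative to the prefix.
    relative = location.lstrip("/")
    joined = posixpath.join(prefix, relative) if prefix else relative
    # normpath collapses repeated slashes and "." segments; any remaining
    # leading slash is removed because object keys are not absolute paths.
    return posixpath.normpath(joined).lstrip("/")
```

With a prefix of a10 and a location of test.fastq, this sketch yields a10/test.fastq, which matches the example in the table. With an empty prefix the location is returned unchanged.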


##Step 2: Make an object from the bucket available on the CGC

Now that we have a volume, we can make data objects from the bucket associated with the volume available as "aliases" on the CGC. Aliases point to files stored on your cloud storage bucket and can be copied, executed, and organized like normal files on the CGC. We call this operation "importing". Learn more about working with aliases.

To import a data object from your volume as an alias on the CGC, follow the steps below.


###2a: Launch an import job

To import a file, make the API request to start an import job as shown below. In the body of the request, include the key-value pairs in the table below.

| Key | Description of value |
| --- | --- |
| volume_id (required) | Volume ID from which to import the file. This consists of your username followed by the volume's name, such as rfranklin/tutorial_volume. |
| location (required) | Volume-specific location pointing to the file to import. This location should be recognizable to the underlying cloud service as a valid key or path to the file. Please note that if this volume was configured with a prefix parameter when it was created, the prefix will be prepended to location before attempting to locate the file on the volume. |
| destination (required) | This object should describe the destination for the imported file on the CGC. |
| project (required) | The project in which to create the alias. This consists of your username followed by your project's short name, such as rfranklin/my-project. |
| name | The name of the alias to create. This name should be unique to the project. If the name is already in use in the project, you should use the overwrite query parameter in this call to force any file with that name to be deleted before the alias is created. If name is omitted, the alias name will default to the last segment of the complete location (including the prefix) on the volume. Segments are considered to be separated with forward slashes ('/'). |
| overwrite | Specify as true to overwrite the file if a file with the same name already exists in the destination. |

POST /v2/storage/imports HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

{
  "source": {
    "volume": "rfranklin/tutorial_volume",
    "location": "example_human_Illumina.pe_1.fastq"
  },
  "destination": {
    "project": "rfranklin/my-project",
    "name": "my_imported_example_human_Illumina.pe_1.fastq"
  },
  "overwrite": true
}
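
As a sketch of how the import request above might be built from a script, the following uses only the Python standard library. The function name build_import_request and the constant API_BASE are hypothetical, and the body layout mirrors the example request rather than a full API specification. The function builds the request without sending it.

```python
import json
import urllib.request

# Assumed from the Host header and request path shown in this tutorial.
API_BASE = "https://cgc-api.sbgenomics.com/v2"


def build_import_request(token, volume_id, location, project, name=None, overwrite=False):
    """Build (but do not send) a POST request that starts an import job."""
    destination = {"project": project}
    if name is not None:
        # When name is left out, the alias name is chosen by the CGC.
        destination["name"] = name
    body = {
        "source": {"volume": volume_id, "location": location},
        "destination": destination,
        "overwrite": overwrite,
    }
    return urllib.request.Request(
        API_BASE + "/storage/imports",
        data=json.dumps(body).encode("utf-8"),
        headers={"X-SBG-Auth-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. The JSON response can then be parsed with json.load to read the id of the import job.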

The returned response details the status of your import, as shown below.

{
  "href": "https://cgc-api.sbgenomics.com/v2/storage/imports/arand0mgUhk2SpByRg5SPZqTLOEYqG8o",
  "id": "arand0mgUhk2SpByRg5SPZqTLOEYqG8o",
  "state": "PENDING",
  "overwrite": true,
  "source": {
    "volume": "rfranklin/tutorial_volume",
    "location": "example_human_Illumina.pe_1.fastq"
  },
  "destination": {
    "project": "rfranklin/my-project",
    "name": "my_imported_example_human_Illumina.pe_1.fastq"
  }
}

Locate the id property in the response and copy this value to your clipboard. This id is an identifier for the import job, and we will need it in the following step.
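
If name is omitted from destination, the table above states that the alias name defaults to the last segment of the complete location, including the prefix. The rule can be sketched as follows. The function name default_alias_name is hypothetical, and the code only illustrates the documented rule; the actual name is assigned by the CGC.

```python
def default_alias_name(prefix, location):
    """Return the last slash-separated segment of the prefix plus location."""
    # Build the complete location, skipping an empty prefix.
    complete = "/".join(part for part in (prefix, location) if part)
    # Ignore empty segments produced by repeated or trailing slashes.
    segments = [segment for segment in complete.split("/") if segment]
    return segments[-1] if segments else ""
```

For a volume with the prefix a10 and a location of reads/test.fastq, this sketch gives test.fastq as the default alias name.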


###2b: Check if the import job has completed

To check if the import job has completed, make the API request to get details of an import job, as shown below. Simply append the import job id obtained in the step above to the path.

GET /v2/storage/imports/arand0mgUhk2SpByRg5SPZqTLOEYqG8o HTTP/1.1
Host: cgc-api.sbgenomics.com
X-SBG-Auth-Token: 3259c50e1ac5426ea8f1273259740f74
content-type: application/json

The returned response details the state of your import. If the state is COMPLETED, your import has successfully finished. If the state is PENDING, wait a few seconds and repeat this step.
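
The wait-and-repeat step can be automated with a small polling loop. The sketch below is an illustration under stated assumptions: the function names fetch_import_state and wait_for_import are hypothetical, the base URL is taken from the requests in this tutorial, and the FAILED state is an assumption that this tutorial does not confirm. Any state other than COMPLETED or FAILED is treated as still in progress.

```python
import json
import time
import urllib.request

# Assumed from the Host header and request path shown in this tutorial.
API_BASE = "https://cgc-api.sbgenomics.com/v2"


def fetch_import_state(token, import_id):
    """Request the details of an import job and return its state."""
    request = urllib.request.Request(
        f"{API_BASE}/storage/imports/{import_id}",
        headers={"X-SBG-Auth-Token": token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["state"]


def wait_for_import(get_state, interval=5.0, max_checks=60, sleep=time.sleep):
    """Call get_state repeatedly until the job reaches a final state.

    get_state is any zero-argument callable that returns the current state.
    """
    for attempt in range(max_checks):
        state = get_state()
        # "FAILED" is an assumed terminal state, not confirmed by this tutorial.
        if state in ("COMPLETED", "FAILED"):
            return state
        if attempt < max_checks - 1:
            sleep(interval)
    raise TimeoutError(f"import job still not finished after {max_checks} checks")
```

A call such as wait_for_import(lambda: fetch_import_state(token, import_id)) ties the two functions together. Passing the state lookup as a callable keeps the loop independent of the network code, and the cap on the number of checks prevents the script from waiting indefinitely.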

You should now have a freshly-created alias in your project. To verify that the file has been imported, visit this project in your browser and look for a file with the name you specified in destination. If you omitted name, look for a file named after the last segment of the object's key in your bucket.