My program reads files from a directory. How can I use the SDK to help run my app on the CGC?

Posted in Add your tools by Zoidberg Wed Mar 02 2016 22:11:12 GMT+0000 (UTC)·1·Viewed 594 times

My program reads and writes files from a directory but the CGC doesn’t have a folder structure. How can I use the SDK to help run my app on the CGC?
Gaurav Kaushik
Mar 2, 2016

Files on the CGC are stored in Amazon S3 buckets, which are object stores. When you run a task, the files and your Docker container are copied to an Amazon EC2 instance, where the task is executed with your tools, environment, and data. Therefore, there is no native support for a directory structure. However, the Rabix SDK can be used to run applications that depend on folders with minor modifications to either data or existing code.

For example, let’s say that you have an R script that expects a folder name as a positional argument. It then reads the files inside the directory to run properly. The command line would look like this:

Rscript myScript.R myFolder

Since users cannot select a directory as a valid input, you may take one of two approaches to run this app on the CGC:

1) You can modify your script to accept multiple files from the command line. The resulting command line would resemble this:

Rscript myScript.R file1.ext file2.ext file3.ext ...

If your command line can accept a series of files, you can simply create an Input port for an array of files in the SDK.

On the Tasks page, when you click 'Pick files', you'll be able to select multiple files from the file browser. This approach requires more changes to your code but is more straightforward to implement.
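To illustrate how an array input maps onto that command line, here is a minimal JavaScript sketch. It assumes a hypothetical input port with the ID input_files whose value is an array of file objects, each with a path property (mirroring the $job.inputs.<id>.path pattern used for single files). The mock job object and the file paths are made up for the example, and the SDK normally assembles the command line for you, so treat this only as a picture of the expected result.

```javascript
// Mock of the job object; on the platform this is supplied at run time.
// "input_files" is a hypothetical port ID for an array-of-files input.
const $job = {
  inputs: {
    input_files: [
      { path: "/data/file1.ext" },
      { path: "/data/file2.ext" },
      { path: "/data/file3.ext" },
    ],
  },
};

// Join every selected file path into one list of positional arguments.
function buildCommandLine(job) {
  const paths = job.inputs.input_files.map((file) => file.path);
  return ["Rscript", "myScript.R", ...paths].join(" ");
}

console.log(buildCommandLine($job));
// Rscript myScript.R /data/file1.ext /data/file2.ext /data/file3.ext
```

Your script then only has to loop over its positional arguments instead of listing a directory.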

2) If you want an approach that requires no change to your existing codebase, you can do the following.

Start by tar’ing the directory that your data is in. Then upload the TAR file to the CGC. In your CWL application, you can use command line arguments to untar that file and read from the resulting directory.

For example, I have a directory called myFolder, which I tar as myFolder.tar (tar -cvf myFolder.tar myFolder). When I create the Common Workflow Language application in the SDK, the app needs to untar the input file so that my R script can read the resulting folder:

tar -xvf myFolder.tar && Rscript myScript.R myFolder

This requires one input port and two dynamic expressions:
Step 1: Under Inputs, create an input port for a TAR file (make note of the ID you use; for example, input_tar).
Step 2: Under Base Command [General Tab > Base Command], create the following expression:

'tar -xvf ' + $job.inputs.input_tar.path + ' &&'

Step 3: Under Arguments, create the argument that supplies the folder name. If your TAR file and the folder created by untarring it have the same name, you can set the Value for this Argument to an expression that gets the basename of the TAR file:

$job.inputs.input_tar.path.replace(/.*\/|\.[^.]*$/g, '')
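To check what the two expressions above produce, you can evaluate them against a mock job object in any JavaScript console. This is a minimal sketch: the $job object and the path /path/to/myFolder.tar are made up for illustration, since the real object is supplied by the platform at run time.

```javascript
// Mock job object for illustration only; the path is a made-up example.
const $job = {
  inputs: {
    input_tar: { path: "/path/to/myFolder.tar" },
  },
};

// Base Command expression: untar the input file, then chain the next command.
const baseCommand = "tar -xvf " + $job.inputs.input_tar.path + " &&";

// Argument expression: strip the leading directories and the final extension.
const folderName = $job.inputs.input_tar.path.replace(/.*\/|\.[^.]*$/g, "");

console.log(baseCommand); // tar -xvf /path/to/myFolder.tar &&
console.log(folderName);  // myFolder

// Only the last extension is removed, so a double extension leaves a remainder.
const gzName = "/path/to/myFolder.tar.gz".replace(/.*\/|\.[^.]*$/g, "");
console.log(gzName); // myFolder.tar
```

Note the last case: for an archive named myFolder.tar.gz the expression yields myFolder.tar, so the basename only matches the folder name when the file has a single extension.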

If your folder name doesn’t match the basename of the TAR file, you can create an additional Input port for a string that represents your folder name. For example, if you had an Input set with the ID ‘input_dirname’ and the type as string, you can set the Value for the Argument from Step 3 as:

$job.inputs.input_dirname || $job.inputs.input_tar.path.replace(/.*\/|\.[^.]*$/g, '')

You can now specify a directory name when running a task with this app; if that field is left empty, the app falls back to the basename of the TAR file.
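The fallback relies on JavaScript's || operator, which returns its right-hand side whenever the left-hand side is falsy (an empty string, null, or undefined). The sketch below wraps the expression in a hypothetical helper, folderArgument, so it can be tried against mock job objects; the helper name and the example paths are assumptions for illustration and are not part of the SDK.

```javascript
// Hypothetical helper wrapping the Argument expression so it can be
// evaluated against different mock job objects.
function folderArgument($job) {
  return $job.inputs.input_dirname ||
    $job.inputs.input_tar.path.replace(/.*\/|\.[^.]*$/g, "");
}

// Directory name supplied explicitly: it takes precedence.
console.log(folderArgument({
  inputs: { input_dirname: "data", input_tar: { path: "/path/to/archive.tar" } },
})); // data

// Field left empty: fall back to the basename of the TAR file.
console.log(folderArgument({
  inputs: { input_dirname: "", input_tar: { path: "/path/to/myFolder.tar" } },
})); // myFolder
```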

  