TCGA meta-data from BAM files

Posted in Add your tools by Franco Caramia Wed Jun 15 2016 05:40:23 GMT+0000 (Coordinated Universal Time)

I'm developing a tool that uses multiple TCGA BAM and MAF files. I need to associate a mutation by its sample name to a BAM file or vice versa. Is there any metadata available to associate the BAM files to their sample name? Thanks in advance
June 17, 2016

I was able to retrieve the metadata using the API. Cheers.

June 17, 2016

Hi Franco, thanks for your question!

In addition to being able to access metadata via the API, metadata for each TCGA file is also available to each tool. You can perform actions like passing metadata between inputs and outputs or parsing input metadata using ‘dynamic expressions’ (look for the </> symbol).

When you run a task on the CGC, a job file is submitted to complete execution. The job object points to input files and has information about each file (e.g. path, size) as well as metadata associated with the file. For example, if it’s a TCGA file, metadata such as the Case ID, Sample ID, Age, Gender, Investigation, etc., are all available.

Let’s say, for example, that you’re wrapping a tool that takes a BAM file as an input. So you’ve made an Input Port with ‘input_bam’ as the ID. To see the metadata for ‘input_bam’ in a dynamic expression window, you can use the following command: $job.inputs.input_bam.metadata. This will return a set of key-value pairs, where the keys are the metadata attributes.

You can return the Sample ID for a TCGA BAM file with the following expression: $job.inputs.input_bam.metadata.sample_id. Remember to replace ‘input_bam’ with the Input Port ID you use in your tool.
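If you want to try these expressions outside the CGC first, a minimal sketch with a mock job object may help. The object below is hypothetical: its shape follows the description above, but the path, size, metadata values, and the `case_id` key name are assumptions made up for illustration, and on the platform `$job` is supplied for you.

```javascript
// Hypothetical stand-in for the job object that the CGC provides to an
// expression. Every value below is invented for illustration only.
var $job = {
  inputs: {
    input_bam: {
      path: '/path/to/example.bam',
      size: 1024,
      metadata: {
        case_id: 'CASE-0001',     // assumed key name
        sample_id: 'SAMPLE-0001'
      }
    }
  }
};

// All metadata for the input port, as key-value pairs.
var metadata = $job.inputs.input_bam.metadata;

// A single attribute: the Sample ID.
var sampleId = $job.inputs.input_bam.metadata.sample_id;

console.log(Object.keys(metadata));
console.log(sampleId);
```

In a real tool you would only enter the expression itself (for example `$job.inputs.input_bam.metadata.sample_id`) in the dynamic expression window.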

More documentation on dynamic expressions and how to use them to parse metadata in tools is available here.

Please let me know if this answers your question. If not -- I'm happy to clarify further!

June 20, 2016

Hi Gaurav,

Thank you for the information, it is very helpful. I do have a couple of questions though:

  • In case my input is not a single file but a File array (in this case BAM files), will the dynamic expression behave the same? Will the metadata be presented to the tool as a File array too?

Many thanks,

June 20, 2016

Hi Franco,

If you have an array of files, the expression will call the whole array.

To get information from individual files in the array, you can use the following syntax: $job.inputs.input_bam[i].metadata.sample_id where i is the index for the file you want to examine. Note that the array is 0-indexed so $job.inputs.input_bam[0].metadata.sample_id will get the Sample ID for the first BAM file, and so on. You can create an expression that loops over each file to capture the metadata as needed.
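As a rough sketch of such a loop, the snippet below collects the Sample ID of every file in the array. The `$job` object here is a hypothetical mock with invented paths and Sample IDs so the snippet can run on its own; on the CGC the real object is provided to the expression.

```javascript
// Hypothetical mock of a job whose 'input_bam' port holds an array of files.
// Paths and Sample IDs are invented for illustration.
var $job = {
  inputs: {
    input_bam: [
      { path: '/data/first.bam',  metadata: { sample_id: 'SAMPLE-A' } },
      { path: '/data/second.bam', metadata: { sample_id: 'SAMPLE-B' } },
      { path: '/data/third.bam',  metadata: { sample_id: 'SAMPLE-C' } }
    ]
  }
};

// Walk the 0-indexed array and collect one Sample ID per file.
var sampleIds = [];
for (var i = 0; i < $job.inputs.input_bam.length; i++) {
  sampleIds.push($job.inputs.input_bam[i].metadata.sample_id);
}

console.log(sampleIds);
```

Collecting the values into an array (or appending them to a string) inside the loop is what keeps every file's metadata, rather than only one value.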

I'd be happy to help you with specifics if you want to comment back with what, concretely, you wish to do with the metadata.

Thanks again for posting in the forum! Looking forward to hearing from you.

June 20, 2016

Hi Gaurav,

Thanks for your help again.

I'm trying to execute my tool with any combination of file_name and sample_id for a large number of BAM files.

Example: -b file_1 -s sample_id_1 -b file_2 -s sample_id_2 ..... -b file_n -s sample_id_n

or -in file_1:sample_id_1 -in file_2:sample_id_2 ...... -in file_n:sample_id_n

I think this should be easy enough now that you have explained the syntax.


June 22, 2016

Hi Gaurav,

I ended up using the following expression:

var arrayLength = $job.inputs.bam_in.length;
for (var i = 0; i < arrayLength; i++) {

Do you think this should work?

June 22, 2016

Hi Franco, I believe this expression will only evaluate to the Sample ID of the last file in your array. I'm going to look into this further and get back to you. Thanks for updating here -- this is a really interesting use case!

June 23, 2016

Hi Gaurav,

Thanks for the help.

I can work around this by generating my own metadata file with the API, but it would be great if that step could be avoided.

I also noticed that, when configuring the tool settings, there is a "Requires SBG metadata extension" check box. Is this related to my issue at all?


June 23, 2016

Hi Franco,

I think I have a solution for your use case!

In the Input Port for your BAM files (ID: bam_in), you can input the following Dynamic Expression for the Value field:

{
  var commandLine = '';
  for (var i = 0; i < $job.inputs.bam_in.length; i++) {
    commandLine += ' -in ';
    commandLine += $job.inputs.bam_in[i].path.replace(/.*\/|\.[^.]*$/g, '');
    commandLine += ':';
    commandLine += $job.inputs.bam_in[i].metadata.sample_id;
  }
  return commandLine;
}

I'll walk you through the expression. First, we create an empty string called commandLine. In the for loop, this expression then adds the prefix -in, then the basename of the BAM file (note the regular expressions that pop the extension and path to the file), then the separator :, then the metadata value (Sample ID).

The end result is: -in bam_in-1:sample_id1 -in bam_in-2:sample_id2 -in bam_in-3:sample_id3 ... and so on, which gets passed to the command line with return commandLine.
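To see what the regular expression in the expression above does on its own, here is a small sketch you can run locally. The helper name `stripPathAndExtension` and the example paths are made up for illustration; only the regular expression itself comes from the expression.

```javascript
// Hypothetical helper wrapping the same regular expression used above.
// The first alternative (.*\/) removes everything up to the last slash;
// the second (\.[^.]*$) removes the final extension.
function stripPathAndExtension(path) {
  return path.replace(/.*\/|\.[^.]*$/g, '');
}

// Invented example paths.
var a = stripPathAndExtension('/data/project/sample_1.bam');
var b = stripPathAndExtension('sample.sorted.bam');
var c = stripPathAndExtension('/data/v1.2/sample');

console.log(a); // sample_1
console.log(b); // sample.sorted
console.log(c); // sample
```

Note that only the last extension is removed, so a name like `sample.sorted.bam` keeps its `.sorted` part.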

This is the solution for your second example (two posts above). For your first example command line, here's the expression:

{
  var commandLine = '';
  for (var i = 0; i < $job.inputs.bam_in.length; i++) {
    commandLine += ' -b ';
    commandLine += $job.inputs.bam_in[i].path.replace(/.*\/|\.[^.]*$/g, '');
    commandLine += ' -s ';
    commandLine += $job.inputs.bam_in[i].metadata.sample_id;
  }
  return commandLine;
}

Which evaluates to -b bam_in-1 -s sample_id1 -b bam_in-2 -s sample_id2 -b bam_in-3 -s sample_id3 ....

Note that all you have to do for the second solution is change the lines with the prefixes and/or separators.
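Since the two expressions differ only in the prefix and the separator, one way to check both locally is a single function that takes them as arguments. This is only a sketch for testing outside the platform: `buildCommandLine` and the mock file objects are hypothetical, and in the tool editor you would still paste the expression form shown above.

```javascript
// Hypothetical helper: same loop as the expressions above, with the prefix
// and separator passed in instead of hard-coded.
function buildCommandLine(files, filePrefix, separator) {
  var commandLine = '';
  for (var i = 0; i < files.length; i++) {
    commandLine += filePrefix;
    // Basename of the file without its path or final extension.
    commandLine += files[i].path.replace(/.*\/|\.[^.]*$/g, '');
    commandLine += separator;
    commandLine += files[i].metadata.sample_id;
  }
  return commandLine;
}

// Invented mock of the file array that would sit at $job.inputs.bam_in.
var files = [
  { path: '/data/file_1.bam', metadata: { sample_id: 'sample_id_1' } },
  { path: '/data/file_2.bam', metadata: { sample_id: 'sample_id_2' } }
];

var pairedForm = buildCommandLine(files, ' -in ', ':');
var flagForm = buildCommandLine(files, ' -b ', ' -s ');

console.log(pairedForm); // " -in file_1:sample_id_1 -in file_2:sample_id_2"
console.log(flagForm);   // " -b file_1 -s sample_id_1 -b file_2 -s sample_id_2"
```

Both results start with a space because the prefix does, which keeps them separated from whatever precedes them on the command line.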

Franco, let me know if you have any questions. This was really fun and helpful to work on. Please keep us updated on your progress :)

June 23, 2016

Hi Franco,

By the way -- make sure to select SBG Metadata and Stage Input (Link) in the Input Port! This will make sure your metadata is available and that the files are linked to the working directory. :)

July 12, 2016

Hi Gaurav,

That worked really well, though I had to use $self instead of $job in order to pass the validation stage.

Many thanks!
