TCGA meta-data from BAM files
Posted in Add your tools by Franco Caramia, Wed Jun 15 2016 05:40:23 GMT+0000 (Coordinated Universal Time)
I'm developing a tool that uses multiple TCGA BAM and MAF files. I need to associate a mutation by its sample name to a BAM file or vice versa. Is there any metadata available to associate the BAM files to their sample name?
Thanks in advance
I was able to retrieve the metadata using the API. Cheers.
Hi Franco, thanks for your question!
In addition to being able to access metadata via the API, metadata for each TCGA file is also available to each tool. You can perform actions like passing metadata between inputs and outputs or parsing input metadata using ‘dynamic expressions’ (look for the </> symbol).
When you run a task on the CGC, a job file is submitted to complete execution. The job object points to input files and has information about each file (e.g. path, size) as well as metadata associated with the file. For example, if it’s a TCGA file, metadata such as the Case ID, Sample ID, Age, Gender, Investigation, etc., are all available.
Let’s say, for example, that you’re wrapping a tool that takes a BAM file as an input, so you’ve made an Input Port with ‘input_bam’ as the ID. To see the metadata for ‘input_bam’ in a dynamic expression window, you can use the following expression:
$job.inputs.input_bam.metadata
This will return a set of key-value pairs, where the keys are the metadata attributes. You can return the Sample ID for a TCGA BAM file with the following expression:
$job.inputs.input_bam.metadata.sample_id
Remember to replace ‘input_bam’ with the Input Port ID you use in your tool. More documentation on dynamic expressions and how to use them to parse metadata in tools is available here.
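As a rough illustration of the two expressions above, the sketch below mocks the job object in plain JavaScript. The shape of the mock and all metadata values are made up for the example; on the CGC the platform supplies the real object, and the available metadata keys may differ.

```javascript
// Hypothetical stand-in for the job object; all values are illustrative only.
var $job = {
  inputs: {
    input_bam: {
      path: '/path/to/example.bam',
      size: 1024,
      metadata: { case_id: 'CASE-0001', sample_id: 'SAMPLE-0001' }
    }
  }
};

// All metadata key-value pairs for the Input Port 'input_bam'.
var allMetadata = $job.inputs.input_bam.metadata;

// A single metadata attribute, here the Sample ID.
var sampleId = $job.inputs.input_bam.metadata.sample_id;

console.log(Object.keys(allMetadata));
console.log(sampleId);
```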
Please let me know if this answers your question. If not -- I'm happy to clarify further!
Hi Gaurav,
Thank you for the information, it is very helpful. I do have a couple of questions though:
Many thanks,
Hi Franco,
If you have an array of files, the expression will refer to the whole array.
To get information from individual files in the array, you can use the following syntax:
$job.inputs.input_bam[i].metadata.sample_id
where i is the index for the file you want to examine. Note that the array is 0-indexed, so
$job.inputs.input_bam[0].metadata.sample_id
will get the Sample ID for the first BAM file, and so on. You can create an expression that loops over each file to capture the metadata as needed.
I'd be happy to help you with specifics, if you want to comment back with what you concretely wish to do with the metadata.
Thanks again for posting in the forum! Looking forward to hearing from you.
Hi Gaurav,
Thanks for your help again.
I'm trying to execute my tool with any combination of file_name and sample_id for a large number of BAM files.
Example:
tool.py -b file_1 -s sample_id_1 -b file_2 -s sample_id_2 ..... -b file_n -s sample_id_n
or
tool.py -in file_1:sample_id_1 -in file_2:sample_id_2 ...... -in file_n:sample_id_n
I think this should be easy enough now that you have explained the syntax.
Cheers,
Franco
Hi Gaurav,
I ended up using the following expression:
var arrayLength = $job.inputs.bam_in.length;
for (var i = 0; i < arrayLength; i++) {
$job.inputs.bam_in[i].metadata.sample_id
}
Do you think this should work?
Thanks
Hi Franco, I believe this expression will only evaluate to the Sample ID of the last file in your array. I'm going to look into this further and get back to you. Thanks for updating here -- this is a really interesting use case!
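One way to see why: in plain JavaScript, the value of an evaluated script is the value of the last statement that ran, so a loop whose body is a bare expression yields only the final iteration's value. The sketch below reproduces this with eval on a mocked job object. Whether the CGC evaluates expressions in exactly this way is an assumption, and the mock data and function names are hypothetical.

```javascript
// Mocked job object; paths and Sample IDs are illustrative only.
var mockJob = {
  inputs: {
    bam_in: [
      { path: '/path/to/a.bam', metadata: { sample_id: 'sample_id1' } },
      { path: '/path/to/b.bam', metadata: { sample_id: 'sample_id2' } },
      { path: '/path/to/c.bam', metadata: { sample_id: 'sample_id3' } }
    ]
  }
};

// Evaluate the loop the way a script is evaluated: eval returns the value of
// the last statement executed, which is the last iteration's expression.
function evaluateLoop($job) {
  return eval(
    'var arrayLength = $job.inputs.bam_in.length;' +
    'for (var i = 0; i < arrayLength; i++) {' +
    '  $job.inputs.bam_in[i].metadata.sample_id' +
    '}'
  );
}

// Accumulating inside the loop and returning the result keeps every value.
function collectAll($job) {
  var ids = [];
  for (var i = 0; i < $job.inputs.bam_in.length; i++) {
    ids.push($job.inputs.bam_in[i].metadata.sample_id);
  }
  return ids.join(' ');
}

console.log(evaluateLoop(mockJob));
console.log(collectAll(mockJob));
```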
Hi Gaurav,
Thanks for the help.
I can work around this by generating my own metadata file with the API, but it would be great if that step could be avoided.
I also noticed that when configuring the tool settings there is a "Requires SBG metadata extension" check box. Is this related to my issue at all?
Cheers,
Hi Franco,
I think I have a solution for your use case!
In the Input Port for your BAM files (ID: bam_in), you can input the following Dynamic Expression for the Value field:
{
  var commandLine = '';
  for (var i = 0; i < $job.inputs.bam_in.length; i++) {
    commandLine += ' -in ';
    commandLine += $job.inputs.bam_in[i].path.replace(/.*\/|\.[^.]*$/g, '');
    commandLine += ':';
    commandLine += $job.inputs.bam_in[i].metadata.sample_id;
  }
  return commandLine;
}
I'll walk you through the expression. First, we create an empty string called commandLine. In the for loop, this expression then adds the prefix -in, then the basename of the BAM file (note the regular expression that strips the path and the extension from the file path), then the separator :, then the metadata value (Sample ID). The end result is:
-in bam_in-1:sample_id1 -in bam_in-2:sample_id2 -in bam_in-3:sample_id3
... and so on, which gets passed to the command line with return commandLine.
This is the solution for your second example (two posts above). For your first example command line, here's the expression:
{
  var commandLine = '';
  for (var i = 0; i < $job.inputs.bam_in.length; i++) {
    commandLine += ' -b ';
    commandLine += $job.inputs.bam_in[i].path.replace(/.*\/|\.[^.]*$/g, '');
    commandLine += ' -s ';
    commandLine += $job.inputs.bam_in[i].metadata.sample_id;
  }
  return commandLine;
}
Which evaluates to
-b bam_in-1 -s sample_id1 -b bam_in-2 -s sample_id2 -b bam_in-3 -s sample_id3
... and so on. Note that all you have to do for the second solution is change the lines with the prefixes and/or separators.
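Since the two expressions differ only in prefix and separator, they can be tried outside the platform with a small sketch like the one below. The helper names basename and buildArgs, the file paths, and the Sample IDs are hypothetical; only the regular expression and the string-building steps come from the expressions above.

```javascript
// Strip the directory part and the final extension from a file path,
// using the same regular expression as the expressions above.
function basename(path) {
  return path.replace(/.*\/|\.[^.]*$/g, '');
}

// Hypothetical helper: build the argument string for an array of files.
// 'prefix' goes before each file name, 'separator' between name and Sample ID.
function buildArgs(files, prefix, separator) {
  var commandLine = '';
  for (var i = 0; i < files.length; i++) {
    commandLine += ' ' + prefix + ' ';
    commandLine += basename(files[i].path);
    commandLine += separator;
    commandLine += files[i].metadata.sample_id;
  }
  return commandLine;
}

// Mocked input files; paths and Sample IDs are illustrative only.
var files = [
  { path: '/path/to/bam_in-1.bam', metadata: { sample_id: 'sample_id1' } },
  { path: '/path/to/bam_in-2.bam', metadata: { sample_id: 'sample_id2' } }
];

console.log(buildArgs(files, '-in', ':'));
console.log(buildArgs(files, '-b', ' -s '));
```

Note that the regular expression removes only the last extension, so a file such as tumor.sorted.bam keeps the name tumor.sorted.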
Franco, let me know if you have any questions. This was really fun and helpful to work on. Please keep us updated on your progress :)
Hi Franco,
By the way -- make sure to select SBG Metadata and Stage Input (Link) in the Input Port! This will make sure your metadata is available and that the files are linked to the working directory. :)
Hi Gaurav,
That worked really well, though I had to use $self instead of $job in order to pass the validation stage.
Many thanks!
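The thread does not show the final expression, so the following is only a guess at what the $self variant might look like. It assumes that, inside an Input Port's Value expression, $self is bound to that port's own value (here, the array of BAM files); that binding, the wrapper function name valueExpression, and the mock data are all assumptions made for the sketch.

```javascript
// Assumption: $self is the value of the Input Port itself (an array of files),
// so the loop indexes $self directly instead of $job.inputs.bam_in.
function valueExpression($self) {
  var commandLine = '';
  for (var i = 0; i < $self.length; i++) {
    commandLine += ' -in ';
    commandLine += $self[i].path.replace(/.*\/|\.[^.]*$/g, '');
    commandLine += ':';
    commandLine += $self[i].metadata.sample_id;
  }
  return commandLine;
}

// Mocked port value; paths and Sample IDs are illustrative only.
var bamFiles = [
  { path: '/path/to/bam_in-1.bam', metadata: { sample_id: 'sample_id1' } },
  { path: '/path/to/bam_in-2.bam', metadata: { sample_id: 'sample_id2' } }
];

console.log(valueExpression(bamFiles));
```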