Some tools on the CGC that process user-specified input files require complementary files in order to execute properly. These complementary files are commonly referred to as secondary files. A secondary file is usually an index file that gives tools indexed random access to the file it accompanies. This means that a tool can access specific portions of the file directly, without having to read through its entire content.
If you are executing an app where at least one input requires a secondary file to be used along with the file it accompanies, follow these guidelines to ensure successful task execution:
- Secondary (index) files need to be added to the same project where the files they accompany are located and where tasks are being executed. If a required index file is not available in the same project, this might result in task failure.
- Secondary (index) files usually don't have to be set explicitly as task inputs. Instead, they are pulled automatically when the task starts, provided that the files they accompany are set properly as task inputs and the secondary files (indices) have been added to the project where the task is being executed.
For example, if a file named sample.bam is used as an input and its index file is also required, all you need to do is make sure that its corresponding sample.bam.bai file is also present in the same project where the execution takes place.
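The naming convention can be checked before a task is started. The following is a minimal sketch that only compares file names; the suffix mapping and the function names are illustrative assumptions based on the conventions described in this article, not part of any CGC API.

```python
from pathlib import PurePosixPath

# Index suffix appended to the full file name, keyed by the input's
# extension. Illustrative assumption, not an exhaustive or official list.
APPENDED_INDEX_SUFFIX = {
    ".bam": ".bai",
    ".cram": ".crai",
    ".fasta": ".fai",
    ".vcf": ".idx",
    ".bgz": ".tbi",
}


def expected_index_name(file_name):
    """Return the conventional index file name, or None if unknown."""
    extension = PurePosixPath(file_name).suffix.lower()
    addition = APPENDED_INDEX_SUFFIX.get(extension)
    return file_name + addition if addition else None


def missing_indices(input_names, project_file_names):
    """List expected index files that are absent from the project."""
    present = set(project_file_names)
    missing = []
    for name in input_names:
        index_name = expected_index_name(name)
        if index_name is not None and index_name not in present:
            missing.append(index_name)
    return missing


print(missing_indices(["sample.bam"], ["sample.bam"]))
```

An empty result suggests that every input with a known convention has its index in the project; it does not verify that the index actually matches the file.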
The most common secondary file types used on the CGC are:
- FAI - Index file format for FASTA files.
- DICT - Sequence dictionaries for FASTA files.
- BAI - Index file format for BAM files.
- CRAI - Index file format for CRAM files.
- TBI - Index file format for tab-delimited files.
- IDX - Index file format for VCF files.
FAI (FASTA Index) files enable indexed random access to FASTA files. They list sequences in the same order as their corresponding FASTA files and have the same name as the FASTA file, with the .fai extension appended (for example, a FASTA file named reference.fasta would have a corresponding index file named reference.fasta.fai). A FAI file is a text file in which each line describes one sequence using five tab-delimited columns:
|NAME|Name of the reference sequence.|
|LENGTH|Total length of this reference sequence, in bases.|
|OFFSET|Offset (in bytes) within the FASTA file of the sequence's first base.|
|LINEBASES|The number of bases on each line.|
|LINEWIDTH|The number of bytes in each line, including the newline.|
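To illustrate how these five columns give random access, the sketch below computes where a given base sits in the FASTA file. It assumes that OFFSET is a byte offset, as in the samtools faidx format, and that every sequence line except the last has the same length; the names `FaiRecord`, `parse_fai_line` and `byte_offset_of_base` are hypothetical.

```python
from typing import NamedTuple


class FaiRecord(NamedTuple):
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int


def parse_fai_line(line):
    """Parse one tab-delimited FAI line into a FaiRecord."""
    name, length, offset, linebases, linewidth = line.rstrip("\n").split("\t")[:5]
    return FaiRecord(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset_of_base(record, position):
    """Return the byte offset in the FASTA file of a 0-based base position."""
    if not 0 <= position < record.length:
        raise ValueError("position is outside the sequence")
    # Whole lines before the base, then the column within its line.
    full_lines, column = divmod(position, record.linebases)
    return record.offset + full_lines * record.linewidth + column


# A 16-base sequence stored 10 bases per line after a 6-byte header line.
record = parse_fai_line("chr1\t16\t6\t10\t11\n")
print(byte_offset_of_base(record, 12))  # 19
```

A tool can then seek straight to that byte instead of reading the file from the start.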
A FAI file can be generated from a FASTA file using one of the suitable tools on the CGC, for example:
- SAMtools Index FASTA
- SBG FASTA Indices
A DICT file describes the content of the corresponding FASTA file. It contains information about contigs in the FASTA file and their sizes (in number of bases). A DICT file must have the same name as its corresponding FASTA file, but the extension must be .dict instead of .fasta (or .fa). For example, if there is a file named reference.fasta, the corresponding DICT file needs to be reference.dict. Tools on the CGC that can be used to generate a DICT file from a FASTA file are, for example:
- Picard CreateSequenceDictionary
- SBG FASTA Indices
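As an illustration of what a DICT file holds, the sketch below derives the expected DICT name and reads contig names and sizes from DICT text. It assumes the usual SAM-header layout, in which each contig is an `@SQ` line with `SN` (name) and `LN` (length) tags; the function names are hypothetical.

```python
from pathlib import PurePosixPath


def dict_name_for(fasta_name):
    """Replace the final extension (.fasta or .fa) with .dict."""
    return str(PurePosixPath(fasta_name).with_suffix(".dict"))


def parse_dict_contigs(dict_text):
    """Return {contig name: length in bases} from the @SQ lines."""
    contigs = {}
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type is TAG:value; split on the
        # first colon only, since values may contain colons themselves.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


example = "@HD\tVN:1.5\n@SQ\tSN:chr1\tLN:16\n@SQ\tSN:chr2\tLN:8\tUR:file:/ref.fasta\n"
print(dict_name_for("reference.fasta"), parse_dict_contigs(example))
```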
BAM files can also be accompanied by an index file that has the same name as the BAM file, with the .bai extension appended (for example, sample.bam.bai). This file allows analysis tools to jump directly to specific alignment records in the BAM file without starting from the first record and visiting all of the records in between. A BAI file has no purpose without its corresponding BAM file, since it does not contain any sequence data.
If you have a BAM file that does not have an accompanying index (BAI) file, the BAI file can be generated using one of the suitable tools on the CGC, for example:
- SAMtools Index BAM
- Picard BuildBamIndex
- Sambamba Index
The most important prerequisite for generating an index (BAI) file is that the BAM file is coordinate-sorted.
A BAI (index) file cannot be generated for an unsorted BAM file.
The BAI file locates alignments in the BAM file using pairs of coordinates:
- One coordinate is the offset of a compressed block within the compressed BAM file.
- The other coordinate is the offset, within that block once it is decompressed, of the particular alignment record that we want to access.
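In the SAM/BAM specification, these two coordinates are packed into a single 64-bit "virtual file offset": the compressed block offset occupies the upper 48 bits and the within-block offset the lower 16 bits. The sketch below shows that packing; the function names are hypothetical and the code does not read an actual BAI file.

```python
def make_virtual_offset(block_offset, within_block_offset):
    """Pack a block offset and a within-block offset into one integer."""
    if not 0 <= within_block_offset < (1 << 16):
        raise ValueError("within-block offset must fit in 16 bits")
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block offset must fit in 48 bits")
    return (block_offset << 16) | within_block_offset


def split_virtual_offset(virtual_offset):
    """Return (block offset in the BAM file, offset inside that block)."""
    return virtual_offset >> 16, virtual_offset & 0xFFFF


packed = make_virtual_offset(1000, 42)
print(packed, split_virtual_offset(packed))  # 65536042 (1000, 42)
```

A reader would seek to the first value in the compressed file, decompress that block, and then skip ahead by the second value.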
CRAI files are external index files for files in the CRAM format (a reference-based compressed alternative to BAM that stores only the differences between the reads and the reference sequence). CRAI files follow the same naming convention as the aforementioned index file formats: they have the same name as the corresponding CRAM file, with the .crai extension appended (for example, sample.cram.crai).
TBI (tabix) is a generic index file format for many widely used tab-delimited formats such as GFF/GTF, BED, SAM and VCF. Tabix files have the .tbi extension, which is appended to the name of the file for which the TBI index is generated (for example, a file named file.vcf.bgz will have a tabix index file named file.vcf.bgz.tbi). A tabix file can be generated from a tab-delimited file that is position-sorted and compressed with bgzip. To compress a tab-delimited file with bgzip and generate a tabix file on the CGC, you can use tools such as:
- Tabix BGZIP (compresses the tab-delimited file using bgzip).
- Tabix Index (generates a tabix index file from a tab-delimited file that is position-sorted and has previously been zipped by bgzip).
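Before compressing and indexing, it can help to confirm that a file is position-sorted. The following is a rough sketch of such a check, assuming a BED-like layout with the sequence name in the first column and the start position in the second; the function name and its parameters are hypothetical, and other formats use different columns.

```python
def is_position_sorted(lines, seq_col=0, start_col=1, comment_prefix="#"):
    """Check that each sequence forms one block with non-decreasing starts."""
    finished = set()   # sequences whose block has already ended
    current = None     # sequence name of the block being read
    last_start = None  # last start position seen in the current block
    for line in lines:
        if not line.strip() or line.startswith(comment_prefix):
            continue
        fields = line.rstrip("\n").split("\t")
        seq, start = fields[seq_col], int(fields[start_col])
        if seq != current:
            if seq in finished:
                return False  # the sequence reappears after another one
            if current is not None:
                finished.add(current)
            current, last_start = seq, start
        elif start < last_start:
            return False      # positions go backwards within a sequence
        else:
            last_start = start
    return True


records = ["chr1\t100\t200\n", "chr1\t150\t300\n", "chr2\t5\t50\n"]
print(is_position_sorted(records))  # True
```

For VCF, the position is in the second column as well, so the defaults also fit; for GFF/GTF the start is in the fourth column (`start_col=3`).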
IDX is the index format used for VCF files. Its purpose is to allow tools to have indexed random access to VCF files. VCF index (IDX) files have the .idx extension, which is appended to the name of the corresponding VCF file (for example, a file named sample.vcf would have an index file named sample.vcf.idx).