TCGA BAM file size inconsistency with GDC?

Posted in TCGA data on the CGC by Yiyun Rao Fri Feb 07 2020 21:47:12 GMT+0000 (Coordinated Universal Time)·1·Viewed 314 times

Hi, I'm doing somatic mutation calling for TCGA patient TCGA-AR-A1AO, using the following BAM files from the Data Browser:

TCGA-AR-A1AO-10A-01D-A12Q-09_IlluminaGA-DNASeq_exome_gdc_realn.bam
uuid: 30f1d9e3-e6a5-44b6-846c-1497806d301c
size: 27.03GB

TCGA-AR-A1AO-01A-01D-A12Q-09_IlluminaGA-DNASeq_exome_gdc_realn.bam
uuid: 33eeb804-ca8b-491e-8221-a285743be692
size: 25.53GB

However, on the GDC portal, the files are listed as 29.02GB and 27.41GB respectively. I wonder whether these files are really up to date, since the file sizes differ and my somatic mutation calling result using VarScan2 is missing variants compared to the GDC results (with the same parameters and inputs). This is confusing, so I am troubleshooting right now. Would you please help me with this?

Thanks!

Best,
Stella
Feb 24, 2020

Hi Yiyun,

Here's the response from our engineering team:

File sizes on the GDC Data Portal are displayed in gigabytes (GB, 10^9 bytes), while the CGC displays file sizes in gibibytes (GiB, 2^30 bytes). For example, the size of TCGA-AR-A1AO-10A-01D-A12Q-09_IlluminaGA-DNASeq_exome_gdc_realn.bam is indeed 29.02 GB, which is equal to 27.03 GiB, as shown on the CGC. Therefore, the file size itself does not differ; only the displayed unit does.
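
To check the conversion for both files, here is a minimal Python sketch. It assumes the standard definitions 1 GB = 10^9 bytes and 1 GiB = 2^30 bytes; the helper name `gb_to_gib` is only illustrative and is not part of any CGC or GDC tool.

```python
# Illustrative helper (hypothetical name), assuming 1 GB = 10**9 bytes
# and 1 GiB = 2**30 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_GIB = 2**30


def gb_to_gib(size_gb: float) -> float:
    """Convert a size in decimal gigabytes to binary gibibytes."""
    return size_gb * BYTES_PER_GB / BYTES_PER_GIB


# Sizes as listed on the GDC Data Portal for the two BAM files in this thread.
gdc_sizes_gb = {"TCGA-AR-A1AO-10A (normal)": 29.02,
                "TCGA-AR-A1AO-01A (tumor)": 27.41}

for name, size_gb in gdc_sizes_gb.items():
    print(f"{name}: {size_gb} GB = {gb_to_gib(size_gb):.2f} GiB")
```

Both converted values should match the sizes shown on the CGC (27.03 and 25.53). Matching sizes alone do not prove the file contents are identical, so this only rules out the unit difference as a cause.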

Feel free to ask if you have more questions.

Thanks,
Marko

  