TCGA BAM file size inconsistency with GDC?

Hi, I'm doing somatic mutation calling for TCGA patient TCGA-AR-A1AO using the BAM files from the Data Browser: TCGA-AR-A1AO-10A-01D-A12Q-09_IlluminaGA-DNASeq_exome_gdc_realn.bam (uuid: 30f1d9e3-e6a5-44b6-846c-1497806d301c, size: 27.03GB) and TCGA-AR-A1AO-01A-01D-A12Q-09_IlluminaGA-DNASeq_exome_gdc_realn.bam (uuid: 33eeb804-ca8b-491e-8221-a285743be692, size: 25.53GB). However, on the GDC portal, the files are 29.02GB and 27.41GB respectively. I wonder whether these files are really up to date, since the file sizes differ and my somatic mutation calling result from Varscan2 is missing variants compared to the GDC results (with the same parameters and inputs). It is confusing, so I am troubleshooting right now. Would you please help me with this? Thanks! Best, Stella

Modify Read-only file in terminal

Hi! I am analyzing the TCGA MAF files using the terminal in Data Cruncher. Some MAF files are in .gz format and I have to unzip them, but they are on a read-only file system. Would you please help me find a solution? Thank you! Best, Yiyun
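
One possible workaround, assuming the input files sit on a read-only mount but some other directory is writable, is to decompress into that writable directory instead of unzipping in place. The sketch below shows the idea; `gunzip_to` is a hypothetical helper, and which directory is writable in a given environment is an assumption to check.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_to(src: Path, out_dir: Path) -> Path:
    """Decompress a .gz file into out_dir, leaving the source untouched."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop the trailing ".gz" to build the output file name.
    name = src.name[:-3] if src.name.endswith(".gz") else src.name + ".out"
    target = out_dir / name
    # Stream the data so large files do not have to fit in memory.
    with gzip.open(src, "rb") as fin, open(target, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return target
```

Usage would look like `gunzip_to(Path("/path/to/file.maf.gz"), Path("/path/to/writable/dir"))`, where both paths are placeholders. The read-only original is only read, never modified.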

Varscan2 workflow from BAM producing too few somatic mutation calls?

Hi! I have recently been using the Varscan2 workflow from BAM to do somatic mutation calling on TCGA GRCh38 BAM files. However, the output high-confidence VCF files are only a few kB in size. One of the patients I was looking at, TCGA-AR-A1AO, has around 6000 mutations called in the MuTect VCF but only 300 mutations in my output. I didn't change any parameters. I wonder if the problem is with the input files, but I was just using the tumor-normal BAM files in TCGA.

Centrifuge custom index

I've used Centrifuge locally to generate an index based on the GenBank database. The output of this operation is 4 files with the *.cf extension. When I try to use them for a Centrifuge run on the cloud, it asks for the index in tar.gz format. What does that mean? I then tried to run the indexing on the cloud, but I've found only a pipeline that uses RefSeq, not GenBank, and in general I would prefer to do it locally to have more freedom. Right now I'm running the "Reference Index Creation" script on the cloud to create an index based on RefSeq and get the output in tar format, but that is not exactly what I want to do. Is there a way to produce the index in tar format with Centrifuge locally, or to use the GenBank database in this cloud app to generate the index?
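
A tar.gz file is simply a gzip-compressed tar archive, so the local *.cf files can be bundled into one. The sketch below does this with Python's standard `tarfile` module. `bundle_index` is a hypothetical helper, and the layout the cloud app expects inside the archive (flat files, as assumed here, versus a subdirectory) is not stated in the question and would need to be checked.

```python
import tarfile
from pathlib import Path


def bundle_index(index_dir: Path, archive: Path) -> list[str]:
    """Pack every *.cf file in index_dir into a gzip-compressed tar archive."""
    members = sorted(index_dir.glob("*.cf"))
    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            # Store files at the top level of the archive (an assumption).
            tar.add(path, arcname=path.name)
    return [p.name for p in members]
```

Calling `bundle_index(Path("my_index_dir"), Path("my_index.tar.gz"))` with placeholder paths returns the names that were archived, which makes it easy to confirm that all index files were picked up.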

How to download many files from a project to a volume?

Greetings. I was wondering if you could help me. I have a number of files in a project that I would like to copy to an Amazon S3 bucket. I have mounted the bucket as a volume. However, it is not clear how to copy the files to the bucket. While https://docs.cancergenomicscloud.org/docs/aws-cloud-storage-tutorial#move-file-from-project shows how to move a particular file, I would like to move many files. Is this something that can be done via the Cancer Genomics Cloud GUI/web interface, or is there some link showing me how it can be done? Many thanks!

STAR genome generate (2.7.0e) error

Hi, I'm using STAR genome generate and STAR from the public apps (both 2.7.0e) to align human RNA-seq data (uploaded privately), with GRCh38.primary_assembly.genome.fa and gencode.v32.annotation.gtf as the reference genome and gene annotation file for genome index generation. I keep getting this error: Command mkdir genomeDir && STAR --runMode genomeGenerate --genomeDir ./genomeDir --runThreadN 20 --genomeChrBinNbits 16 --limitGenomeGenerateRAM 60000000000 --genomeFastaFiles /sbgenomics/workspaces/2bc67190-cbb2-43ba-866b-ca9e77ce024a/tasks/a72a496a-3859-480c-8de1-31c0e332b50e/star_genome_generate_2_7_0e/GRCh38.primary_assembly.genome.fa --sjdbGTFfile /sbgenomics/workspaces/2bc67190-cbb2-43ba-866b-ca9e77ce024a/tasks/a72a496a-3859-480c-8de1-31c0e332b50e/star_genome_generate_2_7_0e/gencode.v32.annotation.gtf && tar -vcf GRCh38.primary_assembly.genome.gencode.v32.annotation.star-2.7.0e-index-archive.tar ./genomeDir && mv Log.out Log.out.log failed with exit code 137. Can someone tell me how to solve this? Thank you!
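
For context on the exit code: on Linux, a status above 128 usually means the process was terminated by a signal, with the signal number equal to the status minus 128. So 137 corresponds to signal 9 (SIGKILL), which is often, though not always, what the kernel sends when a job runs out of memory. The sketch below only decodes the number; `describe_exit_code` is a hypothetical helper, and it cannot tell why the process was killed.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a shell exit status into a short human-readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

Since the command passes `--limitGenomeGenerateRAM 60000000000` (about 60 GB), checking whether the instance running the task has more memory than that may be a reasonable first step.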

Tissue Slides and Gene Expression Data

Hi CGC Team, I am planning to use the tissue slides and the gene expression data from TCGA, so I need the connections between the two. Since the TCGA barcodes sometimes don't match (e.g. if the portion number is different), I want to use the metadata to get this information. I created a query for it and tried to download the connections. But when I export the corresponding file, I can only download up to 3000 lines. That would be enough, but after I drop all duplicates, I only have 100 lines. Is there a way to download the whole table directly from the query? Best regards, Lena

SGDP reference fasta file for BAM files not provided

Hello, I tried to run my own analysis on BAM files from the SGDP project. All my jobs fail because the fasta file I am using is incompatible with the one used to generate the BAM files. Could you please let me know where I can find the correct reference fasta file for this project? My error is this: A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found. reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT] features contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
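
The two contig lists in the error differ in naming style: the reference uses bare names (1, X, MT) while the features file uses "chr"-prefixed names (chr1, chrX, chrM). The sketch below illustrates why no overlap is found and how the overlap changes once names are normalized. It uses a shortened copy of the lists from the error message, and `to_bare_name` is a hypothetical helper, not part of any tool.

```python
def to_bare_name(name: str) -> str:
    """Map a 'chr'-prefixed contig name to the bare style used by the reference."""
    if name == "chrM":
        return "MT"  # the mitochondrial contig is named differently in each style
    return name[3:] if name.startswith("chr") else name


# Shortened versions of the two lists from the error message.
reference = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
features = (
    ["chrM"]
    + [f"chr{i}" for i in range(1, 23)]
    + ["chrX", "chrY", "chr1_gl000191_random"]
)

raw_overlap = set(reference) & set(features)
renamed_overlap = set(reference) & {to_bare_name(c) for c in features}

print(len(raw_overlap), len(renamed_overlap))  # prints "0 25"
```

Matching names does not by itself prove that both files come from the same genome build, so this is only a diagnostic. The features file or the reference still needs to be chosen to match the build the BAM files were aligned to.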

Pull CGC repository images

Hi, I want to know how to pull an image from the CGC repository. Taking STAR as an example, I tried "docker run -ti cgc-images.sbgenomics.com/admin/sbg-public-data/rna-seq-alignment-star-2-5-4b", but it didn't work. Can anyone tell me how to do this? Thanks in advance!

TCGA COAD Expression data

I have downloaded a TCGA COAD expression data file that contains data like that shown below. How are these values calculated, and what do they mean?
Hybridization REF       TCGA-AA-A00E-01A-01R-A002-07
Composite Element REF   log2 lowess normalized (cy5/cy3) collapsed by gene symbol
ELMO2                   -0.201
CREB3L1                 2.3005
RPS11                   -0.080375
PNMA1                   -1.23175
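
Going only by the column header, each value is a log2 ratio of the two dye channels (cy5/cy3) after lowess normalization, collapsed to one value per gene symbol. Under that reading, a value of 0 means equal signal in both channels, positive values mean more cy5 signal, and negative values mean more cy3 signal. Which channel holds the tumor sample and which holds the reference is not stated in the excerpt, so that part is an assumption to verify. The sketch below converts the four example values back to linear ratios; `log2_ratio_to_fold_change` is a hypothetical helper.

```python
def log2_ratio_to_fold_change(value: float) -> float:
    """Convert a log2(cy5/cy3) value back to the linear cy5/cy3 ratio."""
    return 2.0 ** value


# Example values copied from the question.
log2_values = {
    "ELMO2": -0.201,
    "CREB3L1": 2.3005,
    "RPS11": -0.080375,
    "PNMA1": -1.23175,
}

fold = {gene: log2_ratio_to_fold_change(v) for gene, v in log2_values.items()}
for gene, ratio in fold.items():
    print(f"{gene}: cy5/cy3 ratio = {ratio:.3f}")
```

A ratio above 1 corresponds to a positive log2 value and a ratio below 1 to a negative one, which is why CREB3L1 stands out as the most strongly shifted gene in this small sample.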