Tissue Slides and Gene Expression Data

Hi CGC Team, I am planning to use the tissue slides and the gene expression data from TCGA, so I need the links between the two. Since the TCGA barcodes sometimes don't match (e.g. if the portion number is different), I want to use the metadata to get this information. I created a query for it and tried to download the results. However, when I export the corresponding file, I can only download up to 3000 lines. That would be enough, but after I drop all duplicates, only 100 lines remain. Is there a way to download the whole table directly from the query? Best regards, Lena
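The deduplication and barcode comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the export is taken to be a CSV with one slide barcode and one expression barcode per row, the column names `slide_barcode` and `expression_barcode` and the helpers `sample_key` and `unique_pairs` are hypothetical, and the slide barcodes are made up for the example. It relies on the TCGA barcode layout in which the first four dash-separated fields identify the sample and vial, so barcodes that differ only in the portion or later fields map to the same key.

```python
import csv
import io


def sample_key(barcode: str) -> str:
    """Return the sample-level prefix (project-TSS-participant-sample+vial)."""
    return "-".join(barcode.split("-")[:4])


def unique_pairs(rows):
    """Drop duplicate (slide, expression) pairs while keeping first-seen order."""
    seen = set()
    result = []
    for row in rows:
        pair = (row["slide_barcode"], row["expression_barcode"])
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical export with a repeated row; slide barcodes are illustrative only.
EXPORT = """slide_barcode,expression_barcode
TCGA-AA-A00E-01A-01-TS1,TCGA-AA-A00E-01A-01R-A002-07
TCGA-AA-A00E-01A-01-TS1,TCGA-AA-A00E-01A-01R-A002-07
TCGA-AA-A00E-01A-02-BS2,TCGA-AA-A00E-01A-01R-A002-07
"""

pairs = unique_pairs(csv.DictReader(io.StringIO(EXPORT)))
for slide, expr in pairs:
    # The same key means both files come from the same sample and vial.
    print(slide, expr, sample_key(slide) == sample_key(expr))
```

Matching on the truncated key sidesteps differing portion numbers, at the cost of treating all portions of one vial as equivalent.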

SGDP reference fasta file for BAM files not provided

Hello, I tried to run my own analysis on BAM files from the SGDP project. Because my reference fasta file is incompatible with the one used to generate the BAM files, all my jobs fail with an error. Could you please let me know where I can find the correct reference fasta file for this project? My error is this:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
features contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
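The two contig lists in the error differ in naming style: the reference uses unprefixed names (1, X, MT) while the features file uses chr-prefixed names (chr1, chrX, chrM). A minimal sketch of how such a mismatch could be checked is shown here. The helpers `strip_chr` and `overlap` are hypothetical, and the lists are shortened versions of those in the error message. Note that matching names does not guarantee matching sequences, so this only diagnoses the naming difference and does not replace using the correct reference.

```python
def strip_chr(name: str) -> str:
    """Map chr-prefixed names (chr1, chrM) to the unprefixed style (1, MT)."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def overlap(reference, features, normalize=lambda n: n):
    """Return contig names shared by both lists after optional normalization."""
    return sorted({normalize(n) for n in reference} & {normalize(n) for n in features})


# Shortened versions of the two lists from the error message.
reference_contigs = ["1", "2", "X", "Y", "MT"]
features_contigs = ["chrM", "chr1", "chr2", "chrX", "chrY", "chrUn_gl000211"]

# Compare the names as they are, then again after stripping the prefix.
print(overlap(reference_contigs, features_contigs))
print(overlap(reference_contigs, features_contigs, strip_chr))
```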

Pull CGC repository images

Hi, I want to know how to pull an image from the CGC repository. Taking STAR as an example, I tried "docker run -ti cgc-images.sbgenomics.com/admin/sbg-public-data/rna-seq-alignment-star-2-5-4b", but it didn't work. Can anyone tell me how to do this? Thanks in advance!

TCGA COAD Expression data

I have downloaded a TCGA COAD expression data file that contains data like the excerpt shown below. How are these values calculated, and what do they mean?
Hybridization REF        TCGA-AA-A00E-01A-01R-A002-07
Composite Element REF    log2 lowess normalized (cy5/cy3) collapsed by gene symbol
ELMO2                    -0.201
CREB3L1                  2.3005
RPS11                    -0.080375
PNMA1                    -1.23175
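Going by the column header in the excerpt, each value is a log2 ratio of the Cy5 to the Cy3 channel after lowess normalization, collapsed to one value per gene symbol. A small sketch of converting such values back to linear ratios follows. The helper `to_linear_ratio` is hypothetical, and which sample was labeled with which dye is not stated in the excerpt, so the direction labels refer only to the channels.

```python
# Values copied from the excerpt above.
log2_ratios = {
    "ELMO2": -0.201,
    "CREB3L1": 2.3005,
    "RPS11": -0.080375,
    "PNMA1": -1.23175,
}


def to_linear_ratio(log2_ratio: float) -> float:
    """Undo the log2 transform: ratio = 2 ** log2(ratio)."""
    return 2.0 ** log2_ratio


for gene, value in log2_ratios.items():
    ratio = to_linear_ratio(value)
    # Positive values mean a stronger Cy5 signal, negative a stronger Cy3 signal.
    direction = "higher in cy5" if value > 0 else "higher in cy3"
    print(f"{gene}: log2={value:+.4f} ratio={ratio:.3f} ({direction})")
```

A value of 0 corresponds to equal signal in both channels, and each unit step up or down is a doubling or halving of the ratio.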

HER2 status not consistent

I got the HER2, ER, and PR status of BRCA patients from the supplementary material of the BRCA TCGA publication:
https://www.nature.com/articles/nature11412#supplementary-information
filename: TCGA_Supplementary Tables 1-4.csv
column: HER2_Final_Status
Since that publication is from 2012, some months later I downloaded the newer clinical data from cbioportal**:
http://www.cbioportal.org/study?id=brca_tcga#clinical
filename: data_bcr_clinical_data_patient.txt
columns: IHC-Status, HER2 fish status
However, I noticed that hundreds of patients have a different HER2 status than they had in the old TCGA publication:
Barcode         IHC-status (cbioportal)   HER2 fish status (cbioportal)   HER2_Final_Status (tcga publication)
TCGA-A1-A0SH    Equivocal                 Negative                        Negative
TCGA-A2-A04U    Negative                  Positive                        Negative
TCGA-A2-A0T2    Negative                  Not Evaluated                   Negative
TCGA-A8-A06R    Positive                  Positive                        Equivocal
At first, I thought there was a method to combine the IHC and FISH statuses into one final status, but I failed to find such a method. In TCGA-A1-A0SH, it seems FISH is preferred. In TCGA-A2-A04U, it seems IHC is preferred. In TCGA-A8-A06R, it is neither.
**I verified that for those 4 patients, the HER2 status is consistent between cbioportal and the current TCGA clinical files (her2_fish_status and her2_status_by_ihc in the clinical files).
Thanks in advance, Maor
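The comparison made by hand for the four patients can be tabulated with a short script like the one here. It is only a sketch: the helper `agreement` is hypothetical, and it merely reports which assay the published final status agrees with. It does not reproduce whatever rule the publication used to derive HER2_Final_Status.

```python
# Rows copied from the table in the post:
# (barcode, IHC status, FISH status, HER2_Final_Status from the 2012 publication)
ROWS = [
    ("TCGA-A1-A0SH", "Equivocal", "Negative", "Negative"),
    ("TCGA-A2-A04U", "Negative", "Positive", "Negative"),
    ("TCGA-A2-A0T2", "Negative", "Not Evaluated", "Negative"),
    ("TCGA-A8-A06R", "Positive", "Positive", "Equivocal"),
]


def agreement(ihc: str, fish: str, final: str) -> str:
    """Describe which of the two assays the final status agrees with."""
    matches = [name for name, status in (("IHC", ihc), ("FISH", fish)) if status == final]
    return " and ".join(matches) if matches else "neither"


for barcode, ihc, fish, final in ROWS:
    print(barcode, agreement(ihc, fish, final))
```

Run over the full patient list, the same tabulation would show how often the final status follows IHC, FISH, both, or neither.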

Why are there no nucleotides at the reference genome positions after SAMtools Mpileup?

Hi! I have been creating a workflow that consists of three main steps (SAMtools View, SAMtools faidx, and SAMtools Mpileup). First, with SAMtools View I filtered the input BAM file based on a BED file that contains specific regions of chromosomes 3 and 15. The input BAM file, already aligned and sorted, was taken from the database. The BED file was uploaded from my computer, and its first line looks like this: (3 193593144 193697811). After that, SAMtools Mpileup took the file containing only the necessary chromosomes and the indexed reference (as a reference, I used ucsc.hg19.fasta from the database). At the end of the workflow, I expect to see a VCF file that contains information about the reference and alternative nucleotides for chromosomes 3 and 15. I do get such a file, but there are N in place of the reference allele. Please help me understand what is wrong with my reference file.

Costs associated with using data on an AWS volume

I've set up a Volume to access files from a bucket I have under my own AWS account and copied a file into a project. Does this copy incur storage charges of its own?

Unable to extract results/outputs of the tool

Hi, I've built a simple tool using R, pushed it to the CGC repository, and ran it. However, I have an issue with retrieving the outputs. It seems the system cannot find them. For now, the tool produces two files: one called model.pdf and another called model.txt. I set the glob values (outputs tab in the tool editor) to model.pdf and model.txt since their names are static. My initial guess was that the working directory is wrong, but I couldn't find more details in the documentation.
- How do I know what the working directory of the tool is for the current analysis? Can I extract this information from the job or self variable?
- Should I provide information to mount a specific directory?
- What else could be wrong?
For instance, I can successfully retrieve the stdout.txt (captured standard output) file. Also, the job.tree.log file shows stdout.txt as available, but neither model.pdf nor model.txt. Thank you for the help.
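The relationship between glob values and the working directory can be illustrated locally. This sketch assumes that output globs are evaluated relative to the job's working directory, as in CWL, and the helper `collect_outputs` and the `plots` subdirectory are hypothetical. It simulates a tool that writes one file into the working directory and the other into a subdirectory, then shows which glob patterns find them.

```python
import tempfile
from pathlib import Path


def collect_outputs(workdir: Path, patterns):
    """Return files under workdir matching each glob pattern, relative to workdir."""
    return {
        pattern: sorted(f.relative_to(workdir).as_posix() for f in workdir.glob(pattern))
        for pattern in patterns
    }


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # Simulate a tool that writes one file into the working directory
    # and the other into a subdirectory.
    (workdir / "model.txt").write_text("fit summary\n")
    (workdir / "plots").mkdir()
    (workdir / "plots" / "model.pdf").write_bytes(b"%PDF-")

    found = collect_outputs(workdir, ["model.txt", "model.pdf", "**/model.pdf"])
    print(found)
```

A plain file name only matches files written directly into the working directory, so a script that changes directory or writes to an absolute path would leave nothing for the glob to pick up.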

Clarification for TCGA data

I have trouble matching WSI slides to their grade or TNM stage. For example, patient TCGA-BC-A110 has three slide samples:
Sample TCGA-BC-A110-01Z (Primary Tumor)
Sample TCGA-BC-A110-01A (Primary Tumor)
Sample TCGA-BC-A110-11A (Normal tissue)
Question 1: Is it correct that samples ending with A were all sampled together?
Question 2: Can I know which were sampled first, the samples ending with A or with Z? [A pathology report exists only for A, with a conclusion of Grade I. The clinical file nationwidechildrens.org_clinical.TCGA-BC-A110.xml states that the patient had grade I cancer and later a recurrence. Does that mean A is the first tumor event and Z the second?]
Question 3: I noticed that pathology reports are never available for Z samples, only for A. Is there a reason?
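For reference, the sample field of each barcode above combines a two-digit sample type code with a vial letter. A minimal sketch of separating the two is shown here. The helper `split_sample_barcode` is hypothetical, and the labels are the ones given in the post. The code only separates the fields and does not by itself answer which sample was collected first.

```python
# Sample type codes that appear in the example; the full TCGA code table has more entries.
SAMPLE_TYPES = {"01": "Primary Tumor", "11": "Normal tissue"}


def split_sample_barcode(barcode: str):
    """Split a sample-level barcode into participant, sample type code, and vial."""
    fields = barcode.split("-")
    participant = "-".join(fields[:3])
    sample_field = fields[3]  # e.g. "01Z": type code "01", vial "Z"
    return participant, sample_field[:2], sample_field[2:]


for barcode in ["TCGA-BC-A110-01Z", "TCGA-BC-A110-01A", "TCGA-BC-A110-11A"]:
    participant, type_code, vial = split_sample_barcode(barcode)
    print(participant, SAMPLE_TYPES.get(type_code, "unknown"), vial)
```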

Secondary files not loaded in batch runs

Hi! I have a tool that analyzes .bam files and requires .bai files as secondary files. The secondary file settings are OK, as the tool works well when it is run on a single file. However, when I try to run a batch task, it doesn't work and the error log says "unable to find index file for example.bam" or something similar. Is this a known bug? Is there a way to overcome this problem? Thanks a lot!
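One way to narrow this down is to check, inside the tool or on a local copy of the inputs, whether every BAM file actually has an index next to it. The sketch here uses a hypothetical helper `missing_indexes` and assumes the index is named either `name.bam.bai` or `name.bai`. It does not reproduce the platform's batch logic. It only lists BAM files without a sibling index.

```python
import tempfile
from pathlib import Path


def missing_indexes(bam_paths):
    """Return names of BAM files with no index next to them (name.bam.bai or name.bai)."""
    missing = []
    for bam in map(Path, bam_paths):
        candidates = [bam.with_name(bam.name + ".bai"), bam.with_suffix(".bai")]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam.name)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)
    # Two indexed BAM files (one per naming style) and one without an index.
    for name in ["a.bam", "a.bam.bai", "b.bam", "b.bai", "c.bam"]:
        (directory / name).touch()
    result = missing_indexes(sorted(directory.glob("*.bam")))
    print(result)
```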