Smart Variant Filtering

Overview

Smart Variant Filtering (SVF) uses machine learning algorithms trained on features from the existing Genome In A Bottle (GIAB) variant-called samples (HG001-HG005) to perform variant filtering (classification).

Smart Variant Filtering increases the precision of called SNVs (removes false positives) by up to 0.2% while keeping the overall f-score 0.12-0.27% higher than in existing solutions. Indel precision increases by up to 7.8%, and the f-score increases by 0.1-3.2%.

Use the Smart Variant Filtering public project

All Seven Bridges Platform users automatically have copy permissions for this project. This means you can copy the available data to your own projects on the Platform to execute analyses.

You have two options:

  • Copy the entire project - Start from the copied project and use available apps to filter a VCF file.
  • Select and copy a subset of the data to your own project - Use the selected data within your own analyses in your project.

Access the Smart Variant Filtering public project

To access the Smart Variant Filtering public project:

  1. Click on Public projects from the top navigation bar.
  2. Click the title of the project, as shown below.

You'll be taken to the main dashboard of the Smart Variant Filtering public project. Alternatively, you can copy the project by clicking Copy project. See below for more information.

Copy the entire project

  1. Click Public projects in the top navigation bar.
  2. Locate the Smart Variant Filtering public project and click Copy project in the lower right corner.

  3. In the pop-up window, you can name your copy of the project, select the billing group and specify whether this project will contain controlled data.
  4. Once you've customized the details, click Copy to copy the entire project.

You'll be redirected to the dashboard of your cloned project when it is ready, as shown below.

The sections below describe the options available once the project is copied.

Filter a VCF file

  1. Click the Apps tab.
  2. Click the run icon next to the Smart Variant Filtering tool.
  3. Click Select files next to a file input and choose the files in the following manner (all input files are available after copying the public project):
  • Model for filtering SNVs or table to perform learning - choose model_7_features_snv.sav.
  • Model for filtering indels or table to perform learning - choose model_7_features_indel.sav.
  • VCF to be filtered - choose the VCF file that you want to filter.
  4. Click Run.

Once the task is completed, the output file, a filtered VCF file created by the tool, will be available in the Output column.

Filter large VCF files

To filter large VCF files, use the Apply Smart Variant Filtering Parallel workflow which performs filtering by parallelizing the process per chromosome. All required input files are available in your project after copying the public project.

  1. Click the Apps tab.
  2. Click the run icon next to the Apply Smart Variant Filtering Parallel workflow, which will create a draft task.
  3. Click Select file(s) next to a file input and choose the files in the following manner:
  • dbsnp - choose dbsnp_147.tab.vcf.gz
  • genome_bed_file_for_scatter - choose human_g1k_v37_decoy.breakpoints.bed
  • indel_model - choose model_6_features_indel.sav
  • reference - choose human_g1k_v37_decoy.fasta
  • snv_model - choose model_6_features_snv.sav
  • vcf - choose the VCF that you want to filter.
  4. Click Run.

Once the task is completed, the output file, a filtered VCF file created by the workflow, will be available in the Output column.
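
The idea of parallelizing per chromosome can be pictured with a small Python sketch. This is only an illustration of the concept, not the workflow's implementation: the actual workflow scatters using the BED file supplied as genome_bed_file_for_scatter, and the function name split_vcf_by_chromosome and the example records are hypothetical.

```python
from collections import defaultdict


def split_vcf_by_chromosome(lines):
    """Group VCF records by their CHROM column.

    Header lines (starting with '#') are returned separately so that they
    could be prepended to every per-chromosome chunk.
    """
    header = []
    records = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)
        else:
            chrom = line.split("\t", 1)[0]  # CHROM is the first VCF column
            records[chrom].append(line)
    return header, dict(records)


# Made-up records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t50\t.\t.",
    "1\t2000\t.\tC\tT\t60\t.\t.",
    "2\t1500\t.\tG\tGA\t40\t.\t.",
]

header, by_chrom = split_vcf_by_chromosome(example)
for chrom, recs in by_chrom.items():
    print(chrom, len(recs))
```

Each per-chromosome chunk could then be filtered independently and the results merged, which is the general pattern a scatter/gather workflow follows.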

Train a model

To train a model that will be used for filtering a VCF, use the Smart Variant Filtering tool and provide it with tables which contain the features. All required input files are available in your project after copying the public project.

  1. Click the Apps tab.
  2. Click the run icon next to the Smart Variant Filtering tool.
  3. Click Select files next to a file input and choose the files in the following manner:
  • Model for filtering SNVs or table to perform learning - choose annotated_HG003_oslo_exome.tab.vcf_SNVs.table
  • Model for filtering indels or table to perform learning - choose annotated_HG003_oslo_exome.tab.vcf_indels.table
  • VCF to be filtered - choose the VCF file that you want to filter.
  4. In the App Settings column, specify the machine learning algorithms:
  • Machine learning algorithm for SNVs and its params - enter the classifier and its parameter set as comma-separated values (e.g. MLP,250,logistic,sgd)
  • Machine learning algorithm for Indels and its params - enter the classifier and its parameter set as comma-separated values (e.g. MLP,250,logistic,sgd)
  5. Click Run.

Once the task is completed, the output file will be available in the Output column. The result is a trained model for both SNVs and indels.

Supported classifiers and parameter sets

The currently supported classifiers and their parameters are listed in the table below.

Classifier    Parameter set
ADA           n_estimators, learning_rate, algorithm
KNN           neighbors, algorithms, p_distance
SVM           C, kernels
RF            n_estimators, criterion
QD            tol
MLP           hidden_layer_sizes, activation, solver
Train a model, filter variants and test the results

The entire process of training a model, applying a variant filter and benchmarking the obtained results can be done by running the Smart Variant Filtering - Train, filter and test workflow. All required input files are available in your project after copying the public project.

  1. Click the Apps tab.
  2. Click the run icon next to the Smart Variant Filtering - Train, filter and test workflow.
  3. Click Select files next to a file input and choose the files in the following manner:
  • dbsnp - choose dbsnp_147.tab.vcf.gz
  • genome_bed_file - choose human_g1k_v37_decoy.breakpoints.bed
  • indel_tables - choose the following files:
    • annotated_ERR17432.tab.150x.vcf_indels.table
    • annotated_HG001-NA12878-50x.tab.vcf_indels.table
    • annotated_HG003.tab.hs37d5.60x.1.converted.vcf_indels.table
    • annotated_HG004.tab.hs37d5.60x.1.converted.vcf_indels.table
    • annotated_HG005.tab.150424_S1.vcf_indels.table
    • annotated_NA12878_CEPH_30x_ERR194147.tab.vcf_indels.table
    • annotated_NA12878_V2.tab.5_Robot_1_R.vcf_indels.table
  • reference - choose human_g1k_v37_decoy.fasta
  • region_bed_for_vcf_benchmark - choose HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed
  • sdf_template - choose 1000g_v37_phase2.sdf.zip
  • snv_tables - choose the following files:
    • annotated_ERR17432.tab.150x.vcf_SNVs.table
    • annotated_HG001-NA12878-50x.tab.vcf_SNVs.table
    • annotated_HG003.tab.hs37d5.60x.1.converted.vcf_SNVs.table
    • annotated_HG004.tab.hs37d5.60x.1.converted.vcf_SNVs.table
    • annotated_HG005.tab.150424_S1.vcf_SNVs.table
    • annotated_NA12878_CEPH_30x_ERR194147.tab.vcf_SNVs.table
    • annotated_NA12878_V2.tab.5_Robot_1_R.vcf_SNVs.table
  • truth_vcf - choose HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf
  • truthset_bedfile - choose HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed
  • vcf - choose HG002-NA24385-50x.vcf
  4. Click Run.

Once the task is completed, its results are trained models for SNVs and indels, a filtered VCF, and precision/recall metrics computed against the truth set VCF.
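
For readers less familiar with these metrics, the following Python sketch shows how precision, recall and f-score are conventionally derived from true positive, false positive and false negative counts. It assumes the f-score is the F1 score (the harmonic mean of precision and recall), which this page does not state explicitly; the function name and the counts are made up for illustration and are not results produced by the workflow.

```python
def precision_recall_fscore(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    tp: called variants that are also in the truth set
    fp: called variants that are not in the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        fscore = 0.0
    else:
        # ASSUMPTION: f-score means F1, the harmonic mean of the two.
        fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore


# Illustrative counts only.
precision, recall, fscore = precision_recall_fscore(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f-score={fscore:.3f}")
```

Filtering that removes false positives raises precision, while any true variants removed along the way lower recall; the f-score summarizes that trade-off in a single number.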

Use a subset of the data

Instead of cloning the entire project, you can choose to select and copy a subset of the data.

  1. Access the public project by selecting Smart Variant Filtering from Public projects and clicking on its title in the public projects gallery. You'll be taken to the project dashboard of the Smart Variant Filtering public project, as shown below.
  2. Click the Files tab in the upper left corner. This will take you to the Files page for the Smart Variant Filtering project, as shown below.
  3. Filter the files or search them by:
  • Keywords - You can use the search bar at the top of the page to find files by entering the file name or notes associated with a file.
  • Metadata fields - Next to the search bar, you will see drop-down menus for the metadata fields Investigation, File extension, and Sample ID. Selecting a particular metadata value from one of these menus displays only files that match the value.
  4. Choose specific files by selecting the corresponding checkbox in front of the file name.
  5. Select as many files as you need and click Copy to.
  6. Select a project from the drop-down menu.

Now, you can start using the Smart Variant Filtering files you've added to your personal project in your own analysis.