Smart Variant Filtering

Overview

Smart Variant Filtering (SVF) uses machine learning algorithms trained on features from the existing Genome In A Bottle (GIAB) variant-called samples (HG001-HG005) to perform variant filtering (classification).

Smart Variant Filtering increases the precision of called SNVs (removes false positives) by up to 0.2% while keeping the overall f-score 0.12-0.27% higher than in existing solutions. Indel precision is increased by up to 7.8%, while the f-score increase is in the range of 0.1% to 3.2%.
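To make the metrics above concrete, here is a minimal sketch of how precision, recall and f-score relate to each other. It assumes the f-score is the usual F1 score (the harmonic mean of precision and recall), which the text does not state explicitly. The counts are hypothetical and are not results from the SVF evaluation.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of called variants that are true variants."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true variants that were called."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts before and after filtering (illustration only).
# Filtering removes 50 false positives at the cost of 10 true positives.
before = {"tp": 9900, "fp": 100, "fn": 100}
after = {"tp": 9890, "fp": 50, "fn": 110}

for label, counts in (("before", before), ("after", after)):
    p = precision(counts["tp"], counts["fp"])
    r = recall(counts["tp"], counts["fn"])
    print(f"{label}: precision={p:.4f} recall={r:.4f} f-score={f_score(**counts):.4f}")
```

In this made-up case, removing false positives raises precision, and the f-score still improves even though a few true variants are lost.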

Use the Smart Variant Filtering public project

All Seven Bridges Platform users automatically have copy permissions for this project. This means you can copy the available data to your own projects on the Platform to execute analyses.

You can either:

Copy the entire project - Start from the copied project and use available apps to filter a VCF file.
Select and copy a subset of the data to your own project - Use the selected data within your own analyses in your project.

Access the Smart Variant Filtering public project

To access the Smart Variant Filtering public project:

Click on Public projects from the top navigation bar.
Click the title of the project, as shown below.

You'll be taken to the main dashboard of the Smart Variant Filtering public project. Alternatively, you can copy the project by clicking Copy project. See below for more information.

Copy the entire project

Click Public projects in the top navigation bar.
Locate the Smart Variant Filtering public project and click Copy project in the lower right corner.

In the pop-up window, you can name your copy of the project, select the billing group and specify whether this project will contain controlled data.
Once you've customized the details, click Copy to copy the entire project.

You'll be redirected to the dashboard of your cloned project when it is ready, as shown below.

The sections below describe the options available once the project is copied.

Filter a VCF file

Click the Apps tab.
Click the run icon next to the Smart Variant Filtering tool.
Click Select files next to a file input and choose the files in the following manner (all input files are available after copying the public project):

Model for filtering SNVs or table to perform learning - choose model_7_features_snv.sav.
Model for filtering indels or table to perform learning - choose model_7_features_indel.sav.
VCF to be filtered - choose the VCF file that you want to filter.

Click Run.

Once the task is completed, the output file, a filtered VCF file created by the tool, will be available in the Output column.

Filter large VCF files

To filter large VCF files, use the Apply Smart Variant Filtering Parallel workflow which performs filtering by parallelizing the process per chromosome. All required input files are available in your project after copying the public project.

Click the Apps tab.
Click the run icon next to the Apply Smart Variant Filtering Parallel workflow, which will create a draft task.
Click Select file(s) next to a file input and choose the files in the following manner:

dbsnp - choose dbsnp_147.tab.vcf.gz
genome_bed_file_for_scatter - choose human_g1k_v37_decoy.breakpoints.bed
indel_model - choose model_6_features_indel.sav
reference - choose human_g1k_v37_decoy.fasta
snv_model - choose model_6_features_snv.sav
vcf - choose the VCF that you want to filter.

Click Run.

Once the task is completed, the output file, a filtered VCF file created by the workflow, will be available in the Output column.
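The idea of parallelizing per chromosome can be sketched as follows. This is only an illustration of the general scatter/gather pattern, not the workflow's actual implementation: the workflow takes a BED file for scattering, and its internals are not described here. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split_vcf_by_chromosome(lines: Iterable[str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Separate VCF header lines from records and group records by CHROM."""
    header: List[str] = []
    records: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
        else:
            # CHROM is the first tab-separated column of a VCF record.
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return header, dict(records)


def merge_chunks(header: List[str], chunks: Dict[str, List[str]], order: List[str]) -> List[str]:
    """Reassemble per-chromosome chunks into one VCF, in the given order."""
    merged = list(header)
    for chrom in order:
        merged.extend(chunks.get(chrom, []))
    return merged


# Tiny made-up VCF used only to exercise the functions.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "1\t200\t.\tC\tT\t40\tPASS\t.",
    "2\t150\t.\tG\tA\t60\tPASS\t.",
]

header, chunks = split_vcf_by_chromosome(example)
# Each chunk could now be filtered independently (for example, one job per
# chromosome) before the results are merged back together.
merged = merge_chunks(header, chunks, order=["1", "2"])
print(len(chunks), "chunks,", len(merged), "lines after merge")
```

Because each chunk shares the same header and contains records from a single chromosome, the chunks can be processed independently and concatenated afterwards.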

Train a model

To train a model that will be used for filtering a VCF, use the Smart Variant Filtering tool and provide it with tables which contain the features. All required input files are available in your project after copying the public project.

Click the Apps tab.
Click the run icon next to the Smart Variant Filtering tool.
Click Select files next to a file input and choose the files in the following manner:

Model for filtering SNVs or table to perform learning - choose annotated_HG003_oslo_exome.tab.vcf_SNVs.table
Model for filtering indels or table to perform learning - choose annotated_HG003_oslo_exome.tab.vcf_indels.table
VCF to be filtered - choose the VCF file that you want to filter.

In the App Settings column, specify the machine learning algorithms:

Machine learning algorithm for SNVs and its params - enter the classifier followed by its parameter values, as comma-separated values (e.g. MLP,250,logistic,sgd)
Machine learning algorithm for Indels and its params - enter the classifier followed by its parameter values, as comma-separated values (e.g. MLP,250,logistic,sgd)

Click Run.

Once the task is completed, the output file will be available in the Output column. The result is a trained model for both SNVs and indels.

Supported classifiers and parameter sets

The currently supported classifiers and their parameters are listed in the table below.

Classifier	Parameter set
`ADA`	`n_estimators,learning_rate,algorithm`
`KNN`	`neighbors,algorithms,p_distance`
`SVM`	`C,kernels`
`RF`	`n_estimators,criterion`
`QD`	`tol`
`MLP`	`hidden_layer_sizes,activation,solver`
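As a rough illustration of how a setting such as MLP,250,logistic,sgd lines up with the table above, the sketch below pairs each value with a parameter name by position. It assumes that values are given in the order listed in the table and that every parameter is supplied; the tool's actual parsing rules and defaults are not documented here. The function name is hypothetical.

```python
from typing import Dict, Tuple

# Parameter names per classifier, copied from the table above.
PARAMETER_SETS = {
    "ADA": ["n_estimators", "learning_rate", "algorithm"],
    "KNN": ["neighbors", "algorithms", "p_distance"],
    "SVM": ["C", "kernels"],
    "RF": ["n_estimators", "criterion"],
    "QD": ["tol"],
    "MLP": ["hidden_layer_sizes", "activation", "solver"],
}


def parse_algorithm_setting(setting: str) -> Tuple[str, Dict[str, str]]:
    """Split 'CLASSIFIER,value1,value2,...' and name the values by position."""
    parts = [part.strip() for part in setting.split(",")]
    classifier, values = parts[0], parts[1:]
    if classifier not in PARAMETER_SETS:
        raise ValueError(f"Unsupported classifier: {classifier!r}")
    names = PARAMETER_SETS[classifier]
    # Assumption: all parameters must be present, in table order.
    if len(values) != len(names):
        raise ValueError(
            f"{classifier} expects {len(names)} values ({', '.join(names)}), got {len(values)}"
        )
    return classifier, dict(zip(names, values))


classifier, params = parse_algorithm_setting("MLP,250,logistic,sgd")
print(classifier, params)
```

Values are kept as strings here; converting them to numbers where appropriate is left out because the expected types are not specified.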

Train a model, filter variants and test the results

The entire process of training a model, applying a variant filter and benchmarking the obtained results can be done by running the Smart Variant Filtering - Train, filter and test workflow. All required input files are available in your project after copying the public project.

Click the Apps tab.
Click the run icon next to the Smart Variant Filtering - Train, filter and test workflow.
Click Select files next to a file input and choose the files in the following manner:

dbsnp - choose dbsnp_147.tab.vcf.gz
genome_bed_file - choose human_g1k_v37_decoy.breakpoints.bed
indel_tables - choose the following files:
- annotated_ERR17432.tab.150x.vcf_indels.table
- annotated_HG001-NA12878-50x.tab.vcf_indels.table
- annotated_HG003.tab.hs37d5.60x.1.converted.vcf_indels.table
- annotated_HG004.tab.hs37d5.60x.1.converted.vcf_indels.table
- annotated_HG005.tab.150424_S1.vcf_indels.table
- annotated_NA12878_CEPH_30x_ERR194147.tab.vcf_indels.table
- annotated_NA12878_V2.tab.5_Robot_1_R.vcf_indels.table
reference - choose human_g1k_v37_decoy.fasta
region_bed_for_vcf_benchmark - choose HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed
sdf_template - choose 1000g_v37_phase2.sdf.zip
snv_tables - choose the following files:
- annotated_ERR17432.tab.150x.vcf_SNVs.table
- annotated_HG001-NA12878-50x.tab.vcf_SNVs.table
- annotated_HG003.tab.hs37d5.60x.1.converted.vcf_SNVs.table
- annotated_HG004.tab.hs37d5.60x.1.converted.vcf_SNVs.table
- annotated_HG005.tab.150424_S1.vcf_SNVs.table
- annotated_NA12878_CEPH_30x_ERR194147.tab.vcf_SNVs.table
- annotated_NA12878_V2.tab.5_Robot_1_R.vcf_SNVs.table
truth_vcf - choose HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_triophased.vcf
truthset_bedfile - choose HG002_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-22_v.3.3.2_highconf_noinconsistent.bed
vcf - choose HG002-NA24385-50x.vcf

Click Run.

Once the task is completed, the results will be trained models for SNVs and indels, a filtered VCF, and precision/recall metrics computed against the truth set VCF.
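The benchmarking step can be pictured as a comparison between the variants in the filtered VCF and those in the truth set. The sketch below is a deliberately simplified illustration, not the workflow's method: it matches variants exactly on chromosome, position, reference and alternate allele, ignores the FILTER column, and does not restrict the comparison to the high-confidence regions, all of which a real benchmarking tool would handle more carefully. The function names and the data are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_keys(vcf_lines: Iterable[str]) -> Set[Variant]:
    """Collect one key per alternate allele from VCF record lines."""
    keys: Set[Variant] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_to_truth(called: Set[Variant], truth: Set[Variant]) -> Dict[str, int]:
    """Count true positives, false positives and false negatives."""
    return {
        "tp": len(called & truth),
        "fp": len(called - truth),
        "fn": len(truth - called),
    }


# Made-up records for illustration only.
truth_lines = [
    "1\t100\t.\tA\tG\t.\tPASS\t.",
    "1\t200\t.\tC\tT\t.\tPASS\t.",
    "2\t150\t.\tG\tA\t.\tPASS\t.",
]
called_lines = [
    "1\t100\t.\tA\tG\t50\tPASS\t.",
    "2\t150\t.\tG\tA\t60\tPASS\t.",
    "2\t300\t.\tT\tC\t20\tPASS\t.",
]

counts = compare_to_truth(variant_keys(called_lines), variant_keys(truth_lines))
print(counts)
```

Precision is then tp / (tp + fp) and recall is tp / (tp + fn), computed from these counts.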

Use a subset of the data

Instead of cloning the entire project, you can choose to select and copy a subset of the data.

Access the public project by selecting Smart Variant Filtering from Public projects and clicking on its title in the public projects gallery. You'll be taken to the project dashboard of the Smart Variant Filtering public project, as shown below.

Click the Files tab in the upper left corner. This will take you to the Files page for the Smart Variant Filtering project, as shown below.

Filter the files or search them by:

Keywords - You can use the search bar at the top of the page to find files by entering the file name or notes associated with a file.
Metadata fields - Next to the search bar, you will see drop-down menus for the metadata fields Investigation, File extension, and Sample ID. Selecting a particular metadata value from one of these menus displays only files that match the value.

You can choose specific files by selecting the corresponding checkbox in front of the file name.
Select as many files as you desire and click Copy to.
Select a project from the drop-down menu.

Now, you can start using the Smart Variant Filtering files you've added to your personal project in your own analysis.