
Predicting True Low-VAF Intrahost SNVs in Human Papillomavirus

Table of Contents

  1. Installation
  2. Repository Contents
  3. Usage
  4. License
  5. Contact
  6. Acknowledgements

This repository includes code and datasets pertaining to the methods we developed for identifying true intrahost single nucleotide variants (iSNVs) in human papillomavirus (HPV).

Installation

The code in this repository is written in Python (v3.8). You will need access to a Linux terminal (a Windows 10 machine with Windows Subsystem for Linux enabled and Ubuntu installed also works) to execute the code. The specific instructions below work best on Ubuntu 18.04/20.04. Follow the steps below to install the required packages and run the scripts.

  1. Clone the repo

  git clone https://github.com/NCI-CGR/HPV_low_VAF_SNV_prediction.git

  2. Install miniconda

  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  chmod +x Miniconda3-latest-Linux-x86_64.sh
  ./Miniconda3-latest-Linux-x86_64.sh

  Follow the on-screen instructions to complete the installation of miniconda and to initialize conda.

  3. Create a conda environment

  Create a conda environment that includes all the packages required for running the repository code. We will use the hpv_triplicate_env.yaml file for this purpose.

  conda env create -f hpv_triplicate_env.yaml

  4. Activate the conda environment

  conda activate hpv_triplicate_env

  5. Install the vcf_parser package from https://github.com/moonso/vcf_parser

  6. Make sure you are good to go

  cd HPV_low_VAF_SNV_prediction # go to the repo directory that you cloned in Step 1
  python scripts/split_multi_allelic_var.py --help

  The following help menu should be displayed

Usage: split_multi_allelic_var.py [OPTIONS]

Options:
  --vcf_dir TEXT                The directory containing the vcf files (with
                                the headers)  [required]
  --output_dir TEXT             The directory to which the parsed files will
                                be written  [required]
  --num_metadata_lines INTEGER  The number of meta data lines in the vcf files
                                (excluding the header line). Meta data lines
                                begin with "##". Default is 70 for the raw
                                vcfs generated by our pipeline. However, for
                                the filtered VCF files generated by VCFgenie
                                it is 86.  [required]
  --help                        Show this message and exit.

Repository Contents

The contents of the repository are described below.

  • data/ - The data directory includes the raw and the processed variant data required for training and testing the machine learning models. Within this directory, the vcf_files/raw_vcf and vcf_files/vcfgenie_vcf subdirectories include the raw VCFs generated by our in-house variant calling pipeline and the filtered VCFs generated upon post-filtering of the raw VCFs by VCFgenie, respectively. The SNV_data_wo_vcf_genie.csv file contains all the raw iSNVs from our in-house pipeline and their features. The SNV_data_with_vcf_genie.csv file includes iSNVs extracted from the processed VCF files generated by VCFgenie. The PERCENT_OVERLAP column indicates the replicate frequency of each iSNV (1 => 3/3, 0.67 => 2/3, 0.33 => 1/3) and the AF column is the VAF of an iSNV. The HPV18_not_padded.fasta file includes the reference genome sequence for HPV18. The files initial_iSNVs_p*VCFgenie_maskID.txt and HPV18REF_*mer_counts.tsv are raw input data used by the script VCFgenie-spectra-analyses.R to analyze VCFgenie results and the molecular spectra of iSNVs.

  • scripts/ - The split_multi_allelic_var.py script parses a directory of VCF files, splits multi-allelic positions into separate rows, extracts only the iSNVs and their sequencing features, and writes them into a csv file. Note: to run this script, you must first download and install the vcf_parser package from https://github.com/moonso/vcf_parser . The data/SNV_data_wo_vcf_genie.csv and data/SNV_data_with_vcf_genie.csv files were generated using this script. Run python scripts/split_multi_allelic_var.py --help to explore how to use this script. Finally, VCFgenie-spectra-analyses.R is an R script intended to be run manually line by line to analyze VCFgenie results and the molecular spectra of iSNVs; its input files are available in the data/ directory.

  • machine_learning/ - This directory includes the machine learning code. The xgb_model_training_and_performance_new.ipynb notebook is an interactive jupyter notebook that guides the user step by step through reproducing the study results. It demonstrates how training and testing were performed for the 3 scenarios: a. machine learning with VAF filters (FM models), b. machine learning with VCFgenie without any low-VAF filters (VM models), and c. machine learning with VCFgenie with low-VAF filters (FVM models). The predict_true_low_vaf_snvs.py script is a standalone script that can be used for predicting true/false iSNVs with the pre-trained models (it doesn't require prior model training); it is elaborated on in the Usage section of this document. The xgb_hyper_param_tuning_parallelized_new.py script trains models with different hyperparameters (hyperparameter tuning). The merge_hyper_param_perf_data.py script merges the files generated from hyperparameter tuning into a single consolidated output. The get_feature_imp.ipynb notebook can be used to reproduce the feature importance calculated for the optimum parameter/feature combination and hyperparameters for the VM strategy.

  • FM_models/ - Includes trained XGB models for the FM scenario. Only includes models trained for the optimum parameter/feature combination.

  • VM_models/ - Includes trained XGB models for the VM scenario. Only includes models trained for the optimum parameter/feature combination.

  • FVM_models/ - Includes trained XGB models for the FVM scenario. Only includes models trained for the optimum parameter/feature combination.

  • results/ - Includes performance files for models trained in each of the 3 scenarios - FM, VM and FVM. These files are generated upon running the jupyter notebook (machine_learning/xgb_model_training_and_performance_new.ipynb). For each scenario, the performance metrics are reported for models trained for each parameter/feature combination. Note that for each combination, performance metrics are reported for 50 testing datasets generated by random sampling using 50 different seeds. When calculating the overall performance for a combination, we take the median performance across all 50 testing datasets.

  • variant_calling_pipeline/ - Includes the snakemake workflow for calling variants for HPV18
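To make the PERCENT_OVERLAP convention from the data/ description concrete, here is a minimal sketch of how replicate frequency maps to labels. The helper names are hypothetical (not part of the repo), and the True_Var_Def = 1 / False_Var_Def = 0.33 thresholds are those of the optimal FM combination shown in the Usage section.

```python
# Map the PERCENT_OVERLAP replicate frequency back to a replicate count
# (1 => 3/3, 0.67 => 2/3, 0.33 => 1/3) and derive a training label.
def replicate_count(percent_overlap: float) -> int:
    """Number of replicates (out of 3) in which an iSNV was observed."""
    return round(percent_overlap * 3)

def label_isnv(percent_overlap: float):
    """True for 3/3 replicates, False for 1/3, None (unlabeled) for 2/3.

    Mirrors True_Var_Def = 1 and False_Var_Def = 0.33, the thresholds of
    the optimal FM parameter/feature combination.
    """
    replicates = replicate_count(percent_overlap)
    if replicates == 3:
        return True
    if replicates == 1:
        return False
    return None

print(label_isnv(1.0), label_isnv(0.67), label_isnv(0.33))  # True None False
```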

Usage

Now we will go through some specific examples of how to use this repository.

Example 1: Reproducing study results

To reproduce the results included in our study, simply run the jupyter notebook machine_learning/xgb_model_training_and_performance_new.ipynb. For each scenario (FM, VM and FVM), the notebook will train and test XGBoost models on all possible parameter/feature combinations and write the performance metrics into the results directory. To reduce the execution time, you can skip the sections that save the models for the best performing parameter/feature combination in each scenario (i.e., the section titled "Save the best performing models for each prediction strategy" and all the code following it). These sections have already been executed and the models saved in their respective directories.

Following is an example of the performance output: a snapshot of the performance results for the FM models as observed in the performance_FM.csv file. Specifically, I am showing the performance for the optimal parameter/feature combination for FM, obtained by applying the following filters: True_Var_Def = 1 (i.e., 3/3 replicates), False_Var_Def = 0.33 (i.e., 1/3 replicates), VAF_Lower_Limit = 0.01, VAF_Upper_Limit = 0.5, Feature_Category = Strict. Note: I am only showing the first 10 rows.

As you will observe, there are 50 rows after applying the above filters; each row reports performance on a separate testing dataset obtained by random sampling with a different seed. The overall performance for this combination is obtained by calculating the median of each metric column (and is reported in the results/cumulative_perf_metrics directory).

Seed True_Var_Def False_Var_Def VAF_Lower_Limit VAF_Upper_Limit Feature_Category Num_True_Var_Testing Num_False_Var_Testing TP FP TN FN AUC MCC F1_score Accuracy MSE
10 [1] [0.33] 0.01 0.5 Strict 68 662 53 105 557 15 0.810401 0.438 0.469 0.836 0.164
20 [1] [0.33] 0.01 0.5 Strict 66 603 44 60 543 22 0.783582 0.467 0.518 0.877 0.123
3247 [1] [0.33] 0.01 0.5 Strict 205 1581 74 187 1394 131 0.621348 0.219 0.318 0.822 0.178
24 [1] [0.33] 0.01 0.5 Strict 98 776 79 220 556 19 0.761309 0.348 0.398 0.727 0.273
4501 [1] [0.33] 0.01 0.5 Strict 66 717 46 111 606 20 0.771079 0.376 0.413 0.833 0.167
25 [1] [0.33] 0.01 0.5 Strict 76 1026 42 211 815 34 0.673489 0.209 0.255 0.778 0.222
79 [1] [0.33] 0.01 0.5 Strict 63 748 38 232 516 25 0.646507 0.166 0.228 0.683 0.317
299 [1] [0.33] 0.01 0.5 Strict 43 667 16 313 354 27 0.451414 -0.046 0.086 0.521 0.479
1001 [1] [0.33] 0.01 0.5 Strict 156 765 86 163 602 70 0.669105 0.286 0.425 0.747 0.253
287 [1] [0.33] 0.01 0.5 Strict 33 157 33 35 122 0 0.888535 0.614 0.653 0.816 0.184
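The median aggregation described above can be sketched with the Python standard library, using the ten MCC values visible in the snapshot (the real performance file has 50 rows per combination):

```python
from statistics import median

# MCC values from the ten snapshot rows above; the overall performance for a
# parameter/feature combination is the median of each metric column across
# all seeds (only 10 of the 50 are shown in the snapshot).
mcc = [0.438, 0.467, 0.219, 0.348, 0.376, 0.209, 0.166, -0.046, 0.286, 0.614]
print(round(median(mcc), 3))  # 0.317 for these ten rows
```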

Example 2: Predicting true iSNVs using the raw VCF files

In this example, we will learn how to predict true/false iSNVs using the raw VCF files generated by our in-house variant calling pipeline. We will use the raw vcf files provided in the data/vcf_files/raw_vcf directory for this purpose.

As the first step, you will need to process the raw vcf files and convert them into a csv file using the scripts/split_multi_allelic_var.py script. Make sure your current directory is HPV_low_VAF_SNV_prediction and that you have activated the conda environment as described in Step 4 of the Installation section. Then use the following commands.

mkdir example2
cd example2
python ../scripts/split_multi_allelic_var.py --vcf_dir ../data/vcf_files/raw_vcf/ --output_dir raw_vcf_parsed --num_metadata_lines 70

You should see the following output (I have truncated the output as it was long!)

Processing file ../data/vcf_files/raw_vcf/sample10_D4.tvc.vcf (1/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample10_D5.tvc.vcf (2/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample10_D6.tvc.vcf (3/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample11_H1.tvc.vcf (4/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample11_H2.tvc.vcf (5/93)...Done!
...
...
...
Processing file ../data/vcf_files/raw_vcf/sample9_H7.tvc.vcf (91/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample9_H8.tvc.vcf (92/93)...Done!
Processing file ../data/vcf_files/raw_vcf/sample9_H9.tvc.vcf (93/93)...Done!
Splitting MNPs...Done!
Extracting only SNVs...Done!
Total SNVs = 9225. Wrote SNVs to raw_vcf_parsed/snvs.csv

The total number of iSNVs identified is 9,225, and the iSNVs and their features have been written to the file raw_vcf_parsed/snvs.csv.

Each row in raw_vcf_parsed/snvs.csv is an iSNV. We will now treat the raw_vcf_parsed/snvs.csv file as a hypothetical independent test dataset; our goal is to predict true and false low-VAF SNVs on this dataset. We will achieve this using the machine_learning/predict_true_low_vaf_snvs.py script. Check out the usage of this script as follows.

python ../machine_learning/predict_true_low_vaf_snvs.py --help

Usage: predict_true_low_vaf_snvs.py [OPTIONS]

Options:
  --snv_file TEXT           The csv file generated by the
                            split_multi_allelic_var.py script.For VM and FVM
                            models, remember to filter the vcf files with
                            VCFgenie before running the
                            split_multi_allelic_var.py script.  [required]
  --model_type [VM|FM|FVM]  The model type to be used for making the
                            prediction  [default: VM; required]
  --hpv_ref_file TEXT       The reference genome file for the HPV type. File
                            must contain the unpadded genome sequence in fasta
                            format (i.e., genome sequence should not include a
                            duplication of bases in the start region.)
                            [required]
  --outfile TEXT            Name of the output file that will contain the
                            prediction results  [required]
  --help                    Show this message and exit.

We will make predictions using the FM model. Run the following command and you will see the output that follows the command.

python ../machine_learning/predict_true_low_vaf_snvs.py --snv_file raw_vcf_parsed/snvs.csv --model_type FM --hpv_ref_file ../data/HPV18_not_padded.fasta --outfile FM_predictions.csv

Extracting SNVs...Done!
Calculating features ...Done!
Running predictions...Done!
Wrote predictions to FM_predictions.csv

The prediction output is written to FM_predictions.csv. As you will observe, the script only made predictions for 4,590/9,225 iSNVs. That's because prediction is only performed for iSNVs with a VAF between 1% and 60% (the optimal VAF range for the FM models), a minimum FDP of 100 and a minimum FAO of 1. The Predicted_SNV_Type column is a binary column that includes the prediction: TRUE => predicted true iSNV, FALSE => predicted false iSNV. The True_SNV_Probability column provides the probability that the iSNV is true: the higher the probability, the more likely the iSNV is a true iSNV. iSNVs with probabilities < 0.5 are labelled as false iSNVs.
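The eligibility filter and the decision rule described above can be written out as a small sketch (the helper names are hypothetical; the actual script applies these criteria internally):

```python
def eligible_for_prediction(vaf: float, fdp: int, fao: int) -> bool:
    """FM predictions are only made for iSNVs with a VAF between 1% and 60%,
    a minimum FDP of 100 and a minimum FAO of 1."""
    return 0.01 <= vaf <= 0.60 and fdp >= 100 and fao >= 1

def predicted_snv_type(true_snv_probability: float) -> bool:
    """iSNVs with True_SNV_Probability < 0.5 are labelled false."""
    return true_snv_probability >= 0.5

print(eligible_for_prediction(vaf=0.02, fdp=250, fao=5))  # True
print(predicted_snv_type(0.73))                           # True
```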

Example 3: Predicting true iSNVs using processed VCF files generated by VCFgenie

In this example, we will learn how to predict true/false iSNVs using the filtered VCF files generated by VCFgenie. We will use the raw vcf files provided in the data/vcf_files/raw_vcf directory for this purpose.

The first step is to run VCFgenie on the raw vcf files. VCFgenie requires the user to provide a p value threshold as a runtime argument. In this scenario, the threshold is the Bonferroni-corrected p value cutoff, calculated by dividing 0.05 by the total number of iSNVs (in our case 9,225, as calculated by the split_multi_allelic_var.py script). The corrected p value cutoff is therefore 0.000005420054201. We will also need the platform error rate; for Ion Torrent it is 0.00012936. We will now use the following commands to install VCFgenie, execute it and generate the filtered VCF files. Before running these commands, make sure that you are in the HPV_low_VAF_SNV_prediction directory.

mkdir example3
cd example3
git clone https://github.com/chasewnelson/VCFgenie.git
# Copy the raw vcf files to the current dir
cp -rf ../data/vcf_files/raw_vcf/ .
cd raw_vcf
python ../VCFgenie/VCFgenie.py --error_rate=0.00012936 --p_cutoff=0.000005420054201 --AC_key=FAO --AC_key_new=FAO --AF_key=AF --AF_key_new=AF --DP_key=FDP --overwrite_INFO --VCF_files sample*.vcf > vcfgenie.out

You will notice that once VCFgenie has finished running, a directory VCFgenie_output is created that contains the filtered/processed vcf files. The log output from VCFgenie can be found in the file vcfgenie.out.
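The --p_cutoff value passed to VCFgenie above is simply the family-wise alpha of 0.05 divided by the number of tests (iSNVs):

```python
# Bonferroni correction: divide the family-wise significance level by the
# number of iSNVs tested (9,225, as reported by split_multi_allelic_var.py
# in Example 2).
alpha = 0.05
num_isnvs = 9225
p_cutoff = alpha / num_isnvs
print(f"{p_cutoff:.15f}")  # 0.000005420054201
```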

Next, we will compile a csv file from the vcf files that will only contain iSNVs. As in Example 2, we will use the split_multi_allelic_var.py script to perform this operation.

# Go back to the parent directory
cd ..
python ../scripts/split_multi_allelic_var.py --vcf_dir ./raw_vcf/VCFgenie_output --output_dir VCFgenie_SNVs --num_metadata_lines 86

Note that we changed num_metadata_lines to 86 (compared to 70 in Example 2), as the VCFgenie-processed VCF files have 16 additional metadata lines. You should see the following output (I have truncated the output as it was long!)

Processing file VCFgenie_output/sample10_D4.tvc_filtered.vcf (1/93)...Done!
Processing file VCFgenie_output/sample10_D5.tvc_filtered.vcf (2/93)...Done!
Processing file VCFgenie_output/sample10_D6.tvc_filtered.vcf (3/93)...Done!
Processing file VCFgenie_output/sample11_H1.tvc_filtered.vcf (4/93)...Done!
Processing file VCFgenie_output/sample11_H2.tvc_filtered.vcf (5/93)...Done!
Processing file VCFgenie_output/sample11_H3.tvc_filtered.vcf (6/93)...Done!
...
...
...
Processing file VCFgenie_output/sample8_G7.tvc_filtered.vcf (88/93)...Done!
Processing file VCFgenie_output/sample8_G8.tvc_filtered.vcf (89/93)...Done!
Processing file VCFgenie_output/sample8_G9.tvc_filtered.vcf (90/93)...Done!
Processing file VCFgenie_output/sample9_H7.tvc_filtered.vcf (91/93)...Done!
Processing file VCFgenie_output/sample9_H8.tvc_filtered.vcf (92/93)...Done!
Processing file VCFgenie_output/sample9_H9.tvc_filtered.vcf (93/93)...Done!
Splitting MNPs...Done!
Extracting only SNVs...Done!
Total SNVs = 9225. Wrote SNVs to VCFgenie_SNVs/snvs.csv
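If you are ever unsure what to pass for --num_metadata_lines, the "##" lines can be counted directly (a hypothetical helper, not part of the repo):

```python
def count_metadata_lines(vcf_path: str) -> int:
    """Count the leading '##' meta data lines in a VCF file
    (excluding the '#CHROM' header line)."""
    count = 0
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("##"):
                count += 1
            else:
                break
    return count
```

On the files used here, this should return 70 for the raw VCFs from our pipeline and 86 for the VCFgenie-processed VCFs.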

Finally, we will predict true iSNVs with the VM models and observe the output below.

python ../machine_learning/predict_true_low_vaf_snvs.py --snv_file VCFgenie_SNVs/snvs.csv --model_type VM --hpv_ref_file ../data/HPV18_not_padded.fasta --outfile VM_predictions.csv

Extracting SNVs...Done!
Calculating features ...Done!
Running predictions...Done!
Wrote predictions to VM_predictions.csv

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Sambit Mishra - sambit.mishra@nih.gov

Project Link: https://github.com/NCI-CGR/HPV_low_VAF_SNV_prediction

Acknowledgements

This work is the outcome of a collaborative effort between researchers at the Cancer Genomics Research Laboratory (CGR), Frederick National Laboratory, and the principal investigators and postdoctoral fellows at the Division of Cancer Epidemiology and Genetics (DCEG). I gratefully acknowledge the scientific efforts and research guidance of Dr. Lisa Mirabello, Dr. Meredith Yeager, Dr. Laurie Burdette, Dr. Michael Dean, Dr. Bin Zhu, Dr. Chase Nelson, Dr. Hyo Jung Lee and Dr. Maisa Pinheiro, without which this work would not have been complete.