This document explains how to quickly start using DeepTrio to generate variant calls for trio samples. This tutorial does not cover all possible settings of DeepTrio. It is intended to be a starting point for using DeepTrio.
To get started, we've provided a Docker image, and some test data in a bucket on Google Cloud Storage. The instructions below show how to download the data through the corresponding public URLs.
This setup requires a machine with the AVX instruction set. To see if your
machine meets this requirement, you can check the /proc/cpuinfo
file, which
lists this information under "flags". If you do not have the necessary
instructions, see the next section for more information on how to build your own
Docker image.
Although DeepTrio can be built from a source, we provide a docker image that allows to run through all steps in one command to generate VCF/gVCF output files from input BAM files and the reference.
If you want to compile the binaries for yourself, we also have a Dockerfile that you can use to build your own Docker image. You can read the docker build documentation on how to build.
BIN_VERSION="1.6.1"
sudo apt -y update
sudo apt-get -y install docker.io
sudo docker pull google/deepvariant:deeptrio-"${BIN_VERSION}"
Before you start, you need to have the following input files:
-
A reference genome in FASTA format and its corresponding index file (.fai).
-
For each sample, one aligned reads file in BAM format and its corresponding index file (.bai). You get this by aligning the reads from a sequencing instrument, using an aligner like BWA for example.
We've prepared a small test data bundle for use in this quick start guide that can be downloaded to your instance from the public URLs.
Download the test bundle:
INPUT_DIR="${PWD}/quickstart-testdata"
mkdir -p ${INPUT_DIR}
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > "${INPUT_DIR}"/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > "${INPUT_DIR}"/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > "${INPUT_DIR}"/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > "${INPUT_DIR}"/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > "${INPUT_DIR}"/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG003_NA24149_father/NISTv4.2.1/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > "${INPUT_DIR}"/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed > "${INPUT_DIR}"/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz > "${INPUT_DIR}"/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
curl ${FTPDIR}/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi > "${INPUT_DIR}"/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HTTPDIR=https://storage.googleapis.com/deepvariant/quickstart-testdata
wget -P ${INPUT_DIR} "${HTTPDIR}"/HG002.chr20.10_10p1mb.bam
curl "${HTTPDIR}/HG002.chr20.10_10p1mb.bam" > "${INPUT_DIR}/HG002.chr20.10_10p1mb.bam"
curl "${HTTPDIR}/HG002.chr20.10_10p1mb.bam.bai" > "${INPUT_DIR}/HG002.chr20.10_10p1mb.bam.bai"
curl "${HTTPDIR}/HG003.chr20.10_10p1mb.bam" > "${INPUT_DIR}/HG003.chr20.10_10p1mb.bam"
curl "${HTTPDIR}/HG003.chr20.10_10p1mb.bam.bai" > "${INPUT_DIR}/HG003.chr20.10_10p1mb.bam.bai"
curl "${HTTPDIR}/HG004.chr20.10_10p1mb.bam" > "${INPUT_DIR}/HG004.chr20.10_10p1mb.bam"
curl "${HTTPDIR}/HG004.chr20.10_10p1mb.bam.bai" > "${INPUT_DIR}/HG004.chr20.10_10p1mb.bam.bai"
FTPDIR=ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz | gunzip > "${INPUT_DIR}"/GRCh38_no_alt_analysis_set.fasta
curl ${FTPDIR}/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > "${INPUT_DIR}"/GRCh38_no_alt_analysis_set.fasta.fai
This should create a subdirectory in the current directory containing the actual data files:
ls -1 ${INPUT_DIR}
output:
GRCh38_no_alt_analysis_set.fasta
GRCh38_no_alt_analysis_set.fasta.fai
HG002.chr20.10_10p1mb.bam
HG002.chr20.10_10p1mb.bam.bai
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
HG003.chr20.10_10p1mb.bam
HG003.chr20.10_10p1mb.bam.bai
HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
HG004.chr20.10_10p1mb.bam
HG004.chr20.10_10p1mb.bam.bai
HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz.tbi
HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
We create one entrypoint that can be directly run as a docker command. If you want to see the details, you can read through run_deeptrio.py.
OUTPUT_DIR="${PWD}/quickstart-output"
mkdir -p "${OUTPUT_DIR}"
You can run everything with the following command:
sudo docker run \
-v "${INPUT_DIR}":"/input" \
-v "${OUTPUT_DIR}":"/output" \
google/deepvariant:deeptrio-"${BIN_VERSION}" \
/opt/deepvariant/bin/deeptrio/run_deeptrio \
--model_type=WGS \
--ref=/input/GRCh38_no_alt_analysis_set.fasta \
--reads_child=/input/HG002.chr20.10_10p1mb.bam \
--reads_parent1=/input/HG003.chr20.10_10p1mb.bam \
--reads_parent2=/input/HG004.chr20.10_10p1mb.bam \
--output_vcf_child /output/HG002.output.vcf.gz \
--output_vcf_parent1 /output/HG003.output.vcf.gz \
--output_vcf_parent2 /output/HG004.output.vcf.gz \
--sample_name_child 'HG002' \
--sample_name_parent1 'HG003' \
--sample_name_parent2 'HG004' \
--num_shards $(nproc) \
--regions "chr20:10,000,000-10,010,000" \
--intermediate_results_dir /output/intermediate_results_dir \
--output_gvcf_child /output/HG002.g.vcf.gz \
--output_gvcf_parent1 /output/HG003.g.vcf.gz \
--output_gvcf_parent2 /output/HG004.g.vcf.gz
NOTE: If you want to look at all the commands being run, you can add
--dry_run=true
to the command above, which will print out all the commands
but not execute them.
This will generate 15 files and 1 directory in ${OUTPUT_DIR}
:
ls -1 ${OUTPUT_DIR}
output:
HG002.g.vcf.gz
HG002.g.vcf.gz.tbi
HG002.output.vcf.gz
HG002.output.vcf.gz.tbi
HG002.output.visual_report.html
HG003.g.vcf.gz
HG003.g.vcf.gz.tbi
HG003.output.vcf.gz
HG003.output.vcf.gz.tbi
HG003.output.visual_report.html
HG004.g.vcf.gz
HG004.g.vcf.gz.tbi
HG004.output.vcf.gz
HG004.output.vcf.gz.tbi
HG004.output.visual_report.html
intermediate_results_dir
The directory "intermediate_results_dir" exists because
--intermediate_results_dir /output/intermediate_results_dir
is specified. This
directory contains the intermediate output of make_examples and call_variants
steps.
For more information about the HG00*.output.visual_report.html
files, see the
VCF stats report documentation.
If you are using GPUs, you can pull the GPU version, and make sure you run with
--gpus 1
. call_variants
is the only step that uses the GPU, and can only use
one at a time. make_examples
and postprocess_variants
do not run on GPU.
For an example to install GPU driver and docker, see install_nvidia_docker.sh.
sudo docker run --gpus 1 \
-v "${INPUT_DIR}":"/input" \
-v "${OUTPUT_DIR}:/output" \
google/deepvariant:deeptrio-"${BIN_VERSION}-gpu" \
/opt/deepvariant/bin/deeptrio/run_deeptrio \
...
# Pull the image.
singularity pull docker://google/deepvariant:deeptrio-"${BIN_VERSION}"
# Run DeepTrio.
singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
docker://google/deepvariant:deeptrio-"${BIN_VERSION}" \
/opt/deepvariant/bin/deeptrio/run_deeptrio \
--model_type=WGS \
--ref="${INPUT_DIR}"/GRCh38_no_alt_analysis_set.fasta \
--reads_child="${INPUT_DIR}"/HG002.chr20.10_10p1mb.bam \
--reads_parent1="${INPUT_DIR}"/HG003.chr20.10_10p1mb.bam \
--reads_parent2="${INPUT_DIR}"/HG004.chr20.10_10p1mb.bam \
--output_vcf_child "${OUTPUT_DIR}"/HG002.output.vcf.gz \
--output_vcf_parent1 "${OUTPUT_DIR}"/HG003.output.vcf.gz \
--output_vcf_parent2 "${OUTPUT_DIR}"/HG004.output.vcf.gz \
--sample_name_child 'HG002' \
--sample_name_parent1 'HG003' \
--sample_name_parent2 'HG004' \
--num_shards $(nproc) \
--regions "chr20:10,000,000-10,010,000" \
--intermediate_results_dir "${OUTPUT_DIR}"/intermediate_results_dir \
--output_gvcf_child "${OUTPUT_DIR}"/HG002.g.vcf.gz \
--output_gvcf_parent1 "${OUTPUT_DIR}"/HG003.g.vcf.gz \
--output_gvcf_parent2 "${OUTPUT_DIR}"/HG004.g.vcf.gz
# Pull the image.
singularity pull docker://google/deepvariant:deeptrio-"${BIN_VERSION}-gpu"
# Run DeepTrio.
# Using "--nv" and "${BIN_VERSION}-gpu" is important.
singularity run --nv -B /usr/lib/locale/:/usr/lib/locale/ \
docker://google/deepvariant:deeptrio-"${BIN_VERSION}-gpu" \
/opt/deepvariant/bin/deeptrio/run_deeptrio \
...
Here we use the hap.py
(https://github.com/Illumina/hap.py)
program from Illumina to evaluate the resulting three VCF files (only covering
10 kb of chr20 for this small example). Here the DeepTrio output VCF for each
sample is compared against the corresponding truth set from GIAB.
# Pull docker image locally
sudo docker pull jmcdani20/hap.py:v0.3.12
# Evaluate HG002
sudo docker run -it \
-v "${INPUT_DIR}":"/input" \
-v "${OUTPUT_DIR}:/output" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/input/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG002.output.vcf.gz \
-f "/input/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed" \
-r "/input/GRCh38_no_alt_analysis_set.fasta" \
-o "/output/HG002.happy" \
--engine=vcfeval \
--pass-only \
-l chr20:10000000-10010000
# Evaluate HG003
sudo docker run -it \
-v "${INPUT_DIR}":"/input" \
-v "${OUTPUT_DIR}:/output" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/input/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG003.output.vcf.gz \
-f "/input/HG003_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed" \
-r "/input/GRCh38_no_alt_analysis_set.fasta" \
-o "/output/HG003.happy" \
--engine=vcfeval \
--pass-only \
-l chr20:10000000-10010000
# Evaluate HG004
sudo docker run -it \
-v "${INPUT_DIR}":"/input" \
-v "${OUTPUT_DIR}:/output" \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/input/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz \
/output/HG004.output.vcf.gz \
-f "/input/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed" \
-r "/input/GRCh38_no_alt_analysis_set.fasta" \
-o "/output/HG004.happy" \
--engine=vcfeval \
--pass-only \
-l chr20:10000000-10010000
You should see output similar to the following.
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 3 3 0 4 0 0 0 0 1.0 1.0 0.0 1.0 NaN NaN 1.000000 3.000000
INDEL PASS 3 3 0 4 0 0 0 0 1.0 1.0 0.0 1.0 NaN NaN 1.000000 3.000000
SNP ALL 19 19 0 19 0 0 0 0 1.0 1.0 0.0 1.0 5.333333 5.333333 2.166667 2.166667
SNP PASS 19 19 0 19 0 0 0 0 1.0 1.0 0.0 1.0 5.333333 5.333333 2.166667 2.166667
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 3 3 0 3 0 0 0 0 1.0 1.0 0.0 1.0 NaN NaN 0.500000 0.500000
INDEL PASS 3 3 0 3 0 0 0 0 1.0 1.0 0.0 1.0 NaN NaN 0.500000 0.500000
SNP ALL 21 21 0 21 0 0 0 0 1.0 1.0 0.0 1.0 6.0 6.0 1.333333 1.333333
SNP PASS 21 21 0 21 0 0 0 0 1.0 1.0 0.0 1.0 6.0 6.0 1.333333 1.333333
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2 2 0 2 0 0 0 0 1.0 1.0 0.0 1.0 NaN NaN 1.000000 1.000000
INDEL PASS 2 2 0 2 0 0 0 0 1.0 1.0 0.0 1.0 NaN NaN 1.000000 1.000000
SNP ALL 10 10 0 10 0 0 0 0 1.0 1.0 0.0 1.0 4.0 4.0 2.333333 2.333333
SNP PASS 10 10 0 10 0 0 0 0 1.0 1.0 0.0 1.0 4.0 4.0 2.333333 2.333333