diff --git a/.nojekyll b/.nojekyll index 6724e6b1..01552792 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -eba7615a \ No newline at end of file +4852efe0 \ No newline at end of file diff --git a/exercises/compute_environment/index.html b/exercises/compute_environment/index.html index 4692f70a..ef4d9fef 100644 --- a/exercises/compute_environment/index.html +++ b/exercises/compute_environment/index.html @@ -445,8 +445,8 @@

Compute environment

mkdir -p /proj/naiss2023-22-1084/users/YOURUSERNAME
 cd /proj/naiss2023-22-1084/users/YOURUSERNAME

All computations should be run on a compute node. You can request an interactive session with the interactive command. For example, to request an eight hour job on 4 cores, run

-
interactive -A naiss2023-22-1084 --cores 4 \
-   --partition core --time 08:00:00 \
+
interactive -A naiss2023-22-1084 -n 4 \
+   --time 08:00:00 \
    --reservation=naiss2023-22-1084
diff --git a/exercises/datasets/monkeyflowers.html b/exercises/datasets/monkeyflowers.html index a1f9e958..049c62a2 100644 --- a/exercises/datasets/monkeyflowers.html +++ b/exercises/datasets/monkeyflowers.html @@ -380,7 +380,7 @@

Monkeyflowers dataset

-

The allure of monkeyflowers (DOI: 10.1126/science.365.6456.854)

+

The allure of monkeyflowers (DOI: 10.1126/science.365.6456.854)

@@ -391,13 +391,13 @@

Monkeyflowers dataset

Recently, Stankowski et al. (2019) used the monkeyflower system to investigate what forces affect the genomic landscape. Burri (2017) has suggested that background selection (BGS) is one of the main causes for correlations between genomic landscapes, and that one way to study this phenomenon is to look at closely related taxa. This is one of the objectives of the Stankowski et al. (2019) paper.

They performed whole-genome resequencing of 37 individuals from 7 subspecies and 2 ecotypes of Mimulus aurantiacus and its sister taxon M. clevelandii (Figure 1), all sampled in California (Figure 2).

-

+

Figure 1: Evolutionary relationships across the radiation
-

+

Figure 2: Sampling locations

Genomewide statistics, such as diversity (\(\pi\)), divergence (\(d_{XY}\)) and differentiation \(F_{ST}\), were calculated within and between taxa to generate genomic diversity landscapes. The landscapes were highly similar across taxa, and local variation in genomic features, such as gene density and recombination rate, was predictive of variation in landscape patterns. These features suggest the influence of selection, in particular BGS.

@@ -413,7 +413,7 @@

Monkeyflowers dataset

Figure 3 shows typical results of the simulations.

-

+

Figure 3: Genomic landscapes simulated under different divergence histories.

In conclusion, the authors found that although BGS plays a role, it does not sufficiently explain all observations, and that other aspects of natural selection (such as rapid adaptation) are responsible for the similarities between genomic landscapes.

@@ -760,7 +760,7 @@

Monkeyflowers dataset

- + \ No newline at end of file diff --git a/exercises/demography/index.html b/exercises/demography/index.html index d5f12bc9..00ace0c6 100644 --- a/exercises/demography/index.html +++ b/exercises/demography/index.html @@ -1,13 +1,10 @@ - + - - - DDLS Population genomics in practice - Demographic inference - + @@ -61,8 +72,7 @@ - - - - - - +} -
-
-
-
-
- -
-
+

Demographic inference

@@ -385,9 +392,7 @@

Demographic inference

-
- -
+
@@ -432,13 +440,25 @@

Demographic inference

- +
- + +

Move to your course working directory /proj/naiss2023-22-1084/users/USERNAME, create an exercise directory called demography and cd to it:

+ +
cd /proj/naiss2023-22-1084/users/USERNAME
+mkdir -p demography
+cd demography
+

You will run most analyses on Rackham, so just go to your directory to start. You will need to download the results files to you computer afterwards, you can use wget like you’d done multiple times by now.

- +

You will run R, either via R Studio or on the command line, you can choose.

+

I recommend downloading the Rmd file that has all the instructions for this exercise and all the R code necessary to generate the plots we want to see today.

+

Just download it from https://export.uppmax.uu.se/naiss2023-22-1084/demography/demography.Rmd with wget1:

+
wget --user pgip --password PASSWORD https://export.uppmax.uu.se/naiss2023-22-1084/demography/demography.Rmd
@@ -451,52 +471,220 @@

Demographic inference

-Tools +Note
-
- -
- -
-

Execute the following command to load modules:

-
#| label: uppmax-load-modules
-#| echo: true
-#| eval: false
-module load uppmax bioinfo-tools \
-    psmc/0.6.5-r67-e5f7df5
-
-
-

Copy the contents to a file environment.yml and install packages with mamba env update -f environment.yml.

-
channels:
-  - conda-forge
-  - bioconda
-  - default
-dependencies:
-  - psmc
+
+

:::

+
+ +
+
+
+ +
+
+

Everything you need is on Uppmax. Execute the following command to load modules:

+
+
module load bioinfo-tools samtools PCAngsd ANGSD bcftools plink htslib
+
+
+
+
+
+
+
+

Practical information

+

This exercise is divided into two parts. The first part will happen on Rackham. All the files you need are located here: /proj/naiss2023-22-1084/private/demography. The second part will happen inside R Studio locally.

+

We will be using ANGSD (https://github.com/ANGSD/angsd).

+

What has been prepared for you: - Dataset. Data from 3 human populations (TSI, PEL and LWK). - You are also being provided with files with the extension *.filelist. These files contain a list with all the BAM files needed for this exercise.

+

You have two options now: 1) Create a bash script and submit to SLURM; 2) Start an Interactive session on Rackham and run manually each step. I recommend this option for today.

+

How to start interactive mode on Rackham:

+
#| label: uppmax-interactive-mode
+#| echo: true
+#| eval: false
+interactive -A naiss2023-22-1084 -n 4 \
+   --partition core --time 08:00:00 \
+   --reservation=naiss2023-22-1084_4```
+
+You need to load all modules on Rackham so you don't have to worry about
+installing each program that we will need in this exercise.
+
+```bash
+#| label: load-modules
+#| echo: true
+#| eval: false       
+module load bioinfo-tools samtools PCAngsd ANGSD bcftools plink htslib
+

First steps

+

Calculate allele frequencies

+

You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist).

+

Just run:

+
#| label: angsd-basic-command
+#| echo: true
+#| eval: false         
+angsd -b ALL.bamlist -GL 2 -doMajorMinor 1 -doMaf 2 -minMapQ 30 -minQ 20 -nThreads 4
+
-b = your file with your BAM files list
+-GL 2 = This will calculate the genotype likelihood for you. "2" means it will use the same method as GATK. If you use -GL 1 it will use the same method as Samtools.
+-doMajorMinor 1 = Many methods assume that each site is diallelic, so this option will use the genotype likelihood to infer the major and minor alleles.
+-doMaf 2 = It will use the inferred major alleles, but it does not assume known minor alleles.
+-minMapQ 30 = mapping quality filter
+-minQ 20 = base quality filter
+-nThreads 4 = will use 4 threads (cores) to perform all calculations.
+

The output will be angsdput.mafs.gz.

+

Take a look inside the results file. This file is formatted such as:

+
chromo  position        major   minor   unknownEM       nInd
+11      61005493        C       A       0.000006        1
+11      61005494        A       C       0.000006        1
+11      61005495        C       A       0.000006        1
+11      61005496        G       A       0.000006        1
+11      61005497        G       A       0.000006        1
+11      61005498        T       A       0.000006        1
+11      61005499        A       C       0.000006        1
+11      61005500        C       A       0.000006        1
+11      61005501        A       C       0.000006        1
+
chromo = chromosome name
+position = position…
+major = major allele
+minor = minor allele
+knownEM = frequency using -doMaf1
+unknownEM = frequency using -doMaf2
+pK-EM = p-value for the frequency of the minor allele
+nInd = Number of individuals with data
+

Generate the genotype likelihood files

+

You will need your file that has a list with all BAM files from all individuals from all populations (/proj/naiss2023-22-1084/private/demography/dataset/ALL.bamlist).

+

Just run:

+
#| echo: true
+#| eval: false
+angsd -GL 2 -doGlf 2 -b ALL.bamlist -doMajorMinor 1 -SNP_pval 1e-6 -doMaf 1 -nThreads 4
+
-doGlf 2 = will create the beagle likelihood file
+-SNP_pval = it will use 10^-6 as a p-value for the LRT to calculate which sites are polymorphic.
+

The output will be angsdput.mafs.gz and angsdput.beagle.gz

+

You already know what’s inside the mafs.gz file, but let’s take a look inside the Beagle format:

+
marker  allele1 allele2 Ind0    Ind0    Ind0    Ind1    Ind1    Ind1    Ind2    Ind2    Ind2    Ind3    Ind3    Ind3    Ind4    Ind4    Ind4    Ind5    Ind5    Ind5    Ind6    Ind6    Ind6    Ind7    Ind7    Ind7    Ind8    Ind8    Ind8    Ind9    Ind9    Ind9
+1_14000202      2       0       0.000532        0.999468        0.000000        0.333333        0.333333        0.333333        0.333333        0.333333        0.333333        0.666490        0.333333        0.000177        0.333333        0.333333        0.333333        0.002660        0.997340        0.000001        0.799744        0.200255        0.000000        0.000842        0.999157        0.000001        0.888820        0.111180        0.000000        0.799988        0.200012        0.000000
+1_14000873      2       0       0.000000        0.030324        0.969676        0.663107        0.333333        0.003560        0.888698        0.111302        0.000000        0.992114        0.007886        0.000000        0.999013        0.000987        0.000000        0.000009        0.991658        0.008334        0.888726        0.111274        0.000000        0.799696        0.200303        0.000001        0.000000        0.030330        0.969670        0.666490        0.333333        0.000177
+1_14001018      3       1       0.000000        0.015429        0.984571        0.799823        0.200177        0.000000        0.050636        0.949364        0.000000        0.941120        0.058880        0.000000        0.888842        0.111158        0.000000        0.799970        0.200030        0.000000        0.333333        0.333333        0.333333        0.992231        0.007769        0.000000        0.000140        0.333333        0.666526        0.666622        0.333333        0.000044
+

You will see that each individual has three columns in the file.

+
marker = chromosome and position (chr_position)
+allele1 = major allele as: 0=A, 1=C, 2=G, 3=T
+allele2 = minor allele as: 0=A, 1=C, 2=G, 3=T
+Ind0 = 1st column is the genotype likelihood for major/major genotype
+Ind0 = 2nd column is the genotype likelihood for major/minor genotype
+Ind0 = 3rd column is the genotype likelihood for minor/minor genotype
+

SFS estimation

+

Let’s look into the Site Frequency Spectrum (SFS) for these populations. As you saw in the theoretical part earlier today, the SFS will show you the proportions of sites at different allele frequencies. And to calculate the unfolded SFS we will need an outgroup species to be used as the ancestral state. One important caveat to remember is that during out exercises we are using only a portion of one chromosome, but for real world data usage it is important to use the whole genome so you have enough information.

+

We usually run SFS per population, but let’s try running it for all individuals in our dataset first, just out of curiosity.

+

You will need your file that has a list with all BAM files from all individuals from all populations (naiss2023-22-1084/demography/dataset/ALL.bamlist). You will also need /proj/naiss2023-22-1084/private/demography/dataset/anc.fa.gz, which has the ancestral state for your individuals. Let’s calculate the unfolded SFS.

+

First we need the allele frequencies.

+

Just run:

+
#| echo: true
+#| eval: false
+angsd -bam ALL.bamlist -anc anc.fa.gz -doSaf 1 -GL 2 -nThreads 4
+
-doSaf 1 = it will calculate allele frequency likelihood based on individual genotypoe likelihoods assuming Hardy-Weinberg.
+

The output will be angsdput.saf.idx, angsdput.saf.pos.gz, angsdput.saf.gz. If you want to know more about these files, read https://github.com/ANGSD/angsd/blob/newsaf/doc/formats.pdf.

+

But now that we have calculated allele frequencies, let’s calculate the genotype likelihood:

+

Just run:

+
#| echo: true
+#| eval: false
+angsd -bam ALL.bamlist -doSaf 1 -out output_gl -anc anc.fa.gz -GL 2 -P 4 -minMapQ 1 -minQ 20
+
-doSaf 1 = it will calculate allele frequency likelihood based on individual genotypoe likelihoods assuming Hardy-Weinberg.
+-out = choose a name for your output files
+-anc chimpHg19.fa.gz = we have to supply a file with the ancestral state if we want to find unfolded SFS
+-P = just another way to use "nThreads"
+

Now that all calculations have been done, let’s output the SFS based on them. Just run:

+
#| echo: true
+#| eval: false         
+realSFS angsdput.saf.idx -maxIter 100 -P 4 >results.sfs
+

One you have done it, repeat it with our populations: PEL, TSI and LWK. Do not forget to add the option -out name to name your output files so you can identify each population results.

+

Switching to your local computer

+

Alright, now you need to have this .Rmd file and all outputs in the same directory. You can copy to your machine. Or you can remote access it somewhere. It’s up to you. But if you followed the instructions at the top of the page you’ll have all you need already in your local machine.

+

Plotting SFS

+

You can run the code below in R to see how the SFS plots look like. Does it look like the plot we’ve seen during the theoretical session?

+

Let’s look population by population.

+

First, PEL:

+
+
# function to normalize
+norm <- function(x) x/sum(x)
+# read data
+sfs <- (scan("pel_results.sfs"))
+# the variability as percentile
+pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
+# the variable categories of the sfs
+sfs <- norm(sfs[-c(1, length(sfs))])
+barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
+    main = "SFS plot", col = "blue")
+
+

then LWK:

+
+
# function to normalize
+norm <- function(x) x/sum(x)
+# read data
+sfs <- (scan("lwk_results.sfs"))
+# the variability as percentile
+pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
+# the variable categories of the sfs
+sfs <- norm(sfs[-c(1, length(sfs))])
+barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
+    main = "SFS plot", col = "blue")
+
+

And TSI:

+
+
# function to normalize
+norm <- function(x) x/sum(x)
+# read data
+sfs <- (scan("tsi_results.sfs"))
+# the variability as percentile
+pvar <- (1 - sfs[1] - sfs[length(sfs)]) * 100
+# the variable categories of the sfs
+sfs <- norm(sfs[-c(1, length(sfs))])
+barplot(sfs, xlab = "Derived allele frequency", names = 1:length(sfs), ylab = "Proportions",
+    main = "SFS plot", col = "blue")
+
+

They are all a bit different, why do you think is that?

+

Let’s see them all together:

+
+
# function to normalize
+nnorm <- function(x) x/sum(x)
+# expected number of sites with 1:20 derived alleles
+res <- rbind(PEL = scan("pel_results.sfs")[-1], LWK = scan("LWK_results.sfs")[-1],
+    TSI = scan("TSI_results.sfs")[-1])
+colnames(res) <- 1:20
+
+# density instead of expected counts
+res <- t(apply(res, 1, nnorm))
+
+# plot the polymorphic sites.
+resPoly <- t(apply(res[, -20], 1, nnorm))
+barplot(resPoly, beside = T, legend = c("PEL", "LWK", "TSI"), names = 1:19, main = "Derived allele frequencies")
+
+

References

+
    +
  • You can learn a lot by browsing the ANGSD website: http://www.popgen.dk/angsd/index.php/ANGSD
  • +
  • All data comes from the 1000 Genomes Project https://www.nature.com/articles/nature15393
  • +
  • You can check Matteo Fumagalli’s ngsTools Tutorial page for more exercises using the same dataset. It will show you how to do other things we didn’t have time to do today, like FST, PCAs, etc, using ANGSD/ngsTools: https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md
  • +
  • This exercise just scratched the surface of what you can do with this type of data. If you are interested in running more interesting analyses, I recommend that you take a look at dadi: https://dadi.readthedocs.io/en/latest/ They have a nice Jupyterlab notebook (like you have been using in the previous days) which will help you a lot.
  • +
  • I also recommend Fastsimcoal2: http://cmpg.unibe.ch/software/fastsimcoal2/
  • +
- - -

References

-
-Li, H., & Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357), 493–496. https://doi.org/10.1038/nature10231 -
-
-
-
-
+ \ No newline at end of file diff --git a/exercises/index.html b/exercises/index.html index 6077ac84..b95be713 100644 --- a/exercises/index.html +++ b/exercises/index.html @@ -283,7 +283,7 @@

Exercises

-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
-
+
diff --git a/exercises/population_structure/pca_mds_toy_example.html b/exercises/population_structure/pca_mds_toy_example.html index 52755069..8326d165 100644 --- a/exercises/population_structure/pca_mds_toy_example.html +++ b/exercises/population_structure/pca_mds_toy_example.html @@ -783,7 +783,7 @@

MDS for popula

- + diff --git a/exercises/variant_calling/index.html b/exercises/variant_calling/index.html index 82ac5815..ac6cc03f 100644 --- a/exercises/variant_calling/index.html +++ b/exercises/variant_calling/index.html @@ -1789,7 +1789,7 @@

Variant calling

- + \ No newline at end of file diff --git a/exercises/variant_filtering/index.html b/exercises/variant_filtering/index.html index 219ad4e9..821a30e5 100644 --- a/exercises/variant_filtering/index.html +++ b/exercises/variant_filtering/index.html @@ -670,7 +670,7 @@

Variant filtering

bcftools stats variantsites.subset.vcf.gz |\ grep "number of records:"
-
SN  0   number of records:  102590
+
SN  0   number of records:  102642

The -r parameter sets the rate of sampling which is why we get approximately 100,000 sites. You will need to adjust this parameter accordingly.

@@ -690,18 +690,18 @@

Variant filtering

csvtk summary -t -f MEAN_DEPTH:mean ${OUT}.idepth
INDV    N_SITES MEAN_DEPTH
-PUN-R-ELF   102402  8.55983
-PUN-R-JMC   102391  9.52674
-PUN-R-LH    102424  9.24296
-PUN-R-MT    102384  9.45553
-PUN-R-UCSD  102466  8.89254
-PUN-Y-BCRD  102392  9.85856
-PUN-Y-INJ   102348  8.2588
-PUN-Y-LO    102355  7.81509
-PUN-Y-PCT   102341  9.14405
-PUN-Y-POTR  102325  8.47222
+PUN-R-ELF   102456  8.56523
+PUN-R-JMC   102448  9.4699
+PUN-R-LH    102478  9.16915
+PUN-R-MT    102438  9.42865
+PUN-R-UCSD  102524  8.88314
+PUN-Y-BCRD  102439  9.79692
+PUN-Y-INJ   102404  8.22082
+PUN-Y-LO    102408  7.75924
+PUN-Y-PCT   102397  9.09537
+PUN-Y-POTR  102367  8.44972
 MEAN_DEPTH:mean
-8.92
+8.88

The average coverage over all samples is 8.9X. This actually is in the low range for a protocol based on explicitly calling genotypes. At 5X coverage, there may be a high probability that only one of the alleles has been sampled (Nielsen et al., 2011), whereby sequencing errors may be mistaken for true variation.

@@ -711,8 +711,8 @@

Variant filtering

head -n 3 ${OUT}.ldepth
CHROM   POS SUM_DEPTH   SUMSQ_DEPTH
-LG4 2105    9   23
-LG4 2130    11  21
+LG4 2130 11 21 +LG4 2139 12 26

So, for each position, we have a value (column SUM_DEPTH) for the total depth across all samples.

@@ -753,7 +753,7 @@

Variant filtering

csvtk summary -t -f 3:min,3:q1,3:median,3:mean,3:q3,3:max,3:stdev vcftools/variantsites.subset.ldepth
SUM_DEPTH:min   SUM_DEPTH:q1    SUM_DEPTH:median    SUM_DEPTH:mean  SUM_DEPTH:q3    SUM_DEPTH:max   SUM_DEPTH:stdev
-1.00    55.00   72.00   89.05   88.00   33988.00    242.59
+1.00 55.00 72.00 88.66 88.00 14772.00 213.95

The range from the first quartile (q1) to the third (q3) is 55-88, showing most sites have a depth between 50-100X. We redraw the plot to zoom in on the peak:

@@ -771,8 +771,8 @@

Variant filtering

head -n 3 ${OUT}.ldepth.mean
CHROM   POS MEAN_DEPTH  VAR_DEPTH
-LG4 2105    0.9 1.65556
-LG4 2130    1.1 0.988889
+LG4 2130 1.1 0.988889 +LG4 2139 1.2 1.28889

Variant quality distribution

@@ -806,8 +806,8 @@

Variant filtering

head -n 3 ${OUT}.frq
CHROM   POS N_ALLELES   N_CHR   {FREQ}
-LG4 2105    2   4   0   1
-LG4 2130    2   14  0.857143    0.142857
+LG4 2130 2 14 0.857143 0.142857 +LG4 2139 2 8 0.25 0.75

The last two columns are frequencies ordered according to the reference allele. Therefore, we need to pick the minimum value to get the MAF. We can use csvtk mutate to create a new column

@@ -836,8 +836,8 @@

Variant filtering

head -n 3 ${OUT}.imiss
INDV    N_DATA  N_GENOTYPES_FILTERED    N_MISS  F_MISS
-PUN-R-ELF   102590  0   6911    0.0673652
-PUN-R-JMC   102590  0   6588    0.0642168
+PUN-R-ELF 102642 0 6848 0.0667173 +PUN-R-JMC 102642 0 6578 0.0640868

and look at the results, focusing on the F_MISS column (proportion missing sites):

@@ -858,8 +858,8 @@

Variant filtering

head -n 3 ${OUT}.lmiss
CHR POS N_DATA  N_GENOTYPE_FILTERED N_MISS  F_MISS
-LG4 2105    20  0   16  0.8
-LG4 2130    20  0   6   0.3
+LG4 2130 20 0 6 0.3 +LG4 2139 20 0 12 0.6

and plot to get an idea of if there are sites that lack data.

@@ -889,16 +889,16 @@

Variant filtering

cat ${OUT}.het
INDV    O(HOM)  E(HOM)  N_SITES F
-PUN-R-ELF   72100   65267.6 86983   0.31463
-PUN-R-JMC   67871   65458.4 87179   0.11107
-PUN-R-LH    70524   65653.5 87430   0.22366
-PUN-R-MT    69560   65381.0 87028   0.19305
-PUN-R-UCSD  74271   65214.7 87162   0.41264
-PUN-Y-BCRD  68937   64547.6 85883   0.20573
-PUN-Y-INJ   70810   63737.1 84608   0.33889
-PUN-Y-LO    72427   62975.0 83991   0.44975
-PUN-Y-PCT   74072   64096.2 85537   0.46527
-PUN-Y-POTR  71243   64148.5 85373   0.33426
+PUN-R-ELF 72323 65412.9 87084 0.31886 +PUN-R-JMC 67969 65599.8 87267 0.10934 +PUN-R-LH 70693 65839.7 87590 0.22314 +PUN-R-MT 69776 65553.5 87168 0.19536 +PUN-R-UCSD 74411 65349.5 87244 0.41387 +PUN-Y-BCRD 69165 64719.3 86011 0.20880 +PUN-Y-INJ 70962 63930.5 84770 0.33741 +PUN-Y-LO 72544 63156.1 84143 0.44732 +PUN-Y-PCT 74263 64289.1 85699 0.46586 +PUN-Y-POTR 71483 64301.6 85491 0.33891

Here, F is a measure of how much the observed homozygotes O(HOM) differ from the expected (E(HOM); expected by chance under Hardy-Weinberg equilibrium), and may be negative. vcftools calculates F from the expression \(F=(O-E)/(N-E)\)8, which you can verify by substituting the variables in the output.

@@ -953,11 +953,11 @@

Variant filtering

vcftools --vcf - --het --stdout 2>/dev/null
INDV    O(HOM)  E(HOM)  N_SITES F
-PUN-R-ELF   41674   37654.5 56557   0.21264
-PUN-R-JMC   37344   37720.2 56652   -0.01987
-PUN-R-LH    39812   37798.2 56718   0.10644
-PUN-R-MT    38811   37532.7 56279   0.06819
-PUN-R-UCSD  43475   37421.8 56366   0.31953
+PUN-R-ELF 41613 37537.7 56374 0.21635 +PUN-R-JMC 37181 37610.7 56479 -0.02277 +PUN-R-LH 39699 37708.3 56596 0.10540 +PUN-R-MT 38768 37454.0 56160 0.07024 +PUN-R-UCSD 43364 37322.2 56197 0.32010
@@ -993,14 +993,14 @@

Variant filtering

bcftools stats $OUTVCF | grep "^SN"
SN  0   number of samples:  10
-SN  0   number of records:  37596
+SN  0   number of records:  37682
 SN  0   number of no-ALTs:  0
-SN  0   number of SNPs: 37389
+SN  0   number of SNPs: 37480
 SN  0   number of MNPs: 0
 SN  0   number of indels:   0
 SN  0   number of others:   0
-SN  0   number of multiallelic sites:   1640
-SN  0   number of multiallelic SNP sites:   679
+SN 0 number of multiallelic sites: 1647 +SN 0 number of multiallelic SNP sites: 681

Quite a substantial portion variants have in fact been removed, which here can most likely be attributed to the low average sequencing coverage.

@@ -1319,13 +1319,10 @@

Variant filtering

The syntax to generate coverage information for a BAM file is mosdepth <prefix> <input file>. Here, we add the -Q option to exclude reads with a mapping quality less than 20:

mosdepth -Q 20 PUN-Y-INJ PUN-Y-INJ.sort.dup.recal.bam
-
-
[W::hts_idx_load3] The index file is older than the data file: PUN-Y-INJ.sort.dup.recal.bai
-

The per-base coverage output file will be named PUN-Y-INJ.per-base.bed.gz and can be viewed with bgzip:

-
bgzip -c -d PUN-Y-INJ.per-base.bed.gz | head -n 5
+
bgzip -c -d PUN-Y-INJ.per-base.bed.gz | head -n 5
LG4 0   9   3
 LG4 9   13  4
@@ -1339,11 +1336,11 @@ 

Variant filtering

The second is csvtk, a program that provides tools for dealing with tab- or comma-separated text files. Avid R users will see that many of the subcommands are similar to those in the tidyverse dplyr package.

We can use these programs in a one-liner to generate a simple coverage plot (Figure 10)11

-
bedtools intersect -a <(bedtools makewindows -g ${REF}.fai -w 1000) \
-      -b PUN-Y-INJ.per-base.bed.gz -wa -wb | \
-  bedtools groupby -i - -g 1,2,3 -c 7 -o mean | \
-  csvtk plot -t line -x 2 -y 4 --point-size 0.01 --xlab Position \
-      --ylab Coverage --width 9.0 --height 3.5 > fig-plot-coverage.png
+
bedtools intersect -a <(bedtools makewindows -g ${REF}.fai -w 1000) \
+      -b PUN-Y-INJ.per-base.bed.gz -wa -wb | \
+  bedtools groupby -i - -g 1,2,3 -c 7 -o mean | \
+  csvtk plot -t line -x 2 -y 4 --point-size 0.01 --xlab Position \
+      --ylab Coverage --width 9.0 --height 3.5 > fig-plot-coverage.png
@@ -1354,22 +1351,21 @@

Variant filtering

Apparently there are some high-coverage regions that could be associated with, e.g., collapsed repeat regions in the assembly. Let’s compile coverage results for all samples, using bash string manipulation to generate file prefix12

-
# [A-Z] matches characters A-Z, * is a wildcard character that matches
-# anything
-for f in [A-Z]*.sort.dup.recal.bam; do
- # Extract the prefix by removing .sort.dup.recal.bam
- prefix=${f%.sort.dup.recal.bam}
- mosdepth -Q 20 $prefix $f
- # Print sample name
- echo -e -n "$prefix\t"
- # Get the summary line containing the total coverage
- cat $prefix.mosdepth.summary.txt | grep total
-done > ALL.mosdepth.summary.txt
-# View the file
-cat ALL.mosdepth.summary.txt
+
# [A-Z] matches characters A-Z, * is a wildcard character that matches
+# anything
+for f in [A-Z]*.sort.dup.recal.bam; do
+ # Extract the prefix by removing .sort.dup.recal.bam
+ prefix=${f%.sort.dup.recal.bam}
+ mosdepth -Q 20 $prefix $f
+ # Print sample name
+ echo -e -n "$prefix\t"
+ # Get the summary line containing the total coverage
+ cat $prefix.mosdepth.summary.txt | grep total
+done > ALL.mosdepth.summary.txt
+# View the file
+cat ALL.mosdepth.summary.txt
-
[W::hts_idx_load3] The index file is older than the data file: PUN-Y-INJ.sort.dup.recal.bai
-PUN-R-ELF   total   100000  845166  8.45    0   187
+
PUN-R-ELF   total   100000  845166  8.45    0   187
 PUN-R-JMC   total   100000  934032  9.34    0   244
 PUN-R-LH    total   100000  940246  9.40    0   225
 PUN-R-MT    total   100000  956272  9.56    0   233
@@ -1383,7 +1379,7 @@ 

Variant filtering

We can calculate the total coverage by summing the values of the fifth column with csvtk as follows:

-
csvtk summary -H -t ALL.mosdepth.summary.txt -f 5:sum
+
csvtk summary -H -t ALL.mosdepth.summary.txt -f 5:sum

to get the total coverage 84.78, which gives a hint at where the diploid coverage peak should be.

Sample set coverages

@@ -1391,10 +1387,10 @@

Variant filtering

We can combine the coverage output from different samples with bedtools unionbedg. We begin by generating a coverage file for all samples, where the output columns will correspond to individual samples. To save typing, we collect the sample names and generate matching BED file names to pass as arguments to options -names and -i, respectively. Also, we include positions with no coverage (-empty) which requires the use of a genome file (option -g). The BED output is piped to bgzip which compresses the output, before finally indexing with tabix13.

-
SAMPLES=$(csvtk grep -f Taxon -r -p "yellow" -r -p "red" sampleinfo.csv | csvtk cut -f SampleAlias | grep -v SampleAlias | tr "\n" " ")
-BEDGZ=$(for sm in $SAMPLES; do echo -e -n "${sm}.per-base.bed.gz "; done)
-bedtools unionbedg -header -names $SAMPLES -g ${REF}.fai -empty -i $BEDGZ | bgzip > ALL.bg.gz
-tabix -f -p bed -S 1 ALL.bg.gz
+
SAMPLES=$(csvtk grep -f Taxon -r -p "yellow" -r -p "red" sampleinfo.csv | csvtk cut -f SampleAlias | grep -v SampleAlias | tr "\n" " ")
+BEDGZ=$(for sm in $SAMPLES; do echo -e -n "${sm}.per-base.bed.gz "; done)
+bedtools unionbedg -header -names $SAMPLES -g ${REF}.fai -empty -i $BEDGZ | bgzip > ALL.bg.gz
+tabix -f -p bed -S 1 ALL.bg.gz
@@ -1489,21 +1485,21 @@

Variant filtering

Since we eventually want to filter on total coverage, we sum per sample coverages for each sample set with awk:

-
bgzip -c -d ALL.bg.gz | \
- awk -v FS="\t" -v OFS="\t" 'NR > 1 {sum=0; for (i=4; i<=NF; i++) sum+=$i; print $1, $2, $3, sum}' | \
- bgzip > ALL.sum.bed.gz
-tabix -f -p bed ALL.sum.bed.gz
+
bgzip -c -d ALL.bg.gz | \
+ awk -v FS="\t" -v OFS="\t" 'NR > 1 {sum=0; for (i=4; i<=NF; i++) sum+=$i; print $1, $2, $3, sum}' | \
+ bgzip > ALL.sum.bed.gz
+tabix -f -p bed ALL.sum.bed.gz

Here we use awk to sum from columns 4 and up (NF is the number of the last column).

For illustration, we plot the total coverage:

-
bedtools intersect -a <(bedtools makewindows -g ${REF}.fai -w 1000) \
-    -b ALL.sum.bed.gz -wa -wb | \
-  bedtools groupby -i - -g 1,2,3 -c 7 -o mean | \
-  csvtk plot -t line -x 2 -y 4 --point-size 0.01 --xlab Position \
-    --ylab Coverage --width 9.0 --height 3.5 > fig-plot-total-coverage.png
+
bedtools intersect -a <(bedtools makewindows -g ${REF}.fai -w 1000) \
+    -b ALL.sum.bed.gz -wa -wb | \
+  bedtools groupby -i - -g 1,2,3 -c 7 -o mean | \
+  csvtk plot -t line -x 2 -y 4 --point-size 0.01 --xlab Position \
+    --ylab Coverage --width 9.0 --height 3.5 > fig-plot-total-coverage.png
@@ -1516,24 +1512,24 @@

Variant filtering

In order to define thresholds for subsequent filtering, we need to know the total coverage distribution. Therefore, we plot the proportion of the genome coverage versus depth of coverage (similar to k-mer plots in sequence assembly projects). To do so we must summarize our coverage file such that we count how many bases have a given coverage. This can be achieved by noting that each row in the BED file consists of the columns CHROM, START, END, and COVERAGE. We can generate a histogram table by, for each value of COVERAGE, summing the length of the regions (END - START).

-
# Add column containing length of region (end - start)
-csvtk mutate2 -t -H -w 0 -e '$3-$2' ALL.sum.bed.gz | \
- # Sum the regions and *group* by the coverage (fourth column);
- # this gives the total number of bases with a given coverage
- csvtk summary -t -H -g 4 -f 5:sum -w 0 | \
- csvtk sort -t -k 1:n | \
- awk -v cumsum=0 'BEGIN {OFS=","; cumsum=0} {cumsum += $2; print $1,$2,cumsum}' > ALL.sum.bed.csv
+
# Add column containing length of region (end - start)
+csvtk mutate2 -t -H -w 0 -e '$3-$2' ALL.sum.bed.gz | \
+ # Sum the regions and *group* by the coverage (fourth column);
+ # this gives the total number of bases with a given coverage
+ csvtk summary -t -H -g 4 -f 5:sum -w 0 | \
+ csvtk sort -t -k 1:n | \
+ awk -v cumsum=0 'BEGIN {OFS=","; cumsum=0} {cumsum += $2; print $1,$2,cumsum}' > ALL.sum.bed.csv

We plot the coverage distribution below, along with a plot of the cumulative coverage.

-
csvtk plot line -H ALL.sum.bed.csv -x 1 -y 2 --point-size 0.01 \
-   --xlab "Depth of coverage (X)" --ylab "Genome coverage (bp)" \
-   --width 9.0 --height 3.5 > fig-plot-total-coverage-distribution.png
-csvtk plot line -H ALL.sum.bed.csv -x 1 -y 3 --point-size 0.01 \
-   --xlab "Depth of coverage (X)" --ylab "Cumulative genome coverage (kbp)" \
-   --width 9.0 --height 3.5 > fig-plot-total-coverage-distribution-cumulative.png
+
csvtk plot line -H ALL.sum.bed.csv -x 1 -y 2 --point-size 0.01 \
+   --xlab "Depth of coverage (X)" --ylab "Genome coverage (bp)" \
+   --width 9.0 --height 3.5 > fig-plot-total-coverage-distribution.png
+csvtk plot line -H ALL.sum.bed.csv -x 1 -y 3 --point-size 0.01 \
+   --xlab "Depth of coverage (X)" --ylab "Cumulative genome coverage (kbp)" \
+   --width 9.0 --height 3.5 > fig-plot-total-coverage-distribution-cumulative.png
@@ -1554,9 +1550,9 @@

Variant filtering

In Figure 12 a, a diploid peak is evident at around coverage X=100; we zoom in on that region to get a better view:

-
csvtk plot line -H ALL.sum.bed.csv -x 1 -y 2 --point-size 0.01 \
-    --xlab "Depth of coverage (X)" --ylab "Genome coverage (bp)" \
-    --width 9.0 --height 3.5 --x-min 40 --x-max 140
+
csvtk plot line -H ALL.sum.bed.csv -x 1 -y 2 --point-size 0.01 \
+    --xlab "Depth of coverage (X)" --ylab "Genome coverage (bp)" \
+    --width 9.0 --height 3.5 --x-min 40 --x-max 140
@@ -1571,10 +1567,10 @@

Variant filtering

We then use these thresholds to generate a BED file containing regions that are accessible, i.e., have sufficient coverage for downstream analyses. We also calculate the number of bases that pass the filtering criteria.

-
csvtk filter -t -H ALL.sum.bed.gz -f '4>50' | \
- csvtk filter -t -H -f '4<110' | \
- bgzip -c > ALL.sum.depth.bed.gz
-bedtools genomecov -i ALL.sum.depth.bed.gz -g ${REF}.fai  | grep genome
+
csvtk filter -t -H ALL.sum.bed.gz -f '4>50' | \
+ csvtk filter -t -H -f '4<110' | \
+ bgzip -c > ALL.sum.depth.bed.gz
+bedtools genomecov -i ALL.sum.depth.bed.gz -g ${REF}.fai  | grep genome

Consequently, 75.2% of the genome is accessible by depth.

@@ -1610,26 +1606,26 @@

Variant filtering

We base the answer on the previous code.

-
for pop in red yellow; do
- bgzip -c -d $pop.bg.gz | \
-  awk -v FS="\t" -v OFS="\t" 'NR > 1 {sum=0; for (i=4; i<=NF; i++) sum+=$i; print $1, $2, $3, sum}' | \
-  bgzip > $pop.sum.bed.gz
- tabix -f -p bed $pop.sum.bed.gz
- csvtk mutate2 -t -H -w 0 -e '$3-$2' $pop.sum.bed.gz | \
- csvtk summary -t -H -g 4 -f 5:sum -w 0 | \
- csvtk sort -t -k 1:n | \
- awk -v cumsum=0 'BEGIN {OFS=","; cumsum=0} {cumsum += $2; print $1,$2,cumsum}' > $pop.sum.bed.csv
-done
+
for pop in red yellow; do
+ bgzip -c -d $pop.bg.gz | \
+  awk -v FS="\t" -v OFS="\t" 'NR > 1 {sum=0; for (i=4; i<=NF; i++) sum+=$i; print $1, $2, $3, sum}' | \
+  bgzip > $pop.sum.bed.gz
+ tabix -f -p bed $pop.sum.bed.gz
+ csvtk mutate2 -t -H -w 0 -e '$3-$2' $pop.sum.bed.gz | \
+ csvtk summary -t -H -g 4 -f 5:sum -w 0 | \
+ csvtk sort -t -k 1:n | \
+ awk -v cumsum=0 'BEGIN {OFS=","; cumsum=0} {cumsum += $2; print $1,$2,cumsum}' > $pop.sum.bed.csv
+done
-
for pop in red yellow; do
-cat ${pop}.sum.bed.csv | \
-  csvtk plot line -x 1 -y 2 --point-size 0.01 \
-    --xlab "Depth of coverage (X)" --ylab "Genome coverage (bp)" \
-    --width 9.0 --height 3.5 --x-min 0 --x-max 100 > \
-    fig-plot-total-coverage-distribution-hist-zoom-in-$pop.png
-done
+
for pop in red yellow; do
+cat ${pop}.sum.bed.csv | \
+  csvtk plot line -x 1 -y 2 --point-size 0.01 \
+    --xlab "Depth of coverage (X)" --ylab "Genome coverage (bp)" \
+    --width 9.0 --height 3.5 --x-min 0 --x-max 100 > \
+    fig-plot-total-coverage-distribution-hist-zoom-in-$pop.png
+done
@@ -1649,17 +1645,17 @@

Variant filtering

Based on Figure 14, we generate bed files with depths passing cutoffs:

-
# red filter: 20-60
-csvtk filter -t -H red.sum.bed.gz -f '4>20' | \
- csvtk filter -t -H -f '4<60' | \
- bgzip -c > red.sum.depth.bed.gz
-bedtools genomecov -i red.sum.depth.bed.gz -g ${REF}.fai  | grep genome
-
-# yellow filter: 20-55
-csvtk filter -t -H yellow.sum.bed.gz -f '4>20' | \
- csvtk filter -t -H -f '4<55' | \
- bgzip -c > yellow.sum.depth.bed.gz
-bedtools genomecov -i yellow.sum.depth.bed.gz -g ${REF}.fai  | grep genome
+
# red filter: 20-60
+csvtk filter -t -H red.sum.bed.gz -f '4>20' | \
+ csvtk filter -t -H -f '4<60' | \
+ bgzip -c > red.sum.depth.bed.gz
+bedtools genomecov -i red.sum.depth.bed.gz -g ${REF}.fai  | grep genome
+
+# yellow filter: 20-55
+csvtk filter -t -H yellow.sum.bed.gz -f '4>20' | \
+ csvtk filter -t -H -f '4<55' | \
+ bgzip -c > yellow.sum.depth.bed.gz
+bedtools genomecov -i yellow.sum.depth.bed.gz -g ${REF}.fai  | grep genome
genome  0   20853   100000  0.20853
 genome  1   79147   100000  0.79147
@@ -1682,10 +1678,10 @@ 

Variant filtering

In addition to filtering on total coverage, we will also filter on the minimum number of individuals with a minimum depth. This is to account for cases where regions that pass the minimum coverage filter originate from just a few samples with unusually high coverage. Here, we will remove sites where more than 50% of individuals have zero coverage.

-
bgzip -c -d ALL.bg.gz | \
-  awk -v FS="\t" 'BEGIN {OFS="\t"} NR > 1 {count=0; for (i=4; i<=NF; i++) {if ($i>0) count+=1}; if (count>=((NF-3)*0.5)) {print $1, $2, $3}}' | \
-  bgzip > ALL.ind.bed.gz
-tabix -f -p bed ALL.ind.bed.gz
+
bgzip -c -d ALL.bg.gz | \
+  awk -v FS="\t" 'BEGIN {OFS="\t"} NR > 1 {count=0; for (i=4; i<=NF; i++) {if ($i>0) count+=1}; if (count>=((NF-3)*0.5)) {print $1, $2, $3}}' | \
+  bgzip > ALL.ind.bed.gz
+tabix -f -p bed ALL.ind.bed.gz
@@ -1717,7 +1713,7 @@

Variant filtering

-
bedtools genomecov -i ALL.ind.bed.gz -g ${REF}.fai
+
bedtools genomecov -i ALL.ind.bed.gz -g ${REF}.fai
LG4 0   7694    100000  0.07694
 LG4 1   92306   100000  0.92306
@@ -1767,10 +1763,10 @@ 

Variant filtering

-
for pop in red yellow; do
-  bgzip -c -d $pop.bed.gz | awk -v FS="\t" 'BEGIN {OFS="\t"} NR > 1 {count=0; for (i=4; i<=NF; i++) {if ($i>0) count+=1}; if (count>=((NF-3)*0.5)) {print $1, $2, $3}}' | bgzip > $pop.ind.bed.gz
-  tabix -f -p bed $pop.ind.bed.gz
-done
+
for pop in red yellow; do
+  bgzip -c -d $pop.bed.gz | awk -v FS="\t" 'BEGIN {OFS="\t"} NR > 1 {count=0; for (i=4; i<=NF; i++) {if ($i>0) count+=1}; if (count>=((NF-3)*0.5)) {print $1, $2, $3}}' | bgzip > $pop.ind.bed.gz
+  tabix -f -p bed $pop.ind.bed.gz
+done
[bgzip] Could not open red.bed.gz: No such file or directory
 [bgzip] Could not open yellow.bed.gz: No such file or directory
@@ -1790,17 +1786,17 @@

Variant filtering

Sequence masks

Finally, as before, we can convert the BED output to sequence mask files in FASTA format. Recall that we first have to make a template genome mask file where all positions are masked:

-
awk 'BEGIN {OFS="\t"} {print $1, 0, $2}' ${REF}.fai > ${REF}.bed
-bedtools maskfasta -fi ${REF} -mc 1 -fo ${REF}.mask.fa -bed ${REF}.bed
+
awk 'BEGIN {OFS="\t"} {print $1, 0, $2}' ${REF}.fai > ${REF}.bed
+bedtools maskfasta -fi ${REF} -mc 1 -fo ${REF}.mask.fa -bed ${REF}.bed

Then, for each sample set, we will use bedtools intersect to intersect the BED files corresponding to the total sum coverage and the filter on number of individuals. bedtools intersect makes it easy to combine multiple BED files, so any other filters, or genomic features such as exons, could be added to make a compound mask file. The resulting BED files is used as input to bedtools maskfasta.

-
bedtools intersect -a ALL.sum.depth.bed.gz -b ALL.ind.bed.gz \
-   -g ${REF}.fai | bgzip > ALL.unmask.bed.gz
-tabix -f -p bed ALL.unmask.bed.gz
-bedtools maskfasta -fi ${REF}.mask.fa -mc 0 -fo ${REF}.unmask.fa \
-   -bed ALL.unmask.bed.gz
-head -n 3 ${REF}.unmask.fa
+
bedtools intersect -a ALL.sum.depth.bed.gz -b ALL.ind.bed.gz \
+   -g ${REF}.fai | bgzip > ALL.unmask.bed.gz
+tabix -f -p bed ALL.unmask.bed.gz
+bedtools maskfasta -fi ${REF}.mask.fa -mc 0 -fo ${REF}.unmask.fa \
+   -bed ALL.unmask.bed.gz
+head -n 3 ${REF}.unmask.fa
>LG4
 111111111111111111111111111111111111111111111111111111111111
@@ -1809,10 +1805,10 @@ 

Variant filtering

We can once again convince ourselves that this has worked by counting the number of unmasked positions:

-
# tr: -d deletes all characters not (-c, complement) in the character
-# set '0'. wc: -m option counts characters
-cat ${REF}.unmask.fa | tr -d -c '0' | wc -m
-bedtools genomecov -i ALL.unmask.bed.gz -g ${REF}.fai | grep genome
+
# tr: -d deletes all characters not (-c, complement) in the character
+# set '0'. wc: -m option counts characters
+cat ${REF}.unmask.fa | tr -d -c '0' | wc -m
+bedtools genomecov -i ALL.unmask.bed.gz -g ${REF}.fai | grep genome
75236
 genome  0   24764   100000  0.24764
@@ -1850,13 +1846,13 @@ 

Variant filtering

-
for pop in red yellow; do
- bedtools intersect -a $pop.sum.depth.bed.gz -b $pop.ind.bed.gz \
-    -g ${REF}.fai | bgzip > $pop.unmask.bed.gz
- tabix -f -p bed $pop.unmask.bed.gz
- bedtools maskfasta -fi ${REF}.mask.fa -mc 0 -fo ${REF}.$pop.unmask.fa \
-    -bed $pop.unmask.bed.gz
-done
+
for pop in red yellow; do
+ bedtools intersect -a $pop.sum.depth.bed.gz -b $pop.ind.bed.gz \
+    -g ${REF}.fai | bgzip > $pop.unmask.bed.gz
+ tabix -f -p bed $pop.unmask.bed.gz
+ bedtools maskfasta -fi ${REF}.mask.fa -mc 0 -fo ${REF}.$pop.unmask.fa \
+    -bed $pop.unmask.bed.gz
+done
@@ -2167,7 +2163,7 @@

Variant filtering

- + \ No newline at end of file diff --git a/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.dvi b/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.dvi index 2a0bc0b5..b855562a 100644 Binary files a/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.dvi and b/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.dvi differ diff --git a/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.svg b/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.svg index f7865a0c..247a313f 100644 --- a/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.svg +++ b/exercises/variant_filtering/index_files/figure-html/fig-coverage-filters-1.svg @@ -2,9 +2,9 @@ diff --git a/slides/foundations/index_files/figure-revealjs/fig-selective-sweep-phases-1.dvi b/slides/foundations/index_files/figure-revealjs/fig-selective-sweep-phases-1.dvi index 875d30a5..f3b983e8 100644 Binary files a/slides/foundations/index_files/figure-revealjs/fig-selective-sweep-phases-1.dvi and b/slides/foundations/index_files/figure-revealjs/fig-selective-sweep-phases-1.dvi differ diff --git a/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.dvi b/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.dvi index 2fd67900..3a139af3 100644 Binary files a/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.dvi and b/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.dvi differ diff --git a/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.svg b/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.svg index d1c8ea1d..13a6454d 100644 --- a/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.svg +++ b/slides/foundations/index_files/figure-revealjs/gas-particle-drift-1.svg @@ -2,7 +2,7 @@ diff --git a/slides/foundations/index_files/figure-revealjs/mutation-models-1.dvi b/slides/foundations/index_files/figure-revealjs/mutation-models-1.dvi index 307d997a..436f3084 100644 Binary files a/slides/foundations/index_files/figure-revealjs/mutation-models-1.dvi and b/slides/foundations/index_files/figure-revealjs/mutation-models-1.dvi differ diff --git a/slides/foundations/index_files/figure-revealjs/mutation-models-1.svg b/slides/foundations/index_files/figure-revealjs/mutation-models-1.svg index f0870732..95123c8e 100644 --- a/slides/foundations/index_files/figure-revealjs/mutation-models-1.svg +++ b/slides/foundations/index_files/figure-revealjs/mutation-models-1.svg @@ -2,8 +2,8 @@

DNA sequences in FASTQ format

-
-rw-r--r-- 1 runner docker 302K Nov  7 14:22 PUN-Y-INJ_R1.fastq.gz
--rw-r--r-- 1 runner docker 393K Nov  7 14:22 PUN-Y-INJ_R2.fastq.gz
+
-rw-r--r-- 1 runner docker 302K Nov  7 14:47 PUN-Y-INJ_R1.fastq.gz
+-rw-r--r-- 1 runner docker 393K Nov  7 14:47 PUN-Y-INJ_R2.fastq.gz
diff --git a/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.dvi b/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.dvi index d7fa77c9..2e20ea4d 100644 Binary files a/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.dvi and b/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.dvi differ diff --git a/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.svg b/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.svg index eb853720..0512a688 100644 --- a/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.svg +++ b/slides/variant_calling/index_files/figure-revealjs/coalescent-tree-dna-variation-1.svg @@ -2,7 +2,7 @@ diff --git a/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.dvi b/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.dvi index 4f69e305..d675b582 100644 Binary files a/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.dvi and b/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.dvi differ diff --git a/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.svg b/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.svg index 892730e6..1d5a6512 100644 --- a/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.svg +++ b/slides/variant_calling/index_files/figure-revealjs/site-frequency-spectrum-graph-1.svg @@ -2,8 +2,8 @@

Monkeyflower variants

bcftools stats variantsites.vcf.gz | grep "^SN"
SN  0   number of samples:  10
-SN  0   number of records:  13103
+SN  0   number of records:  12809
 SN  0   number of no-ALTs:  0
-SN  0   number of SNPs: 10723
+SN  0   number of SNPs: 10541
 SN  0   number of MNPs: 0
-SN  0   number of indels:   2362
+SN  0   number of indels:   2290
 SN  0   number of others:   0
-SN  0   number of multiallelic sites:   1060
-SN  0   number of multiallelic SNP sites:   223
+SN 0 number of multiallelic sites: 1021 +SN 0 number of multiallelic SNP sites: 222

Use vcftools to compile data to generate summary statistics

@@ -883,14 +883,14 @@

Monkeyflower call set with invariant sites

bcftools stats allsites.vcf.gz | grep "^SN"
SN  0   number of samples:  10
-SN  0   number of records:  9973
-SN  0   number of no-ALTs:  9178
-SN  0   number of SNPs: 400
+SN  0   number of records:  9972
+SN  0   number of no-ALTs:  9173
+SN  0   number of SNPs: 395
 SN  0   number of MNPs: 0
-SN  0   number of indels:   114
+SN  0   number of indels:   129
 SN  0   number of others:   0
 SN  0   number of multiallelic sites:   56
-SN  0   number of multiallelic SNP sites:   7
+SN 0 number of multiallelic SNP sites: 9

Filtering as before but excluding MAF, variant quality filters

diff --git a/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.dvi b/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.dvi index 84c0d033..cee52003 100644 Binary files a/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.dvi and b/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.dvi differ diff --git a/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.svg b/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.svg index 2077f1e1..cad449d2 100644 --- a/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.svg +++ b/slides/variant_filtering/index_files/figure-revealjs/coverage-filters-1-1.svg @@ -2,12 +2,12 @@