Add sidebar entries, slide to variant filtering slides

NBISweden · Nov 2, 2023 · 24fe760
1 parent dbb6f93
commit 24fe760
Showing 11 changed files with 156 additions and 16 deletions.
diff --git a/docs/_website.yml b/docs/_website.yml
@@ -38,14 +38,23 @@ website:
           - slides/pgip/index.qmd
           - slides/foundations/index.qmd
           - slides/simulation/index.qmd
+          - slides/variant_calling/index.qmd
+          - slides/variant_filtering/index.qmd
+          - slides/genetic_diversity/index.qmd
           - slides/population_structure/index.qmd
+          - slides/demography/index.qmd
+          - slides/selection/index.qmd
       - section: Exercises
         contents:
           - exercises/compute_environment/index.qmd
           - exercises/datasets/monkeyflowers.qmd
           - exercises/simulation/index.qmd
           - exercises/variant_calling/index.qmd
           - exercises/variant_filtering/index.qmd
+          - exercises/genetic_diversity/index.qmd
+          - exercises/population_structure/index.qmd
+          - exercises/demography/index.qmd
+          - exercises/selection/index.qmd
       - section: Code recipes
         contents:
           - recipes/index.qmd
diff --git a/docs/assets/bibliography.bib b/docs/assets/bibliography.bib
@@ -1488,6 +1488,18 @@ @book{miller_HumanBiology_2020
   urldate = {2023-10-01},
   langid = {english},
 }
+@article{nazareno_ThereNoRule_2021,
+  title = {There {{Is No}} `{{Rule}} of {{Thumb}}': {{Genomic Filter Settings}} for a {{Small Plant Population}} to {{Obtain Unbiased Gene Flow Estimates}}},
+  shorttitle = {There {{Is No}} `{{Rule}} of {{Thumb}}'},
+  author = {Nazareno, Alison G. and Knowles, L. Lacey},
+  year = {2021},
+  journal = {Frontiers in Plant Science},
+  volume = {12},
+  issn = {1664-462X},
+  url = {https://www.frontiersin.org/articles/10.3389/fpls.2021.677009},
+  urldate = {2023-11-01},
+  abstract = {The application of high-density polymorphic single-nucleotide polymorphisms (SNP) markers derived from high-throughput sequencing methods has heralded plenty of biological questions about the linkages of processes operating at micro- and macroevolutionary scales. However, the effects of SNP filtering practices on population genetic inference have received much less attention. By performing sensitivity analyses, we empirically investigated how decisions about the percentage of missing data (MD) and the minor allele frequency (MAF) set in bioinformatic processing of genomic data affect direct (i.e., parentage analysis) and indirect (i.e., fine-scale spatial genetic structure \textendash{} SGS) gene flow estimates. We focus specifically on these manifestations in small plant populations, and particularly, in the rare tropical plant species Dinizia jueirana-facao, where assumptions implicit to analytical procedures for accurate estimates of gene flow may not hold. Avoiding biases in dispersal estimates are essential given this species is facing extinction risks due to habitat loss, and so we also investigate the effects of forest fragmentation on the accuracy of dispersal estimates under different filtering criteria by testing for recent decrease in the scale of gene flow. Our sensitivity analyses demonstrate that gene flow estimates are robust to different setting of MAF (0.05\textendash 0.35) and MD (0\textendash 20\%). Comparing the direct and indirect estimates of dispersal, we find that contemporary estimates of gene dispersal distance ({$\sigma$}rt = 41.8 m) was {$\sim$} fourfold smaller than the historical estimates, supporting the hypothesis of a temporal shift in the scale of gene flow in D. jueirana-facao, which is consistent with predictions based on recent, dramatic forest fragmentation process. 
While we identified settings for filtering genomic data to avoid biases in gene flow estimates, we stress that there is no `rule of thumb' for bioinformatic filtering and that relying on default program settings is not advisable. Instead, we suggest that the approach implemented here be applied independently in each separate empirical study to confirm appropriate settings to obtain unbiased population genetics estimates.},
+}
 @book{nei_MolecularEvolutionPhylogenetics_2000,
   title = {Molecular {{Evolution}} and {{Phylogenetics}}},
   author = {Nei, Masatoshi and Kumar, Sudhir},
@@ -1498,6 +1510,7 @@ @book{nei_MolecularEvolutionPhylogenetics_2000
   abstract = {This book presents the statistical methods that are useful in the study of molecular evolution and illustrates how to use them in actual data analysis. Molecular evolution has been developing at a great pace over the past decade or so, driven by the huge increase in genetic sequence data from many organisms, the improvement of high-speed microcomputers, and the development of several new methods for phylogenetic analysis. This book for graduate students and researchers, assuming a basic knowledge of evolution, molecular biology, and elementary statistics, should make it possible for many investigators to incorporate refined statistical analysis of large-scale data in their own work. Nei is one of the leading workers in this area. He and Kumar have developed a computer program called MEGA, which has been sold for about \$20 to over 1900 users. For the book, the authors are thoroughly revising MEGA and will make it available via FTP. The book also included analysis using the other most popular programs for phylogenetic studies, including PAUP, PHYLIP, MOLPHY, and PAML.},
   isbn = {978-0-19-513585-5},
 }
+
 @article{nielsen_GenotypeSNPCalling_2011,
   title = {Genotype and {{SNP}} Calling from Next-Generation Sequencing Data},
   author = {Nielsen, Rasmus and Paul, Joshua S. and Albrechtsen, Anders and Song, Yun S.},
@@ -1515,7 +1528,6 @@ @article{nielsen_GenotypeSNPCalling_2011
   langid = {english},
   keywords = {Bioinformatics,Genomics,Next-generation sequencing,Population genetics,Technology},
 }
-
 @article{nielsen_MolecularSignaturesNatural_2005,
   title = {Molecular {{Signatures}} of {{Natural Selection}}},
   author = {Nielsen, Rasmus},

diff --git a/docs/exercises/demography/index.qmd b/docs/exercises/demography/index.qmd
@@ -2,7 +2,6 @@
 title: "Demographic inference"
 author:
   - "André Soares"
-  - "Per Unneberg"
 format: html
 ---
 

diff --git a/docs/exercises/genetic_diversity/index.qmd b/docs/exercises/genetic_diversity/index.qmd
@@ -34,6 +34,15 @@ compile statistics along a sequence. By scanning variation in windows
 along the sequence (a.k.a. genomic scan) we can identify outlier
 regions whose pattern of variation could potentially be attributed to
 causes other than neutral processes, such as adaptation or migration.
+We will use the Monkeyflower system to generate a diversity landscape.
+
+::: {.callout-important}
+
+The commands of this document have been run on a subset (a subregion)
+of the data. Therefore, although you will use the same commands, your
+results will differ from those presented here.
+
+:::
 
 ::: {.callout-tip collapse=true}
 
@@ -91,7 +100,7 @@ Execute the following command to load modules:
 module load uppmax bioinfo-tools \
     bcftools/1.17 \
     BEDTools/2.29.2 \
-       vcftools/0.1.16
+    vcftools/0.1.16
 ```
 
 `csvtk` has been added to the module system and can be loaded as

diff --git a/docs/exercises/index.qmd b/docs/exercises/index.qmd
@@ -44,15 +44,14 @@ URL that hosts the actual exercise instructions.
 ## On self-assessment exercise blocks
 
 Scattered throughout the documents are exercise blocks, with hidden
-answers, and in some cases, hints. The exercises are for
+answers, and, in some cases, hints. The exercises are for
 self-assessment of your understanding, but they are not mandatory.
 
-Some of the exercises (labelled with the linux penguin {{< fa
-brands linux >}}) are related to the usage of the command line
-interfaces (CLI), and how to obtain information about what a program
-does. This is an essential skill when working in Linux/UNIX
-environments! These exercises can be skipped if you are an experienced
-Linux/UNIX user.
+Some of the exercises (labelled with the Linux penguin {{< fa brands
+linux >}}) are related to the usage of the command line interfaces
+(CLI), and how to obtain information about what a program does. This
+is an essential skill when working in Linux/UNIX environments! These
+exercises can be skipped if you are an experienced Linux/UNIX user.
 
 An example exercise is provided here:
 

diff --git a/docs/exercises/population_structure/index.qmd b/docs/exercises/population_structure/index.qmd
@@ -2,7 +2,6 @@
 title: "Population structure"
 author:
   - "Nikolay Oskolkov"
-  - "Per Unneberg"
 format: html
 ---
 

diff --git a/docs/exercises/selection/index.qmd b/docs/exercises/selection/index.qmd
@@ -2,7 +2,6 @@
 title: Selection
 author:
   - "Jason Hill"
-  - "Per Unneberg"
 format: html
 ---
 

diff --git a/docs/exercises/variant_filtering/index.qmd b/docs/exercises/variant_filtering/index.qmd
@@ -31,6 +31,14 @@ sites only, followed by an approach that filters on sequencing depth
 in a variant file containing both variant and invariant sites. The
 latter methodology can then be generalized to generate depth-based filters from BAM files.
 
+::: {.callout-important}
+
+The commands of this document have been run on a subset (a subregion)
+of the data. Therefore, although you will use the same commands, your
+results will differ from those presented here.
+
+:::
+
 ::: {.callout-tip collapse=true}
 
 ## Learning objectives
@@ -637,6 +645,15 @@ deficit of heterozygotes (and consequently, positive F) simply due to
 something called the [Wahlund
 effect](https://en.wikipedia.org/wiki/Wahlund_effect).
 
+::: {.callout-warning}
+
+The inbreeding coefficient is a population-level statistic and is not
+reliable for small sample sizes ($n<10$, say). Our sample size is in
+the lower range, so the results should be taken with a grain of
+salt.
+
+:::
+
 ::: {.callout-exercise}
 
 Use `bcftools view -s SAMPLENAMES | vcftools --vcf - --het --stdout` to

diff --git a/docs/slides/demography/index.qmd b/docs/slides/demography/index.qmd
@@ -2,7 +2,6 @@
 title: "Demographic inference"
 author:
   - "Andre Soares"
-  - "Per Unneberg"
 format: html
 ---
 

diff --git a/docs/slides/selection/index.qmd b/docs/slides/selection/index.qmd
@@ -2,7 +2,6 @@
 title: "Selection"
 author:
   - "Jason Hill"
-  - "Per Unneberg"
 format: html
 ---
 

diff --git a/docs/slides/variant_filtering/index.qmd b/docs/slides/variant_filtering/index.qmd
@@ -223,6 +223,104 @@ Source: [@lou_BeginnerGuideLowcoverage_2021]
 
 :::
 
+## Guidelines? What guidelines? {.smallr}
+
+#### GATK hard filters
+
+> However, because we want to help, we have formulated some generic
+> recommendations that should at least provide a starting point for
+> people to experiment with their data.
+
+::: {.fragment}
+
+:::: {.columns}
+
+::: {.column width="50%"}
+
+##### SNPs
+
+```
+QualByDepth (QD) < 2.0
+RMSMappingQuality (MQ) < 40.0
+FisherStrand (FS) > 60.0
+StrandOddsRatio (SOR) > 3.0
+MappingQualityRankSumTest (MQRankSum) < -12.5
+ReadPosRankSumTest (ReadPosRankSum) < -8.0
+```
+
+:::
+
+::: {.column width="50%"}
+
+##### Indels
+
+```
+QualByDepth (QD) < 2.0
+ReadPosRankSumTest (ReadPosRankSum) < -20.0
+InbreedingCoeff < -0.8
+FisherStrand (FS) > 200.0
+StrandOddsRatio (SOR) > 10.0
+```
+
+:::
+
+::::
+
+:::
+
+::: {.fragment}
+
+> That said, you ABSOLUTELY SHOULD NOT expect to run these commands
+> and be done with your analyses.
+
+:::
+
+::: {.flushright .smallest .translatey50}
+
+<https://gatk.broadinstitute.org/hc/en-us/articles/360037499012>
+
+:::
+
+::: {.fragment}
+
+#### On RAD-seq filtering
+
+> ... the effects of SNP filtering practices on population genetic
+> inference have received much less attention
+
+::: {.flushright .smallest .translatey50 }
+
+There Is No ‘Rule of Thumb’: Genomic Filter Settings for a Small Plant
+Population to Obtain Unbiased Gene Flow Estimates
+[@nazareno_ThereNoRule_2021]
+
+:::
+
+:::
+
+::: {.notes}
+
+General guidelines on manual filters are not discussed much in the
+literature, simply because there is no single set of rules of
+thumb. Every problem requires its own settings, as the GATK
+developers maintain.
+
+GATK guidelines explained (see <https://gatk.broadinstitute.org/hc/en-us/articles/360035890471>):
+
+- QualByDepth (QD): variant confidence (QUAL) divided by unfiltered depth
+- FisherStrand (FS): checks for strand bias (i.e., if minor allele
+  occurs more often on one strand)
+- StrandOddsRatio (SOR): alternative strand bias test
+- RMSMappingQuality (MQ): root mean square mapping quality over all
+  reads
+- MappingQualityRankSumTest (MQRankSum): compares mapping qualities of
+  reads supporting ref and alt alleles
+- ReadPosRankSumTest (ReadPosRankSum): looks at site position within reads
+- InbreedingCoeff: population-level statistic that requires at least
+  10 individuals
+
+:::
+
 ## What about machine learning?
 
 :::: {.columns}
@@ -254,7 +352,7 @@ dimensions
 
 ::: {.fragment}
 
-Database of known variants often *not* known for non-model organisms.
+Caveat: a database of known variants is often *not* available for non-model organisms.
 
 :::
 
@@ -448,7 +546,8 @@ ggplot(data, aes(x=F, y=INDV)) + geom_point(size=3) +
 ```
 
 - F=0: Hardy-Weinberg Equilibrium
-- F>0: deficit of heterozygotes; inbreedeng, allele dropout
+- F>0: deficit of heterozygotes; inbreeding, Wahlund
+  effect (population substructure), allele dropout
 - F<0: surplus of heterozygotes; could be sample contamination, poor
   sequence quality (mismapping)
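
The F interpretation in the last hunk can be made concrete with a small Python sketch. It assumes the per-individual columns reported by `vcftools --het` (`O(HOM)`, `E(HOM)`, `N_SITES`) and the method-of-moments estimate F = (O(HOM) - E(HOM)) / (N_SITES - E(HOM)). The function names and the tolerance `tol` are hypothetical, chosen only for illustration, and are not part of the course material.

```python
def inbreeding_coefficient(observed_hom: float, expected_hom: float, n_sites: int) -> float:
    """Method-of-moments F from observed and expected homozygous site counts."""
    denom = n_sites - expected_hom
    if denom == 0:
        # No heterozygous sites expected, so F is undefined
        raise ValueError("expected homozygosity equals number of sites")
    return (observed_hom - expected_hom) / denom


def interpret_f(f: float, tol: float = 0.05) -> str:
    """Map F to the interpretation used on the slide.

    `tol` is an arbitrary, hypothetical tolerance around zero.
    """
    if abs(f) <= tol:
        return "close to Hardy-Weinberg equilibrium"
    if f > 0:
        return "deficit of heterozygotes (inbreeding, Wahlund effect, allele dropout)"
    return "surplus of heterozygotes (possible contamination or mismapping)"


if __name__ == "__main__":
    # Made-up counts for three individuals: (O(HOM), E(HOM), N_SITES)
    examples = {"ind1": (60, 60.0, 100), "ind2": (80, 60.0, 100), "ind3": (50, 60.0, 100)}
    for name, (o_hom, e_hom, n) in examples.items():
        f = inbreeding_coefficient(o_hom, e_hom, n)
        print(name, round(f, 3), interpret_f(f))
```

The counts are invented for illustration. With fewer than about ten samples the expected homozygosity is itself poorly estimated, so any such classification should be read with caution.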