Add sidebar entries, slide to variant filtering slides
percyfal committed Nov 2, 2023
1 parent dbb6f93 commit 24fe760
Showing 11 changed files with 156 additions and 16 deletions.
9 changes: 9 additions & 0 deletions docs/_website.yml
@@ -38,14 +38,23 @@ website:
- slides/pgip/index.qmd
- slides/foundations/index.qmd
- slides/simulation/index.qmd
- slides/variant_calling/index.qmd
- slides/variant_filtering/index.qmd
- slides/genetic_diversity/index.qmd
- slides/population_structure/index.qmd
- slides/demography/index.qmd
- slides/selection/index.qmd
- section: Exercises
contents:
- exercises/compute_environment/index.qmd
- exercises/datasets/monkeyflowers.qmd
- exercises/simulation/index.qmd
- exercises/variant_calling/index.qmd
- exercises/variant_filtering/index.qmd
- exercises/genetic_diversity/index.qmd
- exercises/population_structure/index.qmd
- exercises/demography/index.qmd
- exercises/selection/index.qmd
- section: Code recipes
contents:
- recipes/index.qmd
14 changes: 13 additions & 1 deletion docs/assets/bibliography.bib
@@ -1488,6 +1488,18 @@ @book{miller_HumanBiology_2020
urldate = {2023-10-01},
langid = {english},
}
@article{nazareno_ThereNoRule_2021,
title = {There {{Is No}} `{{Rule}} of {{Thumb}}': {{Genomic Filter Settings}} for a {{Small Plant Population}} to {{Obtain Unbiased Gene Flow Estimates}}},
shorttitle = {There {{Is No}} `{{Rule}} of {{Thumb}}'},
author = {Nazareno, Alison G. and Knowles, L. Lacey},
year = {2021},
journal = {Frontiers in Plant Science},
volume = {12},
issn = {1664-462X},
url = {https://www.frontiersin.org/articles/10.3389/fpls.2021.677009},
urldate = {2023-11-01},
abstract = {The application of high-density polymorphic single-nucleotide polymorphisms (SNP) markers derived from high-throughput sequencing methods has heralded plenty of biological questions about the linkages of processes operating at micro- and macroevolutionary scales. However, the effects of SNP filtering practices on population genetic inference have received much less attention. By performing sensitivity analyses, we empirically investigated how decisions about the percentage of missing data (MD) and the minor allele frequency (MAF) set in bioinformatic processing of genomic data affect direct (i.e., parentage analysis) and indirect (i.e., fine-scale spatial genetic structure \textendash{} SGS) gene flow estimates. We focus specifically on these manifestations in small plant populations, and particularly, in the rare tropical plant species Dinizia jueirana-facao, where assumptions implicit to analytical procedures for accurate estimates of gene flow may not hold. Avoiding biases in dispersal estimates are essential given this species is facing extinction risks due to habitat loss, and so we also investigate the effects of forest fragmentation on the accuracy of dispersal estimates under different filtering criteria by testing for recent decrease in the scale of gene flow. Our sensitivity analyses demonstrate that gene flow estimates are robust to different setting of MAF (0.05\textendash 0.35) and MD (0\textendash 20\%). Comparing the direct and indirect estimates of dispersal, we find that contemporary estimates of gene dispersal distance ({$\sigma$}rt = 41.8 m) was {$\sim$} fourfold smaller than the historical estimates, supporting the hypothesis of a temporal shift in the scale of gene flow in D. jueirana-facao, which is consistent with predictions based on recent, dramatic forest fragmentation process. 
While we identified settings for filtering genomic data to avoid biases in gene flow estimates, we stress that there is no `rule of thumb' for bioinformatic filtering and that relying on default program settings is not advisable. Instead, we suggest that the approach implemented here be applied independently in each separate empirical study to confirm appropriate settings to obtain unbiased population genetics estimates.},
}
@book{nei_MolecularEvolutionPhylogenetics_2000,
title = {Molecular {{Evolution}} and {{Phylogenetics}}},
author = {Nei, Masatoshi and Kumar, Sudhir},
@@ -1498,6 +1510,7 @@ @book{nei_MolecularEvolutionPhylogenetics_2000
abstract = {This book presents the statistical methods that are useful in the study of molecular evolution and illustrates how to use them in actual data analysis. Molecular evolution has been developing at a great pace over the past decade or so, driven by the huge increase in genetic sequence data from many organisms, the improvement of high-speed microcomputers, and the development of several new methods for phylogenetic analysis. This book for graduate students and researchers, assuming a basic knowledge of evolution, molecular biology, and elementary statistics, should make it possible for many investigators to incorporate refined statistical analysis of large-scale data in their own work. Nei is one of the leading workers in this area. He and Kumar have developed a computer program called MEGA, which has been sold for about \$20 to over 1900 users. For the book, the authors are thoroughly revising MEGA and will make it available via FTP. The book also included analysis using the other most popular programs for phylogenetic studies, including PAUP, PHYLIP, MOLPHY, and PAML.},
isbn = {978-0-19-513585-5},
}

@article{nielsen_GenotypeSNPCalling_2011,
title = {Genotype and {{SNP}} Calling from Next-Generation Sequencing Data},
author = {Nielsen, Rasmus and Paul, Joshua S. and Albrechtsen, Anders and Song, Yun S.},
@@ -1515,7 +1528,6 @@ @article{nielsen_GenotypeSNPCalling_2011
langid = {english},
keywords = {Bioinformatics,Genomics,Next-generation sequencing,Population genetics,Technology},
}

@article{nielsen_MolecularSignaturesNatural_2005,
title = {Molecular {{Signatures}} of {{Natural Selection}}},
author = {Nielsen, Rasmus},
1 change: 0 additions & 1 deletion docs/exercises/demography/index.qmd
@@ -2,7 +2,6 @@
title: "Demographic inference"
author:
- "André Soares"
- "Per Unneberg"
format: html
---

11 changes: 10 additions & 1 deletion docs/exercises/genetic_diversity/index.qmd
@@ -34,6 +34,15 @@ compile statistics along a sequence. By scanning variation in windows
along the sequence (a.k.a. genomic scan) we can identify outlier
regions whose pattern of variation could potentially be attributed to
causes other than neutral processes, such as adaptation or migration.
We will use the Monkeyflower system to generate a diversity landscape.
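
As a minimal sketch of what a genomic scan does, the Python below bins per-site values into non-overlapping windows and averages them per window. The function name `windowed_mean`, the 1 kb window size, and the toy values are illustrative assumptions and are not taken from the exercise.

```python
from collections import defaultdict


def windowed_mean(positions, values, window_size):
    """Average per-site values in non-overlapping windows.

    positions: 1-based site coordinates
    values: one statistic per site (e.g. a per-site diversity value)
    Returns {window_start: mean_value} for windows that contain sites.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, val in zip(positions, values):
        # 1-based start coordinate of the window containing this site
        start = (pos - 1) // window_size * window_size + 1
        sums[start] += val
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}


# Toy data (hypothetical): six sites scanned in 1 kb windows
positions = [100, 450, 900, 1200, 1800, 2500]
values = [0.02, 0.04, 0.03, 0.10, 0.30, 0.01]
scan = windowed_mean(positions, values, window_size=1000)
for start, mean in scan.items():
    print(start, round(mean, 3))
```

The mean here is taken over the listed sites only; a per-base-pair estimate would also need the number of callable sites in each window. A window whose value stands out from the rest, such as the second one in the toy data, is the kind of outlier a scan is meant to flag.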

::: {.callout-important}

The commands in this document have been run on a subset (a subregion)
of the data. Therefore, although you will use the same commands, your
results will differ from those presented here.

:::

::: {.callout-tip collapse=true}

@@ -91,7 +100,7 @@ Execute the following command to load modules:
module load uppmax bioinfo-tools \
bcftools/1.17 \
BEDTools/2.29.2 \
vcftools/0.1.16
vcftools/0.1.16
```

`csvtk` has been added to the module system and can be loaded as
13 changes: 6 additions & 7 deletions docs/exercises/index.qmd
@@ -44,15 +44,14 @@ URL that hosts the actual exercise instructions.
## On self-assessment exercise blocks

Scattered throughout the documents are exercise blocks, with hidden
answers, and in some cases, hints. The exercises are for
answers, and, in some cases, hints. The exercises are for
self-assessment of your understanding, but they are not mandatory.

Some of the exercises (labelled with the linux penguin {{< fa
brands linux >}}) are related to the usage of the command line
interfaces (CLI), and how to obtain information about what a program
does. This is an essential skill when working in Linux/UNIX
environments! These exercises can be skipped if you are an experienced
Linux/UNIX user.
Some of the exercises (labelled with the Linux penguin {{< fa brands
linux >}}) are related to the usage of the command line interfaces
(CLI), and how to obtain information about what a program does. This
is an essential skill when working in Linux/UNIX environments! These
exercises can be skipped if you are an experienced Linux/UNIX user.

An example exercise is provided here:

1 change: 0 additions & 1 deletion docs/exercises/population_structure/index.qmd
@@ -2,7 +2,6 @@
title: "Population structure"
author:
- "Nikolay Oskolkov"
- "Per Unneberg"
format: html
---

1 change: 0 additions & 1 deletion docs/exercises/selection/index.qmd
@@ -2,7 +2,6 @@
title: Selection
author:
- "Jason Hill"
- "Per Unneberg"
format: html
---

17 changes: 17 additions & 0 deletions docs/exercises/variant_filtering/index.qmd
@@ -31,6 +31,14 @@ sites only, followed by an approach that filters on sequencing depth
in a variant file containing both variant and invariant sites. The
latter methodology can then be generalized to generate depth-based filters from BAM files.

::: {.callout-important}

The commands in this document have been run on a subset (a subregion)
of the data. Therefore, although you will use the same commands, your
results will differ from those presented here.

:::

::: {.callout-tip collapse=true}

## Learning objectives
@@ -637,6 +645,15 @@ deficit of heterozygotes (and consequently, positive F) simply due to
something called the [Wahlund
effect](https://en.wikipedia.org/wiki/Wahlund_effect).

::: {.callout-warning}

The inbreeding coefficient is a population-level statistic and is not
reliable for small sample sizes ($n<10$, say). Our sample size is in
the lower range, so the results should be taken with a grain of
salt.

:::
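
To make the Wahlund effect concrete, here is a small worked example in Python. It assumes two equally sized subpopulations that are each in Hardy-Weinberg equilibrium at a biallelic site; the allele frequencies (0.1 and 0.9) and the function names are hypothetical and chosen only for illustration.

```python
def hw_heterozygosity(p):
    """Expected heterozygote frequency 2pq at a biallelic site under HWE."""
    return 2 * p * (1 - p)


def pooled_f(allele_freqs):
    """F for a pooled sample of equally sized subpopulations, each in HWE."""
    # Observed heterozygosity: average of the within-subpopulation values
    h_obs = sum(hw_heterozygosity(p) for p in allele_freqs) / len(allele_freqs)
    # Expected heterozygosity: computed from the pooled allele frequency
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_exp = hw_heterozygosity(p_bar)
    return 1 - h_obs / h_exp


f = pooled_f([0.1, 0.9])
print(round(f, 2))
```

Each subpopulation has heterozygosity 0.18, whereas the pooled allele frequency of 0.5 predicts 0.5, so the pooled sample gives F = 1 - 0.18/0.5 = 0.64 even though neither subpopulation is inbred.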

::: {.callout-exercise}

Use `bcftools view -s SAMPLENAMES | vcftools --vcf - --het --stdout` to
1 change: 0 additions & 1 deletion docs/slides/demography/index.qmd
@@ -2,7 +2,6 @@
title: "Demographic inference"
author:
- "Andre Soares"
- "Per Unneberg"
format: html
---

1 change: 0 additions & 1 deletion docs/slides/selection/index.qmd
@@ -2,7 +2,6 @@
title: "Selection"
author:
- "Jason Hill"
- "Per Unneberg"
format: html
---

103 changes: 101 additions & 2 deletions docs/slides/variant_filtering/index.qmd
@@ -223,6 +223,104 @@ Source: [@lou_BeginnerGuideLowcoverage_2021]
:::

## Guidelines? What guidelines? {.smallr}

#### GATK hard filters

> However, because we want to help, we have formulated some generic
> recommendations that should at least provide a starting point for
> people to experiment with their data.
::: {.fragment}

:::: {.columns}

::: {.column width="50%"}

##### SNPs

```
QualByDepth (QD) < 2.0
RMSMappingQuality (MQ) < 40.0
FisherStrand (FS) > 60.0
StrandOddsRatio (SOR) > 3.0
MappingQualityRankSumTest (MQRankSum) < -12.5
ReadPosRankSumTest (ReadPosRankSum) < -8.0
```

:::

::: {.column width="50%"}

##### Indels

```
QualByDepth (QD) < 2.0
ReadPosRankSumTest (ReadPosRankSum) < -20.0
InbreedingCoeff < -0.8
FisherStrand (FS) > 200.0
StrandOddsRatio (SOR) > 10.0
```

:::

::::

:::

::: {.fragment}

> That said, you ABSOLUTELY SHOULD NOT expect to run these commands
> and be done with your analyses.
:::

::: {.flushright .smallest .translatey50}

<https://gatk.broadinstitute.org/hc/en-us/articles/360037499012>

:::

::: {.fragment}

#### On RAD-seq filtering

> ... the effects of SNP filtering practices on population genetic
> inference have received much less attention
::: {.flushright .smallest .translatey50 }

There Is No ‘Rule of Thumb’: Genomic Filter Settings for a Small Plant
Population to Obtain Unbiased Gene Flow Estimates
[@nazareno_ThereNoRule_2021]

:::

:::

::: {.notes}

General guidelines on manual filters are not discussed much in the
literature, simply because there is no single rule of thumb. Every
problem requires its own settings, as the GATK developers maintain.

GATK guidelines explained (see <https://gatk.broadinstitute.org/hc/en-us/articles/360035890471>):

- QualByDepth (QD): variant confidence (QUAL) divided by unfiltered depth
- FisherStrand (FS): checks for strand bias (i.e., if minor allele
occurs more often on one strand)
- StrandOddsRatio (SOR): alternative strand bias test
- RMSMappingQuality (MQ): root mean square mapping quality over all
reads
- MappingQualityRankSumTest (MQRankSum): compares mapping qualities of
ref and alt alleles
- ReadPosRankSumTest (ReadPosRankSum): compares the positions of ref
and alt alleles within reads
- InbreedingCoeff: population-level statistic that requires at least
10 individuals

:::
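
The SNP thresholds above can be read as a list of independent tests, where a variant is flagged if any one comparison holds. The Python below is only a sketch of that logic, not how GATK implements it: the names `SNP_HARD_FILTERS` and `failed_filters`, the choice to skip missing annotations, and the two example variants are all assumptions made for illustration.

```python
import operator

# Thresholds copied from the slide; a variant FAILS if the comparison holds.
SNP_HARD_FILTERS = {
    "QD": (operator.lt, 2.0),
    "MQ": (operator.lt, 40.0),
    "FS": (operator.gt, 60.0),
    "SOR": (operator.gt, 3.0),
    "MQRankSum": (operator.lt, -12.5),
    "ReadPosRankSum": (operator.lt, -8.0),
}


def failed_filters(annotations, filters=SNP_HARD_FILTERS):
    """Return the names of the hard filters that a variant fails.

    annotations: dict of INFO annotation name -> value for one variant.
    Annotations missing from the dict are skipped (an assumption of this sketch).
    """
    return [
        name
        for name, (compare, threshold) in filters.items()
        if name in annotations and compare(annotations[name], threshold)
    ]


# Hypothetical variants for illustration
good = {"QD": 25.1, "MQ": 60.0, "FS": 1.2, "SOR": 0.8,
        "MQRankSum": 0.0, "ReadPosRankSum": 0.3}
bad = {"QD": 1.5, "MQ": 35.0, "FS": 75.0, "SOR": 0.9}
print(failed_filters(good))
print(failed_filters(bad))
```

Keeping the thresholds in a single table makes it easy to try other values on your own data, which is the point of the quotes above: the defaults are a starting point rather than a finished filter.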

## What about machine learning?

:::: {.columns}
@@ -254,7 +352,7 @@ dimensions

::: {.fragment}

Database of known variants often *not* known for non-model organisms.
Caveat: database of known variants often *not* available for non-model organisms.

:::

@@ -448,7 +546,8 @@ ggplot(data, aes(x=F, y=INDV)) + geom_point(size=3) +
```

- F=0: Hardy-Weinberg Equilibrium
- F>0: deficit of heterozygotes; inbreedeng, allele dropout
- F>0: deficit of heterozygotes; inbreeding, Wahlund
effect (population substructure), allele dropout
- F<0: surplus of heterozygotes; could be sample contamination, poor
sequence quality (mismapping)
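
For reference, the sign of F follows directly from the method-of-moments estimator F = (O - E) / (N - E), where O and E are the observed and expected numbers of homozygous sites and N is the number of sites. To my understanding this is what `vcftools --het` reports per individual. The sketch below assumes that formula; the function name and counts are hypothetical.

```python
def inbreeding_f(obs_hom, exp_hom, n_sites):
    """Method-of-moments inbreeding coefficient F = (O - E) / (N - E)."""
    if n_sites == exp_hom:
        raise ValueError("F is undefined when every site is expected to be homozygous")
    return (obs_hom - exp_hom) / (n_sites - exp_hom)


# Hypothetical counts for three individuals, each typed at 1000 sites
# with 700 homozygous sites expected under Hardy-Weinberg equilibrium
f_hwe = inbreeding_f(700, 700, 1000)      # as many homozygotes as expected
f_deficit = inbreeding_f(850, 700, 1000)  # deficit of heterozygotes
f_surplus = inbreeding_f(600, 700, 1000)  # surplus of heterozygotes
print(f_hwe, f_deficit, round(f_surplus, 3))
```

More homozygous sites than expected means fewer heterozygotes and a positive F; fewer than expected gives a negative F.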
