Releases: cov-lineages/pangolin

pangolin v2.0.5

07 Aug 14:08

Release notes

  • Memory issues resolved (thanks @ArtPoon!)
  • prep package removed from repo, now hosted at cov-support
  • shebang fix

pangolin v2.0.4

23 Jul 10:13

Release notes

Minor fix

  • Exit with code 0 rather than -1 if no sequences pass QC

pangolin v2.0.3

23 Jul 09:17

Release notes

Minor bug fixes:

  • If no queries pass QC, a file stating the reason is now produced
  • specific file name for metadata file in command.py checks

pangolin v2.0.2

22 Jul 21:48

Release notes

Minor patches:

  • If the input file contains no sequences that meet the QC thresholds, exit with an informative error
  • Rework logic in command.py for when datadir is given explicitly rather than looking in pkg resources by default (thanks @ Anthony Underwood!)

pangolin v2.0.1

22 Jul 16:22
0668c25

Patch for a case typo in the check for whether the snakefile is found; the typo was not caught on OSX systems.

pangolin v2.0

22 Jul 11:26
7ebb73b

Release notes pangolin 2.0

This release of pangolin comes with some major changes, including a significant speed-up and improved assignment accuracy for larger lineages. The new assignment algorithm, which we have termed pangoLEARN, is described in detail below. One significant benefit over the previous algorithm is that it allows us to incorporate all of the diversity of the large lineages into the assignment system rather than just a select few representatives. It should also improve how we handle homoplasies in the phylogeny, as these sites would likely not be informative. We have pulled out the informative sites, and this information is included in the pangoLEARN data release: those files detail the top SNPs that are most positively and negatively associated with a given lineage.

Practical information for the user includes the following:

  • Data is now pulled from cov-lineages/pangoLEARN rather than cov-lineages/lineages. This is accounted for in the conda environment.yml file, but those not using conda will need to pip install this data. Other new dependencies include minimap2 and datafunk (also pip installable via git+https://github.com/cov-ert/datafunk.git).

  • The previous algorithm is still accessible using the --legacy flag, but for the most recent data release information we encourage you to use pangolin 2.0.

  • Use of pangolin remains the same: pangolin <your-query-fasta>

  • The output csv now has only a single support column (assignment probability) rather than the previous UFbootstrap and aLRT values. The original format is output if using --legacy.

  • Our intentions going forward are to phase out the legacy algorithm as it was struggling to scale with the increase in lineage number and sequences but it is still available in the current release of pangolin.

  • pangoLEARN contains information about the top SNPs that are most positively and negatively associated with a given lineage. The lineage recall report is also available in this repository.
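
For those calling pangolin from a script, the invocation described above could be wrapped along these lines. This is only a sketch: the helper names build_pangolin_command and run_pangolin are hypothetical, and the only command-line details used are the ones given in these notes (the query fasta argument and the --legacy flag).

```python
import subprocess


def build_pangolin_command(query_fasta, legacy=False):
    """Return the argument list for a pangolin run (hypothetical helper)."""
    command = ["pangolin", query_fasta]
    if legacy:
        # --legacy selects the previous algorithm and the original output format
        command.append("--legacy")
    return command


def run_pangolin(query_fasta, legacy=False):
    """Run pangolin and return its exit code; assumes pangolin is on the PATH."""
    return subprocess.run(build_pangolin_command(query_fasta, legacy)).returncode
```

Calling build_pangolin_command("query.fasta", legacy=True) gives the argument list for a legacy run without executing anything, which is handy for checking the command before launching it.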

pangoLEARN details

pangoLEARN is an alternative algorithm for lineage assignment, implemented as of pangolin 2.0. This new algorithm, which relies on machine learning, offers much faster lineage assignment, as the phylogenetic approach was struggling to scale with the increase in the number of lineages needing to be represented in the guide tree. This new approach also takes into account all of the diversity present within a lineage rather than just selecting a representative few. As a consequence, for large lineages we have improved our recall and precision significantly. We are continuing to develop more sophisticated approaches to machine learning for lineage assignment, which we hope will offer further improvements in both speed and accuracy.

The current version of pangoLEARN uses multinomial logistic regression, but the pipeline has been written so that as more complex models are developed, the user will be able to choose which model to use to assign their lineages.

While in standard regression a line of best fit is found for a set of training data, which represents a linear relationship between variables of interest, a logistic regression fits a sigmoid function to the training data, in order to tell two different classes apart. A multinomial logistic regression is an extension of a standard logistic regression in that it can be used to classify more than two classes. Each potential assignment (i.e. lineage) is modeled as a set of n-1 independent binary choices (sigmoid functions), where n is the number of classes.
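
As a rough illustration of the n-1 formulation described above (not the pangoLEARN code itself), the sketch below scores n-1 classes against a reference class whose score is fixed at zero and then normalises the scores into probabilities. The function name, the weights and the toy inputs are all made up for the example.

```python
import math


def multinomial_probabilities(features, weights, biases):
    """Return n class probabilities from n-1 weight vectors.

    The last class is the reference class, with its score fixed at 0.
    """
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    scores.append(0.0)  # reference class
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Three classes, so two weight vectors; the first class is favoured here.
probabilities = multinomial_probabilities(
    [1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
)
```

With n = 2 this reduces to the ordinary sigmoid of a standard logistic regression.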

The model was trained using 30,000 SARS-CoV-2 sequences from GISAID (acknowledgements here), with their lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pangolin. Each base of each genome was one-hot encoded. This left us with a large number of parameters to train, which is why training this model takes approximately 14 hours on our hardware (this may differ on other hardware). This model was built using the standard scikit-learn implementation of multinomial logistic regression. The code for this process is available in the cov-lineages/cov-support repository.
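
To give a sense of what one-hot encoding each base looks like, here is a minimal sketch. It is not the training code from cov-support: the four-letter alphabet and the choice to encode any other character (N, gaps, ambiguity codes) as all zeros are assumptions made for the example.

```python
BASES = "ACGT"  # assumed alphabet; the real encoding may use more symbols


def one_hot_encode(sequence, alphabet=BASES):
    """Flatten a sequence into a 0/1 vector with len(alphabet) entries per base."""
    encoded = []
    for base in sequence.upper():
        # Characters outside the alphabet become all zeros (an assumption)
        encoded.extend(1 if base == symbol else 0 for symbol in alphabet)
    return encoded
```

Under this scheme every genome position contributes four features, which is where the large number of parameters mentioned above comes from.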

Multinomial logistic regression is a very commonly used model, as it assigns probabilities to class assignments simply and intuitively. However, it does not incorporate any hierarchical structure, and we are currently developing new models that do. Despite the limitations of this simple model, it has performed surprisingly well with this data. While more complex models may offer improvements in assignment accuracy for smaller lineages, logistic regression has the advantages of being intuitive, easy to implement, and relatively fast to train.

Contributions

Emily Scher and Áine O'Toole have worked together to develop pangolin 2.0

pangolin v1.1.14

23 May 12:47
a2b0a69

Updates in this release:

  • By default, pangolin now only uses the safe lineage versions in the lineages release. This translates to only using lineages with a recall rate > 95%. With the -p or --include-putative flag, all lineages are included, including those with potentially less certainty. We believe this will be a useful feature. Putative lineages are indicated with a p before their designation, e.g. B.1.1.p15 lies with certainty within the lineage B.1.1. Putative lineages fit the criteria required for lineage designation but, potentially due to homoplasies, sequencing errors or the resolution of the global tree (>27,000 tips now), do not have recall values suitable for lineage assignment. The rationale is that if more data continues to support these lineages, the p will be removed and they will become part of the default lineage groups.
  • Developments towards SNP based classification, not integrated into the master pipeline though.
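
For anyone post-processing results that include putative lineages, the naming convention above could be handled roughly as follows. This is a sketch inferred from the single B.1.1.p15 example; the helper names are hypothetical and the exact naming rules may differ.

```python
def is_putative(lineage):
    """Return True if the final part of the name is 'p' followed by digits."""
    last = lineage.split(".")[-1]
    return last.startswith("p") and last[1:].isdigit()


def parent_lineage(lineage):
    """Return the name with its final part removed, or None for a top-level name."""
    parts = lineage.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else None
```

For B.1.1.p15, is_putative returns True and parent_lineage returns B.1.1, the lineage it lies within.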

pangolin v1.1.13

14 May 15:46

Updates in this release:

  • prep pipeline lowering representative minimum to 3 instead of 5.
  • bug fix for lineages with long names.

pangolin v1.1.12

12 May 14:43

Updates in this release

  • All paths now specified by os.path.join().
  • Now compatible with Windows paths.
  • Fixed bug in pulling location from taxon name
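
As a small illustration of why os.path.join() helps with Windows compatibility, the sketch below builds a path from its parts instead of hard-coding a separator. The helper name and the directory and file names are hypothetical.

```python
import os


def data_file_path(data_dir, *parts):
    """Join path components using the separator of the current platform."""
    return os.path.join(data_dir, *parts)


# Hypothetical file layout, for illustration only
tree_path = data_file_path("data", "lineages", "tree.nwk")
```

The same call yields forward slashes on Linux and OSX and backslashes on Windows.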

pangolin v1.1.11

08 May 13:24

Updates:

  • All paths now created using os.path.join()