Releases: cov-lineages/pangolin
pangolin v2.0.5
Release notes
- Memory issues resolved (thanks @ArtPoon!)
- prep package removed from repo, now hosted at cov-support
- shebang fix
pangolin v2.0.4
Release notes
Minor fix
- Exit with 0 code rather than -1 code if no sequences pass qc
pangolin v2.0.3
Release notes
Minor bug fixes:
- if no queries pass qc, a file stating the reason is now produced
- specific file name for metadata file in command.py checks
pangolin v2.0.2
Release notes
minor patches
- if a file with no sequences that meet the qc thresholds is input, exit with informative error
- rework logic in command.py for when datadir is explicitly given rather than looking in pkg resources by default (thanks Anthony Underwood!)
pangolin v2.0.1
Patch of case typo when checking if snakefile is found, not caught on OSX system.
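A case typo in a file name can go unnoticed on a case-insensitive file system, because an existence check succeeds there whatever the capitalisation. The following is a minimal sketch of an exact-case check, not pangolin's own code; the function name and the file name Snakefile used in the usage note are assumptions for illustration.

```python
import os


def exists_with_exact_case(directory, filename):
    """Return True only if `directory` holds an entry spelled exactly `filename`.

    os.listdir reports names as they are stored on disk, so the comparison is
    case-sensitive even when the file system itself is not.
    """
    return filename in os.listdir(directory)
```

For a directory that contains a file named Snakefile, exists_with_exact_case(directory, "snakefile") returns False on every platform, which exposes the typo during development rather than after release.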
pangolin v2.0
Release notes pangolin 2.0
This release of pangolin comes with some major changes, including a significant speed-up and improvements in assignment accuracy for larger lineages. The new assignment algorithm (that we have termed pangoLEARN) is described in detail below. One significant benefit of this approach over the previous algorithm is that it allows us to incorporate all of the diversity of the large lineages into the assignment system rather than just picking a select few. This approach will also improve our approach to homoplasies in the phylogeny as these sites would likely not be informative. We have pulled out informative sites and this information is included in the data release on pangoLEARN. The top SNPs that are most positively and negatively associated with a given lineage are detailed in those files.
Practical information for the user includes the following:
- Data is now being pulled from cov-lineages/pangoLEARN rather than cov-lineages/lineages. This is accounted for in the conda environment.yml file but for those not using conda, this data will need to be pip installed. Other new dependencies include minimap2 and datafunk (also pip installable via git+https://github.com/cov-ert/datafunk.git).
- The previous algorithm is still accessible using the --legacy flag, but for the most recent data release information we encourage you to use pangolin 2.0.
- Use of pangolin remains the same: pangolin <your-query-fasta>
- The output csv now only has a single support column (assignment probability) rather than the previous UFbootstrap and aLRT values. The original format is output if using --legacy.
- Our intentions going forward are to phase out the legacy algorithm as it was struggling to scale with the increase in lineage number and sequences but it is still available in the current release of pangolin.
- pangoLEARN contains information about the top SNPs that are most positively and negatively associated with a given lineage. The lineage recall report is also available in this repository.
pangoLEARN details
pangoLEARN is an alternative algorithm for lineage assignment, implemented as of pangolin 2.0. This new algorithm, which relies on machine learning, offers much faster lineage assignment, as the phylogenetic approach was struggling to scale with the increase in number of lineages needing to be represented in the guide tree. This new approach also takes into account all of the diversity present within a lineage rather than just selecting a representative few. The consequences of this approach mean that for large lineages, we have improved our recall and precision significantly. We are continuing to develop more sophisticated approaches to machine learning for lineage assignment, which we hope will offer even better improvements in both speed and accuracy.
The current version of pangoLEARN uses multinomial logistic regression, but the pipeline has been written so that as more complex models are developed, the user will be able to choose which model to use to assign their lineages.
While in standard regression a line of best fit is found for a set of training data, which represents a linear relationship between variables of interest, a logistic regression fits a sigmoid function to the training data, in order to tell two different classes apart. A multinomial logistic regression is an extension of a standard logistic regression in that it can be used to classify more than two classes. Each potential assignment (i.e. lineage) is modeled as a set of n-1 independent binary choices (sigmoid functions), where n is the number of classes.
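The relationship between the binary and multinomial cases can be sketched in a few lines. This is an illustration under stated assumptions, not the pangoLEARN code: one class is treated as a reference with a fixed score of 0, the remaining n-1 classes each receive a linear score, and the scores used in the usage note are made up. Library implementations may parameterise the model differently.

```python
import math


def sigmoid(z):
    """Logistic function used by a two-class logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def multinomial_probabilities(scores):
    """Turn n-1 linear scores into n class probabilities.

    `scores` holds one score per non-reference class; the reference class is
    appended with a score of 0. Subtracting the maximum before exponentiating
    keeps the computation numerically stable without changing the result.
    """
    all_scores = list(scores) + [0.0]
    largest = max(all_scores)
    exps = [math.exp(z - largest) for z in all_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With a single score the result reduces to the sigmoid: multinomial_probabilities([0.7])[0] equals sigmoid(0.7) up to floating-point rounding. With more scores the probabilities still sum to 1, and the class with the highest score receives the highest probability.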
The model was trained using 30,000 SARS-CoV-2 sequences from GISAID (acknowledgements here), with lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pangolin. Each base of each genome was one-hot encoded. This left us with a large number of parameters to train, which is why training this model takes approximately 14 hours on our hardware (this may differ on other hardware). This model was built using the standard scikit-learn implementation of multinomial logistic regression. The code for this process is available in the cov-lineages/cov-support repository.
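One-hot encoding of bases can be sketched as follows. This is a minimal illustration, not the code in cov-lineages/cov-support: the four-letter alphabet, the function name, and the choice to encode any other character (such as N or a gap) as all zeros are assumptions.

```python
BASES = "ACGT"  # assumed alphabet; the real pipeline may treat N and gaps differently


def one_hot_encode(sequence, alphabet=BASES):
    """Encode each base as len(alphabet) binary features, concatenated in order.

    A base that is not in `alphabet` becomes a run of zeros.
    """
    features = []
    for base in sequence.upper():
        features.extend(1 if base == symbol else 0 for symbol in alphabet)
    return features
```

Each position contributes four features under this scheme, so the feature count is four times the genome length, and the model fits a weight per feature per class. That is why the number of parameters grows quickly with both genome length and number of lineages.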
Multinomial logistic regression is a very commonly used model as it is able to simply and intuitively assign probabilities to class assignments. However, it does not incorporate any hierarchical structure, and we are currently developing new models that do. Despite the limitations of this simple model, it has performed surprisingly well with this data. While more complex models may offer improvements in assignment accuracies for smaller lineages, the logistic regression has the advantages of being intuitive, easy to implement, and relatively fast to train.
Contributions
Emily Scher and Áine O'Toole have worked together to develop pangolin 2.0
pangolin v1.1.14
Updates in this release:
- Now by default pangolin only uses the safe lineages versions in the lineages release. This translates to only using lineages that have a recall rate > 95%. With the -p or --include-putative flags, all lineages, including those with potentially less certainty, are included. We believe this will be a useful feature. Putative lineages are indicated with a p before their designation. E.g. B.1.1.p15 lies with certainty within the lineage B.1.1. Putative lineages fit the criteria required for lineage designation, but potentially due to homoplasies, sequencing errors or resolution of the global tree (>27,000 tips now), have not got recall values suitable for lineage assignment. The rationale is that if more data continues to support these lineages, the p will be removed and they will become part of the default lineage groups.
- Developments towards SNP based classification, not integrated into the master pipeline though.
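The putative naming convention described above can be recognised mechanically. The sketch below is not pangolin's implementation; the function names and the regular expression are assumptions that follow the B.1.1.p15 example, where a p precedes the final numeric part of the name.

```python
import re

# Assumed pattern: a parent lineage, then ".p" and one or more digits.
_PUTATIVE = re.compile(r"^(?P<parent>.+)\.p\d+$")


def is_putative(lineage):
    """Return True if the name follows the putative pattern, e.g. B.1.1.p15."""
    return _PUTATIVE.match(lineage) is not None


def parent_of_putative(lineage):
    """Return the lineage a putative lineage sits within, or None otherwise."""
    match = _PUTATIVE.match(lineage)
    return match.group("parent") if match else None


def filter_lineages(lineages, include_putative=False):
    """Mimic the default behaviour: drop putative lineages unless requested."""
    return [name for name in lineages if include_putative or not is_putative(name)]
```

Here parent_of_putative("B.1.1.p15") gives "B.1.1", and filter_lineages keeps putative names only when include_putative is set, mirroring the effect the -p flag is described as having.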
pangolin v1.1.13
Updates in this release:
- prep pipeline lowering representative minimum to 3 instead of 5.
- bug fix for lineages with long names.
pangolin v1.1.12
Updates in this release
- All paths now specified by os.path.join().
- Compatible with windows paths now.
- Fixed bug in pulling location from taxon name
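Building paths with os.path.join() is what makes them portable: the separator is chosen by the platform rather than hard-coded. The sketch below shows the effect by calling the POSIX and Windows flavours of the function directly; the directory and file names are hypothetical and are not taken from the pangolin source.

```python
import ntpath      # the implementation os.path uses on Windows
import os
import posixpath   # the implementation os.path uses on Linux and macOS


def data_file(data_dir, filename):
    """Join a directory and file name using the current platform's separator."""
    return os.path.join(data_dir, filename)


# The same components joined under each platform's rules.
posix_example = posixpath.join("pangolin", "data", "lineages.csv")
windows_example = ntpath.join("pangolin", "data", "lineages.csv")
```

The POSIX result uses forward slashes and the Windows result uses backslashes, while data_file produces whichever form matches the machine it runs on.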
pangolin v1.1.11
Updates:
- All paths now created using os.path.join()