Library for calculating codon and codon pair conservation rates across species of a given genus

SouthernBio/BioSeeker

BioSeeker: Python library for the analysis of codon/bicodon conservation rates across linked species

BioSeeker (c) 2023. All rights reserved. This project facilitates calculating codon and bicodon conservation rates for a given genus.

1. About the project

Open-source bioinformatics project licensed under GPL v3.0. The inspiration for this project came from this paper, which I've tried to (partially) replicate using Drosophila alignments from FlyDIVaS. Feel free to make as many additions as you'd like.

2. How does it work?

In this repo you will find a Python script called bioseeker.py. It takes as input one or more files of previously aligned homologous genes in FASTA format (multiple sequence alignments, or MSAs). After parsing the file(s), it builds a NumPy matrix and iterates over slices of that matrix. The results (the count of each codon in the reference sequence, and the number of times that codon was conserved across species) are stored in CSV files created with Pandas. For each MSA file, two types of dataframes are generated: one for codons and another for codon pairs. The algorithm calculates codon/bicodon conservation rates in all 3 reading frames, so 6 CSV files are generated in total (2 for each reading frame).
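The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in bioseeker.py. The function names `parse_fasta` and `codon_conservation` and the column names are hypothetical. Two further assumptions: a codon counts as conserved only when every sequence matches the reference (first) sequence at that position, and codon pairs are read as overlapping 6-nucleotide windows.

```python
from collections import Counter

import numpy as np
import pandas as pd


def parse_fasta(text):
    """Return the sequences (uppercase strings) found in FASTA-formatted text."""
    sequences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def codon_conservation(sequences, frame=0, width=3):
    """Count reference codons and how often each one is conserved.

    width=3 counts codons; width=6 counts codon pairs (bicodons).
    Assumes all sequences have the same length, as in an alignment.
    """
    # Rows are species, columns are alignment positions.
    matrix = np.array([list(seq) for seq in sequences])
    counts, conserved = Counter(), Counter()
    for start in range(frame, matrix.shape[1] - width + 1, 3):
        window = matrix[:, start:start + width]  # one matrix slice per codon
        reference = "".join(window[0])
        if "-" in reference:
            continue  # assumption: skip windows with gaps in the reference
        counts[reference] += 1
        if (window == window[0]).all():
            conserved[reference] += 1
    rows = [
        (codon, counts[codon], conserved[codon], conserved[codon] / counts[codon])
        for codon in sorted(counts)
    ]
    return pd.DataFrame(rows, columns=["codon", "count", "conserved", "rate"])


if __name__ == "__main__":
    example = ">species_a\nATGAAATTT\n>species_b\nATGAAGTTT\n"
    print(codon_conservation(parse_fasta(example), frame=0))
```

Calling `codon_conservation` with `frame` set to 0, 1 and 2 and `width` set to 3 and 6 would give six tables per alignment, which matches the six CSV files described above.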

3. Installing and running the program

Start by cloning the repository:

$ git clone https://github.com/SouthernBio/BioSeeker

Copy the MSA FASTA files into the directory where bioseeker.py is located.

Make sure that your Python interpreter is added to PATH. Then, you can activate the virtual environment and run BioSeeker.

Windows PowerShell:

$ pipenv shell
$ ./bioseeker # or 'py -m bioseeker'

GNU/Linux:

$ pipenv shell
$ bioseeker # or 'python3 -m bioseeker'

If you want to test how the program works before using it on your own data, you will find sample alignment files in FASTA_files/.

4. Dependencies

To execute this script you must install Python and its package manager, pip, as well as Git if you want to clone the repo with it. On Ubuntu, you can install them from the terminal:

$ sudo apt-get update
$ sudo apt-get install python3 python3-pip git-all

Once you have installed Python and its package manager, you can proceed to install pipenv:

$ pip3 install pipenv

5. Additional details

After parsing the files and calculating conservation rates, BioSeeker also generates a file called unreadable.txt, which stores the names of the MSA files that could not be parsed. It then assembles the individual dataframes into 6 combined dataframes that contain the information across all linked species. BioSeeker automatically creates a new directory called dataframes/ to hold these combined dataframes.
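The assembly step could look something like the sketch below. It assumes that each per-file table has the hypothetical columns `codon`, `count` and `conserved`, and that combining means summing the counts per codon. The function name `merge_counts` is illustrative and is not taken from bioseeker.py.

```python
import pandas as pd


def merge_counts(frames):
    """Sum per-file codon counts into one table and recompute the rate."""
    combined = pd.concat(frames, ignore_index=True)
    # Add up the counts of every codon across all input tables.
    totals = combined.groupby("codon", as_index=False)[["count", "conserved"]].sum()
    totals["rate"] = totals["conserved"] / totals["count"]
    return totals


if __name__ == "__main__":
    file_a = pd.DataFrame({"codon": ["ATG", "AAA"], "count": [2, 1], "conserved": [1, 0]})
    file_b = pd.DataFrame({"codon": ["ATG"], "count": [2], "conserved": [2]})
    print(merge_counts([file_a, file_b]))
```

Recomputing the rate from the summed counts, rather than averaging per-file rates, weights every codon occurrence equally regardless of which alignment it came from.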

💙 Support this project

Your contribution would help SouthernBio improve the quality of this project and add new features. If you find this project useful or interesting, please consider offering your support on GitHub Sponsors, Ko-Fi or PayPal.
