##Taming Pathogens
Pathogens have played a crucial role in human history. The most iconic example is the Black Death of the 14th century, which is thought to have been caused by the bacterium Yersinia pestis. A more recent example is AIDS, caused by the human immunodeficiency virus (HIV). Understanding the fundamentals of host-pathogen interactions has therefore been a central problem in epidemiology, and in the genomic age it requires tools from quantitative fields to make sense of the large amount of sequencing data being generated.
A virus-bacteria interaction provides the simplest host-pathogen pair and is crucial to the functioning of ecosystems ranging from the human gut to the oceans. Interestingly, an “adaptive” immune mechanism that bacteria use against the viruses infecting them (usually referred to as phages), called CRISPR-Cas, was discovered just a few years ago. Much like anti-virus software and intrusion detection systems that rely on detecting patterns found in malicious code, bacteria keep a dynamic library of small pieces of phage genomes (spacers) to detect and neutralize phage attacks.
The basic problem in understanding how this immune system works is to understand the pattern of spacers on phage genomes: how many there are per phage genome, where they sit on the genome, whether spacer-containing regions are more or less dynamic than the rest of the phage genome, and so on. Since we have a large number of sequenced phages and a library of spacers from a variety of bacteria, ranging from deadly human pathogens such as the bacterium that causes tuberculosis to bacteria that live in our guts, we can attempt to aggregate this information to develop a more “complete” understanding of phage-bacteria interactions.
##Data Challenge
Happily, much of the existing data needed to understand bacteria-phage interactions has been released openly to the public and is available over the web. The current challenge is to help extract the relevant parts of that huge database and to automate the production of targeted datasets for these studies. More details are in the issue tracker!
##Relevant Literature
CRISPR-Cas Systems: Prokaryotes Upgrade to Adaptive Immunity: a very good review paper on the CRISPR-Cas system, the biological backdrop of this project.
##Installation
This package depends on Biopython. To install the dependencies, run:

    sudo make install-deps

##Usage
phageParser uses Make to build the project and to run its pipeline. Make is available for most operating systems, and you can learn more about it by reading the GNU Make manual and the O'Reilly GNU Make book.
- To get a phage dataset, take a FASTA-formatted list of genes (example in `data/velvet-distinct-spacers.fasta`) and upload it to http://phagesdb.org/blast/. An example result is in `data/blast-phagesdb.txt`.
- To clean up the results returned from phagesdb.org, call the Make target `filter_by_expect`:

      make filter_by_expect infile=data/blast-phagesdb.txt output=output/ threshold=0.21

  The result will be written to a file in `output/` as a CSV with one header row and the columns

      Query, Name, Length, Score, Expect, QueryStart, QueryEnd, SubjectStart, SubjectEnd

  (see #1 for discussion and details).
- To query NCBI for full genomes, run

      cat accessionNumber.txt | python acc2gb.py [email protected] > NCBIresults.txt

  where `accessionNumber.txt` contains a list of accession numbers of interest. Results will be written to `NCBIresults.txt`; see #2 for ongoing development here.
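
As a rough illustration of what the `filter_by_expect` step does, the sketch below keeps only rows whose `Expect` value is at or below a threshold. It assumes the hits have already been parsed into rows with the CSV columns listed above. The function names `filter_rows_by_expect` and `rows_to_csv` and the sample rows are hypothetical; the real logic lives in the project's own scripts and may differ (for example, in whether the comparison is strict).

```python
import csv
import io

# Column names taken from the CSV format described above.
HEADER = ["Query", "Name", "Length", "Score", "Expect",
          "QueryStart", "QueryEnd", "SubjectStart", "SubjectEnd"]


def filter_rows_by_expect(rows, threshold):
    """Keep rows (dicts keyed by HEADER) whose Expect value is <= threshold."""
    return [row for row in rows if float(row["Expect"]) <= threshold]


def rows_to_csv(rows):
    """Serialize rows to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=HEADER, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up example hits, for illustration only.
sample_rows = [
    dict(zip(HEADER, ["1", "PhageA", "50000", "40.1", "0.002", "1", "30", "100", "129"])),
    dict(zip(HEADER, ["2", "PhageB", "60000", "22.3", "1.5", "3", "20", "500", "517"])),
]

kept = filter_rows_by_expect(sample_rows, threshold=0.21)
print(rows_to_csv(kept))
```

Parsing `Expect` with `float` also handles values in scientific notation such as `2e-05`, which BLAST commonly reports.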
##Alternative usage for code sprint materials
All of the following assumes you are using the reference CRISPR database set of spacers (file `spacerdatabase.txt`). Things should work with other spacer files; however, several things are hard-coded and might break. For example, `filterByExpect.py` assumes the header line for each spacer is a number, and `bac_name` is hardcoded in `interactions.py` as the 8th to 16th characters of the file name.
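
To make these hard-coded assumptions concrete, here is a small sketch of the two conventions. It assumes "8th to 16th characters" is counted from 1 and is inclusive, which corresponds to the Python slice `[7:16]`. The helper names and the example file name are hypothetical and are not taken from `interactions.py` or `filterByExpect.py`.

```python
def bac_name_from_filename(filename):
    """Return characters 8 through 16 (1-based, inclusive) of a file name."""
    return filename[7:16]


def header_is_numeric(header_line):
    """Check whether a FASTA header line such as '>12' is just a number."""
    return header_line.lstrip(">").strip().isdigit()


# Hypothetical file name: a 7-character prefix followed by a 9-character accession.
example = "spacers" + "NC_000962" + ".fasta"
print(bac_name_from_filename(example))
print(header_is_numeric(">12"))
print(header_is_numeric(">NC_000962_1"))
```

A spacer file whose name is shorter than 16 characters, or whose headers are not plain numbers, would need these assumptions adjusted.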
- To get individual spacer files for each bacterial species in the reference set, run `CRISPR_db_parser` with the input file `spacerdatabase.txt` (downloaded from the Utilities page of CRISPRdb). Put all the output files in a folder `/spacers` under `data`.
- Make folders `data/phages` and `/output`. The current files in `data/spacers` and `data/phages` are examples.
- Blasting of spacer-containing files against the phage database can be done locally (handy if you have many files to blast). Download a local version of BLAST (blast+) here and find and follow the instructions for your OS. (We used these instructions for Windows successfully.) Put the file `Mycobacteriophages-All.fasta` (in the data folder) into the main blast+ directory and use the command `makeblastdb -in "Mycobacteriophages-All.fasta" -dbtype nucl -title PhageDatabase -out phagedb` to create a blast-ready database. It's possible to combine multiple databases (e.g. the ENA phage database or NCBI virus database) into one blastable database by including more than one filename between the quotes in the `-in` argument. Now you should be able to run the script `BLAST_loop.py`, but make sure the directory names are correct; `BLAST_loop.py` will probably need to be run from inside wherever you installed blast+.
- Run `filterByExpectPhages.py`, which essentially runs `filterByExpect.py` on all files in the `/phages` folder. The results will be saved to `/output`.
- Make a directory called `sorted` under `output`. Run `orderByExpect.py`, which rearranges the results of `filterByExpectPhages.py` in each file in order of lowest to highest expect value.
- Run `interactions.py`, which makes a JSON file `json.txt` for visualization in cytoscape.js.
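
The local BLAST step could be scripted roughly as follows. This is a sketch, not the contents of `BLAST_loop.py`: it assumes the BLAST+ programs `makeblastdb` and `blastn` are on the `PATH`, that spacer files end in `.fasta`, and that results go to `data/phages`. The function names, the output file naming, and the directory layout are assumptions. The commands are only built and printed here; passing each list to `subprocess.run` would execute them.

```python
from pathlib import Path


def makeblastdb_command(fasta="Mycobacteriophages-All.fasta", out="phagedb"):
    """Build the makeblastdb call quoted in the instructions above."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl",
            "-title", "PhageDatabase", "-out", out]


def blastn_commands(spacer_dir, result_dir, db="phagedb"):
    """Build one blastn call per .fasta file in spacer_dir (sorted by name)."""
    commands = []
    for query in sorted(Path(spacer_dir).glob("*.fasta")):
        # Hypothetical naming: results for X.fasta are written to X.txt.
        result = Path(result_dir) / (query.stem + ".txt")
        commands.append(["blastn", "-query", str(query),
                         "-db", db, "-out", str(result)])
    return commands


# Print the commands instead of running them.
print(" ".join(makeblastdb_command()))
for command in blastn_commands("data/spacers", "data/phages"):
    print(" ".join(command))
```

Building the commands as lists keeps file names with spaces intact and avoids shell quoting issues when they are eventually executed.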
##Visualization
- Paste the contents of `json.txt` into the `elements[]` field in the file `ui.js`. This creates the structure needed for cytoscape.js to plot the network. Various style fields can be changed; see the cytoscape.js documentation (or ask @MaxKFranz for help).
- Open the file `index.html` in a web browser.
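
For reference, cytoscape.js accepts elements as a list of objects with a `data` field: nodes carry an `id`, and edges carry `source` and `target` ids. The sketch below builds such a list from (bacterium, phage) pairs. The function name `build_elements`, the edge id scheme, and the example pairs are hypothetical, and the actual output of `interactions.py` may be structured differently.

```python
import json


def build_elements(interactions):
    """Turn (bacterium, phage) pairs into a cytoscape.js elements list."""
    nodes, edges = {}, []
    for bacterium, phage in interactions:
        # Add each node once, keyed by its name.
        for name in (bacterium, phage):
            nodes.setdefault(name, {"data": {"id": name}})
        # One edge per interaction, from bacterium to phage.
        edges.append({"data": {"id": f"{bacterium}->{phage}",
                               "source": bacterium, "target": phage}})
    return list(nodes.values()) + edges


# Made-up interaction pairs, for illustration only.
pairs = [("NC_000962", "PhageA"), ("NC_000962", "PhageB")]
elements = build_elements(pairs)
print(json.dumps(elements, indent=2))
```

Listing nodes before edges keeps the output easy to scan, and every `source` and `target` refers to a node `id` that is present in the same list.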