-
Notifications
You must be signed in to change notification settings - Fork 2
4. A typical workflow
As an example if there are 5 samples <S1, S2, S3, S4, and S5> we run the following commands. A typical pipleline to estimate the abundance of the contigs of S1 from the reads of all the other samples looks like,
1. Assembling the reads of Sample S1.
megahit -1 S1.READS-1.fq.gz -2 S1.READS-2.fq.gz -o S1_Assembly --out-prefix S1 --verbose
The contigs are available at S1_Assembly/S1.contigs.fa
2. Scaffolding the sample S1.
a. bowtie2-build S1_Assembly/S1.contigs.fa S1_Assembly/S1.idx
b. bowtie2 -x S1_Assembly/S1.idx -U S1.READS-1.fq.gz -2,S1.READS-2.fq.gz --no-unal |\
samtools view -bS > S1_merged.bam
c. samtools sort -n S1_merged.bam -o S1_sorted.bam
d. rm S1_merged.bam
e. python run.py -a S1_Assembly/S1.contigs.fa -m S1_sorted.bam -d S1_Scaffolds -k true -r true
f. python find_motifs.py -d S1_Scaffolds
3. Estimate per-base coverages from all samples
mkdir S1_Coverages
a. samtools sort S1_sorted.bam | genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1.txt
LC_ALL=C sort -k1,1 S1_Coverages/S1.txt > S1_Coverages/S1.sorted.txt
rm S1_Coverages/S1.txt
b. bowtie2 -x S1_Assembly/S1.idx -U S2.READS-1.fq.gz,S2.READS-2.fq.gz --no-unal |\
samtools view -bS | samtools sort |\
genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S2.txt
LC_ALL=C sort -k1,1 S1_Coverages/S1_S2.txt > S1_Coverages/S1_S2.sorted.txt
rm S1_Coverages/S1_S2.txt
c. bowtie2 -x S1_Assembly/S1.idx -U S3.READS-1.fq.gz,S3.READS-2.fq.gz --no-unal |\
samtools view -bS | samtools sort |\
genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S3.txt
LC_ALL=C sort -k1,1 S1_Coverages/S1_S3.txt > S1_Coverages/S1_S3.sorted.txt
rm S1_Coverages/S1_S3.txt
d. bowtie2 -x S1_Assembly/S1.idx -U S4.READS-1.fq.gz,S4.READS-2.fq.gz --no-unal |\
samtools view -bS | samtools sort |\
genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S4.txt
LC_ALL=C sort -k1,1 S1_Coverages/S1_S4.txt > S1_Coverages/S1_S4.sorted.txt
rm S1_Coverages/S1_S4.txt
e. bowtie2 -x S1_Assembly/S1.idx -U S5.READS-1.fq.gz,S5.READS-2.fq.gz --no-unal |\
samtools view -bS | samtools sort |\
genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S5.txt
LC_ALL=C sort -k1,1 S1_Coverages/S1_S5.txt > S1_Coverages/S1_S5.sorted.txt
rm S1_Coverages/S1_S5.txt
4. Run binnacle to compute coverages in the following order.
a. python Estimate_Abundances.py -g S1_Scaffolds/oriented.gml -a S1_Coverages/S1.sorted.txt \
-c S1_Assembly/S1.contigs.fa -d S1_Binnacle_Coverages -pre S1
b. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
-a S1_Coverages/S1_S2.sorted.txt -d S1_Binnacle_Coverages -pre S1_S2
c. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
-a S1_Coverages/S1_S3.sorted.txt -d S1_Binnacle_Coverages -pre S1_S3
d. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
-a S1_Coverages/S1_S4.sorted.txt -d S1_Binnacle_Coverages -pre S1_S4
e. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
-a S1_Coverages/S1_S5.sorted.txt -d S1_Binnacle_Coverages -pre S1_S5
f. python Collate.py -d S1_Binnacle_Coverages -m metabat
The files S1_Binnacle_Coverages/Feature-Matrix-metabat.txt
and S1_Binnacle_Coverages/Scaffolds.fasta
can be used as inputs to metabat2 to perform binning.
-
Assuming a cluster is available to process these samples, all the commands in step 3 can be run concurrently. Similarly commands 4b-4e can be run concurrently. However it requires that 4a is run before running any of the other jobs.
-
If the alignments are avaialble as bam or bed files, step 3 can be skipped and the alignment files can be provided as inputs to binnacle in step 4.
-
To prepare inputs to bin all other <S2, S3, S4, and S5> we repeat the same set of steps.