Skip to content

4. A typical workflow

Harihara Subrahmaniam Muralidharan edited this page Jul 3, 2021 · 7 revisions

As an example if there are 5 samples <S1, S2, S3, S4, and S5> we run the following commands. A typical pipleline to estimate the abundance of the contigs of S1 from the reads of all the other samples looks like,

1. Assembling the reads of Sample S1.

megahit -1 S1.READS-1.fq.gz -2 S1.READS-2.fq.gz -o S1_Assembly --out-prefix S1 --verbose

The contigs are available at S1_Assembly/S1.contigs.fa


2. Scaffolding the sample S1.

a. bowtie2-build S1_Assembly/S1.contigs.fa S1_Assembly/S1.idx

b. bowtie2 -x S1_Assembly/S1.idx -U S1.READS-1.fq.gz -2,S1.READS-2.fq.gz --no-unal |\
   samtools view -bS > S1_merged.bam

c. samtools sort -n S1_merged.bam -o S1_sorted.bam 

d. rm S1_merged.bam

e. python run.py -a S1_Assembly/S1.contigs.fa -m S1_sorted.bam -d S1_Scaffolds -k true -r true

f. python find_motifs.py -d S1_Scaffolds


3. Estimate per-base coverages from all samples

mkdir S1_Coverages

a. samtools sort S1_sorted.bam | genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1.txt
   LC_ALL=C sort -k1,1 S1_Coverages/S1.txt > S1_Coverages/S1.sorted.txt
   rm S1_Coverages/S1.txt

b. bowtie2 -x S1_Assembly/S1.idx -U S2.READS-1.fq.gz,S2.READS-2.fq.gz --no-unal |\
   samtools view -bS | samtools sort |\
   genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S2.txt
   LC_ALL=C sort -k1,1 S1_Coverages/S1_S2.txt > S1_Coverages/S1_S2.sorted.txt
   rm S1_Coverages/S1_S2.txt

c. bowtie2 -x S1_Assembly/S1.idx -U S3.READS-1.fq.gz,S3.READS-2.fq.gz --no-unal |\
   samtools view -bS | samtools sort |\
   genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S3.txt
   LC_ALL=C sort -k1,1 S1_Coverages/S1_S3.txt > S1_Coverages/S1_S3.sorted.txt
   rm S1_Coverages/S1_S3.txt

d. bowtie2 -x S1_Assembly/S1.idx -U S4.READS-1.fq.gz,S4.READS-2.fq.gz --no-unal |\
   samtools view -bS | samtools sort |\
   genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S4.txt
   LC_ALL=C sort -k1,1 S1_Coverages/S1_S4.txt > S1_Coverages/S1_S4.sorted.txt
   rm S1_Coverages/S1_S4.txt

e. bowtie2 -x S1_Assembly/S1.idx -U S5.READS-1.fq.gz,S5.READS-2.fq.gz --no-unal |\
   samtools view -bS | samtools sort |\
   genomeCoverageBed -ibam stdin -bga -split > S1_Coverages/S1_S5.txt
   LC_ALL=C sort -k1,1 S1_Coverages/S1_S5.txt > S1_Coverages/S1_S5.sorted.txt
   rm S1_Coverages/S1_S5.txt


4. Run binnacle to compute coverages in the following order.

a. python Estimate_Abundances.py -g S1_Scaffolds/oriented.gml -a S1_Coverages/S1.sorted.txt \
   -c S1_Assembly/S1.contigs.fa -d S1_Binnacle_Coverages -pre S1

b. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
   -a S1_Coverages/S1_S2.sorted.txt -d S1_Binnacle_Coverages -pre S1_S2

c. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
   -a S1_Coverages/S1_S3.sorted.txt -d S1_Binnacle_Coverages -pre S1_S3

d. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
   -a S1_Coverages/S1_S4.sorted.txt -d S1_Binnacle_Coverages -pre S1_S4

e. python Estimate_Abundances.py -o S1_Binnacle_Coverages/Coords_After_Delinking.txt\
   -a S1_Coverages/S1_S5.sorted.txt -d S1_Binnacle_Coverages -pre S1_S5

f. python Collate.py -d S1_Binnacle_Coverages -m metabat

The files S1_Binnacle_Coverages/Feature-Matrix-metabat.txt and S1_Binnacle_Coverages/Scaffolds.fasta can be used as inputs to metabat2 to perform binning.

Notes

  1. Assuming a cluster is available to process these samples, all the commands in step 3 can be run concurrently. Similarly commands 4b-4e can be run concurrently. However it requires that 4a is run before running any of the other jobs.

  2. If the alignments are avaialble as bam or bed files, step 3 can be skipped and the alignment files can be provided as inputs to binnacle in step 4.

  3. To prepare inputs to bin all other <S2, S3, S4, and S5> we repeat the same set of steps.

Clone this wiki locally