Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIV workflow error #168

Open
antoine4ucsd opened this issue Dec 28, 2024 · 18 comments
Open

HIV workflow error #168

antoine4ucsd opened this issue Dec 28, 2024 · 18 comments
Assignees

Comments

@antoine4ucsd
Copy link

Hello
I am very interested in evaluating v-pipe
I followed the quick-install instruction on my mac (latest gen)

curl -O 'https://raw.githubusercontent.com/cbg-ethz/V-pipe/master/utils/quick_install.sh'
bash quick_install.sh -w working
cd ./working

I set up the config and samples.tsv files (attached)
but I get attached error error when running

./vpipe --jobs 8 --printshellcmds

logs.zip

thank you in advance for your help

@DrYak DrYak self-assigned this Dec 28, 2024
@DrYak
Copy link
Member

DrYak commented Dec 28, 2024

Dear Antoine,

I followed the quick-install instruction on my mac (latest gen)

I suspect the key problem is the my mac (latest gen) part: support for ARM processors such as Apple Silicon in latest gen macs is considered experimental in Bioconda and not all software is available already.
Currently V-pipe doesn't yet support Apple Silicon.
Until we do, I would suggest using Rosetta to run the well-supported Intel version of all Bioconda components.

Regarding the instructions and tutorials, a new version of those will be available soon, courtesy of Google's Season of Docs grant. You can get a preview at: https://gsod-vpipe.readthedocs.io/

I set up the config and samples.tsv files (attached)

Note that you haven't specified a read length in either the config file and/or the 3rd column of the sample file.
V-pipe will automatically assume 250, which could be wrong in your case.

but I get attached error error when running

The main log sadly doesn't give enough information.
Could you please check the individual rules' logs, see this part in snakemake log:

Error in group 7f838ac9-e2e3-42fb-a580-b6bf04675769:
    jobs:
        rule extract:
            …
            log: results/073004LC/RCT/extracted_data/extract_R1.out.log, results/073004LC/RCT/extracted_data/extract_R1.err.log (check log file(s) for error details)
        rule gunzip:
            …
            log: /Volumes/LGHD/_DOLUVOIR/2024-12-26/samples/073004LC/RCT/raw_data/073004LC_RCT_R1_fastq_gunzip.out.log, /Volumes/LGHD/_DOLUVOIR/2024-12-26/samples/073004LC/RCT/raw_data/073004LC_RCT_R1_fastq_gunzip.err.log (check log file(s) for error details)
…

maybe other additional problems lurk beyond the CPU type mentioned above.

@antoine4ucsd
Copy link
Author

thank you for the quick follow up. I will try to set up on a linux platform and keep you posted

@antoine4ucsd
Copy link
Author

just a quick follow up when installing on linux server

bash quick_install.sh -p vp-analysis -w work

I get this error

======================
installing V-pipe
======================

Using branch:	gsod-rtd
quick_install.sh: line 210: git: command not found
Argh: I cannot install branch gsod-rtd.

but git is already installed

git --version
git version 2.23.0

@antoine4ucsd
Copy link
Author

@antoine4ucsd
Copy link
Author

after ~20 min, the run stops
files created for each sample look like that
image

but jobs are not completed. I am attaching the log and yaml.

run on linux server with

./vpipe --jobs 48 --printshellcmds

thoughts?
thank you for your help
logs.zip

@DrYak
Copy link
Member

DrYak commented Dec 29, 2024

FYI: looks like it has to do with (git error above)

As I mentionned, the gsod-vpipe fork is a preview, some details need ironning out... 😅

thoughts?

According to the logs, the custom specialized alligner ngshmmalign that we use for highly variable viruses is crashing:

Error in rule hmm_align:
    jobid: 156
    …
    log: /martinlab/users/achaillon/_doluvoir/results/34003BD/RCT/alignments/ngshmmalign.out.log, /martinlab/users/achaillon/_doluvoir/results/34003BD/RCT/alignments/ngshmmalign.err.log (check log file(s) for error details)

Could you have a look into the log files (see above) specific to this rule job?

Note

Note though that currently I am mostly working on SARS-CoV-2, so my ngshmmalign is a bit rusty. And the former PhD student who wrote that has defended his thesis and moved out eons ago. (Maybe some of my colleague who have ran workshops could answer better once they're back from vacations?)

@antoine4ucsd
Copy link
Author

this is the error I get in align

ERROR: Count argument 'doluvoir-results-034001RS-LN' is not an integral/floating point value! Aborting.

log files attached
if it matters, I am using HXB2 as ref but I am only sequencing partial env. would you suggest to edit the ref fasta/gff? can you provide more details?
thank you!

log.align.zip

@DrYak
Copy link
Member

DrYak commented Dec 29, 2024

Okay found the error.
ngshmmalign is slightly buggy and doesn't like some characters (like _ or .) in either the reference name or the path, due to some sub-optimal way (start from left instead of start from right) it splits the names of temporary files (e.g.: looking for some _*{param}* at the end of the file).

In your specific case it's accidentally picking up the (/**_doluvoir**/results/34003BD) part of the working path and using it for some internal temporary file as a sequence name, and then splitting that name string at **_doluvoir**-results-034001RS-LN instead of the actual floating-point number it should be looking at.

Temporary work-around

Use relative path

You should test putting relative paths in your config file, if the way you have setup you cluster job allows you.
e.g. if your job starts with:

# go into working directory
cd  /martinlab/users/achaillon/vp-analysis/work
# symlink the input and output directories into the currrent working directory
ln -s /martinlab/users/achaillon/_doluvoir/samples ./
ln -s /martinlab/users/achaillon/_doluvoir/results ./
# run  V-pipe
./vpipe --jobs 48 --printshellcmds 

Then you could use this configuration:

#

input:
    samples_file: samples.tsv
    datadir: samples/
    read_length: 150

output:
    datadir: results/
#

This will cause snakemake to use relative paths in commands it generates (compare the Mac's logs from your first post: log: results/073004LC/RCT/extracted_data/extract_R1.out.log, and the current logs from the Linux cluster: log: /martinlab/users/achaillon/_doluvoir/results/34003BD/RCT/alignments/ngshmmalign.out.log)

No _ nor . in path

Alternatively you can try changing the path of the data so that it doesn't contain _ characters (e.g. renaming directories, or using symlinks ln -s _doluvoir doluvoir), so that the absolute path don't throw off ngshmmalign.

Long term fix

I cannot vouch when we will have time to fix. I am the only C/C++ dev available on the team and currently very busy with the respiratory viruses suveillance. It's also hard to hire some 3rd party C/C++ dev as they would need some biologogy/bioinformatics knowledge to understand ngshmmalign.

@antoine4ucsd
Copy link
Author

you are amazing! trying now and will confirm asap
(I assume haplotype algorithm by default is haploclique? and snv_caller is shorah? I can find much details on this and all available options. would be nice to get outputs with all algorithms to compare. as you know, we can have quite big differences between algo. these algorithms also have internal options that can strongly impact the results)

hopefully good news to come soon
thank you!

@DrYak
Copy link
Member

DrYak commented Dec 29, 2024

beside that, here are some additional info:

if it matters, I am using HXB2 as ref

HXB2 is fine.

  • it doesn't have a . in the name so ngshmmalign is happy 🙃
  • ngshmmalign wil take care of successfully mapping reads to highly variable region if it's covered in your data set.

but I am only sequencing partial env

Yes we support sequencing of only some region (incidentally, that's how hiv is tested in our CI/CD tests)

(You might be interested in having a look at primer trimming (section output, property trim_primers) in addition if you want to remove them from the aligned reads. We use that for SARS-CoV-2. V-pipe can use either iVar primerstrim (default) or samtools ampliconclip (selected in the configuration in section general, property primers_trimmer)

@DrYak
Copy link
Member

DrYak commented Dec 29, 2024

I assume haplotype algorithm by default is haploclique? and snv_caller is shorah?

This is controlled in the general section of the configuration file.

  • Currently, the default for global (full) haplotype is SAVAGE
    (though HaploClique is available, as is also PredictHaplo)
  • And the default for local (in windows) haplotype and SNVs is indeed ShoRAH
    (alternatives are: VILOCA, and for SNVs-only butno local haplotype: LoFreq)

I can find much details on this and all available options.

Tools alternatives are controlled in the general section.

In the directory where you checked out V-pipe (that should be /martinlab/users/achaillon/vp-analysis/V-pipe if I understood correctly your install commands), there is a sub-directory config/.

  • Check the file config/README.md, it has a quick intro about the configuration.
  • Check the file config/config.html , it has an exhaustive manual of all exiting options including documenting the default options (sorry, htmlpreview has a bug, I can't link, you need to open your local copy)
  • Check the file config/hiv.yaml, it show the options which are overloaded for the hiv base config (it changes the defaultt aligner to ngshmmalign to better deal with hyper variable regions)

would be nice to get outputs with all algorithms to compare.

Yes, indeed.

@LaraFuhrmann might give you some advice once she's back from holidays, as V-pipe has introduced a new benchmark mode that you could try. There is some documentation in resources/auxiliary_workflows/benchmark/README.md.

For local haplotypes, you could try VILOCA, a more recent evolution of ShoRAH, also Lara's thesis subject -- so she could advise.
It works best if you could provide it information in how you sequenced the region (is the whole envelope gene covered in a single pair of primers that you then fragment randomly and sequence 150 pair-end, or are there one or multiple PCR amplifications whose amplicon you sequence whole? If that's the later case, providing a BED file describing the amplicons can help the results).

For global haplotype, it's more hit and miss. Our testing have shown some edge in PredictHaplo, but all methods crash on some data set. (And we don't have an ETA on fixing PredictHaplo due to the same "not enough C/C++ devs with biology background" situation as ngshmmalign 😅)

In general, the local haplotype reconstruction (based on windows -- either evenly spread with random fragmentations, or aligned to amplicons if full amplicons are sequenced with, e.g., multipex PCR amplicons) tends to be much more mature and stable than global haplotype (full lenght genome where multiple short reads needs to be stitched together).

Also note that despite the 'ShoR' in 'ShoRAH' standing for 'short reads', it also works well on longer reads and especially VILOCA can even leverage the quality score into its new model (very useful on Nanopore long-reads)

So depending on the research question you're trying to answer, local haploypes might work better for you.

@antoine4ucsd
Copy link
Author

all sounds great!
as you can guess, I tested multiple approach : predicthaplo, cliqueSNV, shorah, etc. I never tried VILOCA but will have a look.
for this particular project, it is just short paired end illumina of v2-v3 but I have several ongoing projects going from bulk full genome sequencing to SGA, etc.
I also tested interesting pipelines to assess intactness/
I would be happy to discuss further with your team

@DrYak
Copy link
Member

DrYak commented Dec 29, 2024

Oh, that sounds cool,.

I would be happy to discuss further with your team

It would be good to approach our professor (Prof Niko Bereenwinkel, see our group's page)

@antoine4ucsd
Copy link
Author

just a quick follow up.
after reinstalling my workflow and moving files into directories without '_' , the run started without problem.
I limited my list to 3 samples only (covering HIV partial env)
While this is still running after 22:31:34 , I am surprised how slow this is.
the log file is attached for the entire run and for one alignment which I think is the bottleneck. one issue may be that I used HXB2 full genome (default) as ref.
as for the command line, I used

./vpipe --jobs 48 --printshellcmds

should I adjust n-core? I am running it from a server

#SBATCH --tasks-per-node=96
#SBATCH --mem=192G

alignment is only done for one of the 3 samples (took 8hrs??) and still far from finished for the other 2. logs also attached.

thoughts? should I modify the yaml? other params ? the refs?

thank you!
vpip.run.zip

@antoine4ucsd
Copy link
Author

after nearly 48 hrs, run failed with attached log. no haplotypes generated. I can be wrong but alignement step seems really long for such small region. there might also be problem with windows for shorah/viloca.
Also trying viloca separately but getting stuck as well. still working on the first sample after ~20 hrs.
for the latter, I used the below command line after mapping and sorting the fastq files with samtools/bwa and HIV full length env as ref.
I enforced the windows size which may not be appropriate...

viloca run -f "$REFERENCE" -b "$bam_file" -s 1 --windowsize 535 --win_coverage 100 --mode shorah

my final goal is to get full length haplotypes (~530bp for this project) and other metrics such as diversity,etc. I was able to generate haplotypes with cliqueSNV without issues.

all suggestions are welcome
have a nice end of 2024 and happy new year

vpipe-249574.out.txt

viloca-249586.out.txt

@LaraFuhrmann
Copy link
Collaborator

Hi Antoine,

thanks for your message. Looking at vpipe-249574.out.txt I found this line in the report: Good sequences (singletons file 2): 1,002 (0.57%) which indicates that <1% of reads are considered "good" by the quality control.
Have you had a look at the alignment file itself? Does it look good to you in terms of coverage and coverage distribution?

Regarding, VILOCA: The tool can only recover haplotypes of roughly read length, so no global haplotypes. If I understood correclty your reads are of length 150bp, hence it is not possible to recover haplotypes of lenght 535bps.
I would have expected CliqueSNV to provide those global haplotypes.

@antoine4ucsd
Copy link
Author

thank you Lara
Yes I saw the % of good sequences but does not seem right. Is there a way to overcome this?
the coverage is ok (>10k reads) for most samples apart from one highly variable more problematic region (~30 bp) that is not well coverage in ~50% of the samples.
I was hoping to run vpipe up to its end to get a sense of the data being generated

regarding the haplotype reconstruction, I get it. I thought that the 'global=TRUE' option would allow me to get those 'FL" haplotypes. I confirm CliqueSNV works but I like the option of ignoring region with coverage < xx (ideally replace by N). I also explored other option like haploclique or haploconduct with shorah. all suggestions are welcome.

@LaraFuhrmann
Copy link
Collaborator

Hi Antoine,

I am not an expert on the quality control step, maybe @DrYak you can give some input on that?

For the haplotype reconstruction, yes, 'global=TRUE' option allows you to run global haplotype reconstruction tools. However, none of the tools can guarantee the output of full-length haplotypes for any dataset. Global haplotype reconstruction from short-reads is still a complex problem.
For the masking of low-coverage regions, you could think of some post-processing step that masks low-coverage regions in your CliqueSNV haplotypes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants