Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

giab-NA12878 truth_small_variants.vcf.gz header issues #1566

Closed
heuermh opened this issue Jun 15, 2017 · 4 comments
Closed

giab-NA12878 truth_small_variants.vcf.gz header issues #1566

heuermh opened this issue Jun 15, 2017 · 4 comments
Milestone

Comments

@heuermh
Copy link
Member

heuermh commented Jun 15, 2017

Two issues found in this validation VCF file: they use Type=String for the PS phase set VCF INFO field, and specify a BAD_PS phase set for GT VCF FORMAT field with Number=-1 cardinality restriction.

$ wget https://s3.amazonaws.com/biodata/collections/hg38/validation/giab-NA12878/truth_small_variants.vcf.gz

$ adam-submit transformVariants \
  truth_small_variants.vcf.gz \
  truth_small_variants.variants.adam

Command body threw exception:
java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
Exception in thread "main" java.lang.IllegalArgumentException: Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
	at org.bdgenomics.adam.converters.VariantContextConverter$.org$bdgenomics$adam$converters$VariantContextConverter$$auditLine$1(VariantContextConverter.scala:172)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1$$anonfun$apply$3.apply(VariantContextConverter.scala:191)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1$$anonfun$apply$3.apply(VariantContextConverter.scala:190)
	at scala.Option.fold(Option.scala:158)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1.apply(VariantContextConverter.scala:190)
	at org.bdgenomics.adam.converters.VariantContextConverter$$anonfun$cleanAndMixInSupportedLines$1.apply(VariantContextConverter.scala:185)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
	at org.bdgenomics.adam.converters.VariantContextConverter$.cleanAndMixInSupportedLines(VariantContextConverter.scala:185)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:1075)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVcf$1.apply(ADAMContext.scala:1060)
	at scala.Option.fold(Option.scala:158)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMContext.loadVcf(ADAMContext.scala:1060)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1730)
	at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1728)
	at scala.Option.fold(Option.scala:158)
	at org.apache.spark.rdd.Timer.time(Timer.scala:48)
	at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:1726)

$ adam-submit transformVariants \
  truth_small_variants.vcf.gz \
  truth_small_variants.variants.adam \
  -stringency LENIENT

2017-06-14 17:29:38 WARN  ADAMContext:175 - Field type for provided header line (FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set for GT">) does not match supported line (FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set ID">)
2017-06-14 17:29:39 WARN  VariantContextConverter:336 - Caught exception java.lang.NumberFormatException: For input string: "PATMAT" when converting [VC Unknown @ chr18:41959868 Q50.00 of type=SNP alleles=[G*, A] attr={callable=[CS_HiSeqPE300xGATK_callable, CS_CGnormal_callable, CS_HiSeqPE300xfreebayes_callable], callsetnames=[HiSeqPE300xGATK, CGnormal, HiSeqPE300xfreebayes, 10XGATKhaplo, SolidSE75GATKHC], callsets=5, datasetnames=[HiSeqPE300x, CGnormal, 10XChromium, SolidSE75bp], datasets=4, datasetsmissingcall=[IonExome, SolidPE50x50bp], platformnames=[Illumina, CG, 10X, Solid], platforms=4} GT=[[INTEGRATION A|G* GQ 1213 DP 885 AD 232,257 {ADALL=193,203, IGT=0/1, PS=PATMAT}]].
2017-06-14 17:29:39 WARN  VariantContextConverter:336 - Caught exception java.lang.NumberFormatException: For input string: "PATMAT" when converting [VC Unknown @ chr10:28062598 Q50.00 of type=SNP alleles=[T*, C] attr={callable=CS_HiSeqPE300xfreebayes_callable, callsetnames=[HiSeqPE300xfreebayes, HiSeqPE300xGATK, 10XGATKhaplo, SolidSE75GATKHC], callsets=4, datasetnames=[HiSeqPE300x, 10XChromium, SolidSE75bp], datasets=3, datasetsmissingcall=[CGnormal, IonExome, SolidPE50x50bp], platformnames=[Illumina, 10X, Solid], platforms=3} GT=[[INTEGRATION C|C GQ 337 DP 353 AD 0,0 {ADALL=0,181, IGT=1/1, PS=PATMAT}]].
...

$ adam-submit transformVariants \
  truth_small_variants.vcf.gz \
  truth_small_variants.variants.adam \
  -stringency SILENT

$ adam-shell

scala> import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.ADAMContext._

scala> val variants = sc.loadVariants("truth_small_variants.variants.adam")
warning: while trying to read VCF header from file received exception: htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Count < 0 for fixed size VCF header field BAD_PS
htsjdk.tribble.TribbleException: Input stream does not contain a BCF encoded file; BCF magic header info not found, at record 0 with position 0:
  at htsjdk.variant.bcf2.BCF2Codec.error(BCF2Codec.java:478)
  at htsjdk.variant.bcf2.BCF2Codec.readHeader(BCF2Codec.java:149)
  at org.seqdoop.hadoop_bam.util.VCFHeaderReader.readHeaderFrom(VCFHeaderReader.java:67)
  at org.bdgenomics.adam.rdd.ADAMContext.org$bdgenomics$adam$rdd$ADAMContext$$readVcfHeader(ADAMContext.scala:228)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadHeaderLines$1.apply(ADAMContext.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
  at org.bdgenomics.adam.rdd.ADAMContext.loadHeaderLines(ADAMContext.scala:234)
  at org.bdgenomics.adam.rdd.ADAMContext.loadParquetVariants(ADAMContext.scala:1175)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1733)
  at org.bdgenomics.adam.rdd.ADAMContext$$anonfun$loadVariants$1.apply(ADAMContext.scala:1728)
  at scala.Option.fold(Option.scala:158)
  at org.apache.spark.rdd.Timer.time(Timer.scala:48)
  at org.bdgenomics.adam.rdd.ADAMContext.loadVariants(ADAMContext.scala:1726)
  ... 50 elided

$ cat truth_small_variants.variants.adam/_header 
##fileformat=VCFv4.2
##CL=vcffilter -i filtered-phase-transfer.vcf.gz -o - --javascript "ensureFormatHeader(\"##FORMAT=<ID=PS,Number=1,Type=String,Description=\\\"Phase set for GT\\\">\"); function record() {if(INTEGRATION.GT==\"1/1\") { INTEGRATION.IPS=\".\"; INTEGRATION.PS=\"HOMVAR\"; INTEGRATION.GT=\"1|1\";} else {if((INTEGRATION.GT==\"0/1\" || INTEGRATION.GT==\"1/2\" || INTEGRATION.GT==\"2/1\" || INTEGRATION.GT==\"1/0\") ) {if(INTEGRATION.IPS.length>1) {INTEGRATION.PS=INTEGRATION.IPS; INTEGRATION.GT=INTEGRATION.IGT;} else {INTEGRATION.PS=\".\";};} else { if((INTEGRATION.IPS.length<2)) { INTEGRATION.IPS=\".\";} INTEGRATION.PS=\"PATMAT\";};};}"
##FILTER=<ID=GQlessthan70,Description="Sum of GQ for datasets with this genotype less than 70">
##FILTER=<ID=alleleimbalance,Description="Filtered calls where the net allele balance for unfiltered datasets is <0.2 or >0.8">
##FILTER=<ID=allfilteredanddisagree,Description="All callsets have this call filtered or outside the callable regions and they have discordant genotypes or variant calls">
##FILTER=<ID=allfilteredbutagree,Description="All callsets have this call filtered or outside the callable regions but they have the same genotype">
##FILTER=<ID=cgonly,Description="Filtered calls where only Complete Genomics had this call and it was completely missing from any other callset">
##FILTER=<ID=discordanthet,Description="Filtered calls where a passing call is het and a high GQ but filtered call is hom var, since often the het is wrong">
##FILTER=<ID=discordantunfiltered,Description="Callsets with unfiltered calls have discordant genotypes or variant calls">
##FILTER=<ID=overlappingcall,Description="Filtered sites that are within 50bp of another passing call but none of the callsets that support the 2 calls match">
##FILTER=<ID=questionableindel,Description="Filtered calls where some callsets have a filtered indel larger than 10bp and another dataset has an implied homozygous reference call">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=ADALL,Number=R,Type=Integer,Description="Net allele depths across all datasets">
##FORMAT=<ID=BAD_PS,Number=-1,Type=String,Description="Phase set for GT">
...
@fnothaft
Copy link
Member

@heuermh is there anything actionable for us with this issue?

@heuermh
Copy link
Member Author

heuermh commented Jun 22, 2017

For the first error message, failing on Type=String for phase set PS is going to be common.

I was thinking of creating an issue to fix the second error message (tribble complaining about BCF), but that is actually an issue with Hadoop-BAM HadoopGenomics/Hadoop-BAM#132

@fnothaft
Copy link
Member

Oh, I see. Sorry, I misread this issue as their file had PS=Int in the header but PS=String in the VCF lines. Ignore me.

@heuermh
Copy link
Member Author

heuermh commented Aug 22, 2017

The BAD_PS phase set for GT VCF FORMAT field with Number=-1 cardinality restriction header line comes from ADAM. This is the actionable issue here.

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/converters/VariantContextConverter.scala#L180

Grr... looks like we need to switch on countType, if it is INTEGER we need to use the ctr with int count instead.

See
https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFCompoundHeaderLine.java#L54

https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFCompoundHeaderLine.java#L124

Note this.count is not set here
https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/variant/vcf/VCFCompoundHeaderLine.java#L144

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants