Condition barcode count summary report #31

Closed
wants to merge 12 commits into from
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog

## 3.10.0
* Machine-parseable condition barcode summary file

## 3.9.0
* Use sampling technique for generating unexpected sequence reports
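
For the 3.10.0 entry, a minimal sketch of how a downstream tool might read the new summary file. It assumes the file is tab-separated and starts with the header line that `QualityWriter` prints in this PR. `ConditionSummaryParser` and the sample row values in the usage note are hypothetical, and every field is kept as a string because the diff does not show how the numeric columns are formatted.

```scala
object ConditionSummaryParser {

  // Column names copied from the header line printed by QualityWriter in this PR.
  val ExpectedHeader: Vector[String] = Vector(
    "Barcode",
    "Condition",
    "Matched (Construct+Sample Barcode)",
    "Matched Sample Barcode",
    "% Match",
    "Normalized Match"
  )

  // Parses header + rows into one Map per row, keyed by column name.
  // Values stay as strings (assumption: numeric formats are not specified here).
  def parse(lines: Iterator[String]): Either[String, Vector[Map[String, String]]] =
    if (!lines.hasNext) Left("empty file")
    else {
      val header = lines.next().split("\t", -1).toVector
      if (header != ExpectedHeader) Left(s"unexpected header: ${header.mkString(", ")}")
      else {
        val rows = lines
          .filter(_.nonEmpty)
          .map(line => ExpectedHeader.zip(line.split("\t", -1).toVector).toMap)
          .toVector
        Right(rows)
      }
    }

}
```

To read from disk, the iterator could come from `scala.io.Source.fromFile(path).getLines()`, wrapped in `scala.util.Using` so the source is closed afterwards.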

2 changes: 1 addition & 1 deletion Makefile
@@ -1,4 +1,4 @@
fullversion := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9+]\.[0-9]+).*$$/$$1/g')
fullversion := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9]+\.[0-9]+).*$$/$$1/g')

version := $(shell grep -m 1 'ThisBuild / version :=' ./version.sbt | perl -pe 's/^ThisBuild \/ version := "([0-9]+\.[0-9+]).*$$/$$1/g')
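
The effect of moving the `+` outside the character class can be checked in isolation. The sketch below ports the Perl patterns to Scala regexes; `VersionRegexDemo` is a hypothetical name, and the sample `version.sbt` line is an assumption based on the `grep` pattern above. `[0-9+]` matches exactly one character (a digit or a literal `+`), so the old pattern cannot match a two-digit minor version.

```scala
import scala.util.matching.Regex

object VersionRegexDemo {

  // Pattern before the fix: `[0-9+]` is a class matching ONE digit or a literal '+'.
  val OldFull: Regex = """^ThisBuild / version := "([0-9]+\.[0-9+]\.[0-9]+).*$""".r

  // Pattern after the fix: `[0-9]+` matches one or more digits.
  val NewFull: Regex = """^ThisBuild / version := "([0-9]+\.[0-9]+\.[0-9]+).*$""".r

  // The shorter `version` pattern, which still uses the single-character class.
  val OldShort: Regex = """^ThisBuild / version := "([0-9]+\.[0-9+]).*$""".r

  // Returns the captured version string, if the pattern matches the line.
  def extract(pattern: Regex, line: String): Option[String] =
    pattern.findFirstMatchIn(line).map(_.group(1))

  def main(args: Array[String]): Unit = {
    val line = "ThisBuild / version := \"3.10.0\"" // assumed version.sbt content
    println(extract(OldFull, line))  // None
    println(extract(NewFull, line))  // Some(3.10.0)
    println(extract(OldShort, line)) // Some(3.1)
  }

}
```

Under these assumptions, the `version` line shown above would yield `3.1` rather than `3.10` for a `3.10.0` build.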

24 changes: 12 additions & 12 deletions README.md
@@ -1,16 +1,16 @@
# PoolQ 3.0

Copyright (c) 2022 Genetic Perturbation Platform, The Broad Institute of Harvard and MIT.
Copyright (c) 2024 Genetic Perturbation Platform, The Broad Institute of Harvard and MIT.

[![Build Status](https://github.com/broadinstitute/poolq/actions/workflows/ci.yml/badge.svg)](https://github.com/broadinstitute/poolq/actions/workflows/ci.yml)

## Overview

PoolQ is a counter for indexed samples from next-generation sequencing of pooled DNA. Given a set
of sequencing data files (FASTQ, SAM, or BAM), and a pair of reference files mapping DNA barcodes
to construct or experimental identifiers, PoolQ reads the sequencing data and tallies the
to construct or experimental identifiers, PoolQ reads the sequencing data and tallies the
co-occurrence of each pair of barcodes from the two files, yielding a two-dimensional histogram.
The barcodes in one reference file are treated as rows in the histogram; the other correspond to
The barcodes in one reference file are treated as rows in the histogram; the other correspond to
columns.

PoolQ is capable of locating barcodes within reads using a variety of techniques:
@@ -22,16 +22,16 @@ It matches barcodes to reference data either exactly or allowing up to one base
PoolQ does not support matching with gaps or deletions.

In addition to producing a histogram, PoolQ generates a number of reports, which contain statistics and
other information that can be used to troubleshoot experiments. These include match percentages, barcode
other information that can be used to troubleshoot experiments. These include match percentages, barcode
locations, matching correlations between barcodes, and lists of frequently-occurring unknown barcodes.

## Documentation
For information on how to run PoolQ and its various modes and options, please see the
For information on how to run PoolQ and its various modes and options, please see the
[manual](docs/MANUAL.md). We also maintain a [changelog](CHANGELOG.md) listing updates made to PoolQ.

As of version 3.5.0, the source code to PoolQ is available under a [BSD 3-clause license](LICENSE). We
As of version 3.5.0, the source code to PoolQ is available under a [BSD 3-clause license](LICENSE). We
welcome contributions to PoolQ and have created a [contributor guide](CONTRIBUTING.md). Additionally,
we maintain a [list](NOTICE.txt) of other open-source libraries PoolQ depends on, along with links to
we maintain a [list](NOTICE.txt) of other open-source libraries PoolQ depends on, along with links to
associated licenses.

## Changes in PoolQ 3
@@ -40,7 +40,7 @@ PoolQ was completely rewritten for version 3. The new code is faster and the cod
and more maintainable. We have taken the opportunity to make other changes to PoolQ as well.

* There are substantial changes to the command-line interface for the program.
* The default counts file format has changed slightly, although there is a command-line
* The default counts file format has changed slightly, although there is a command-line
argument that indicates that PoolQ 3 should write a backwards-compatible counts file. The differences
are in headers only; file parsers should be able to adapt easily.
* The quality file has changed somewhat. Importantly, the definition of certain statistics has changed
@@ -51,14 +51,14 @@ See the [manual](docs/MANUAL.md) for complete details on the differences version

## PoolQ 2 support

We will continue to make the PoolQ 2.4 artifacts available for download on the
We will continue to make the PoolQ 2.4 artifacts available for download on the
[GPP portal](https://portals.broadinstitute.org/gpp/public/software/poolq). We have no plans to add
features to the code. We will address bugs on a case-by-case basis; in general only critical
features to the code. We will address bugs on a case-by-case basis; in general only critical
bugfixes will be ported to versions prior to 2.4, effective immediately.

## Maintainers
## Maintainers

PoolQ was originally developed by John Sullivan and Shuba Gopal of the Broad Institute RNAi Platform. It
PoolQ was originally developed by John Sullivan and Shuba Gopal of the Broad Institute RNAi Platform. It
is maintained by Mark Tomko of the Broad Institute Genetic Perturbation Platform.

## Contact Us
8 changes: 5 additions & 3 deletions docs/MANUAL.md
@@ -2,7 +2,7 @@

PoolQ is a counter for indexed samples from next-gen sequencing of pooled DNA.

_This documentation covers PoolQ version 3.7.0 (last updated 09/05/2023)._
_This documentation covers PoolQ version 3.10.0 (last updated 02/08/2024)._

## Background

@@ -559,7 +559,7 @@ PoolQ you will need a Java 8 JDK. You can download an appropriate JRE or JDK fro
You can download PoolQ from an as yet undetermined location. The file you download is a ZIP file
that you will need to unzip. In most cases, this is as simple as right-clicking on the zip file, and
selecting something like "extract contents" from the popup menu. This will create a new folder on
your computer named `poolq-3.7.0`, with the following contents:
your computer named `poolq-3.10.0`, with the following contents:

- `poolq3.jar`
- `poolq3.bat`
@@ -627,7 +627,7 @@ how to launch programs from the command line on your given operating system.
If you successfully launched PoolQ, you should see a usage message explaining all of the
command-line options:

poolq3 3.7.0
poolq3 3.10.0
Usage: poolq [options]

--row-reference <file> reference file for row barcodes (i.e., constructs)
@@ -652,6 +652,7 @@ command-line options:
--umi-counts-dir <file>
--umi-barcode-counts-dir <file>
--quality <file>
--condition-barcode-counts-summary <file>
--counts <file>
--normalized-counts <file>
--barcode-counts <file>
@@ -661,6 +662,7 @@ command-line options:
--correlation <file>
--run-info <file>
--unexpected-sequence-threshold <number>
--unexpected-sequence-sample-pct <pct>
--unexpected-sequences <file>
--umi-quality <file>
--unexpected-sequence-cache <cache-dir>
19 changes: 17 additions & 2 deletions src/main/scala/org/broadinstitute/gpp/poolq3/PoolQ.scala
@@ -33,6 +33,7 @@ import org.broadinstitute.gpp.poolq3.reports.{
}
import org.broadinstitute.gpp.poolq3.types.{
BarcodeCountsFileType,
ConditionBarcodeCountsSummaryFileType,
CountsFileType,
LogNormalizedCountsFileType,
OutputFileType,
@@ -49,7 +50,14 @@ object PoolQ {
private[this] val log: Logger = getLogger

private[this] val AlwaysWrittenFiles: Set[OutputFileType] =
Set(CountsFileType, QualityFileType, LogNormalizedCountsFileType, BarcodeCountsFileType, RunInfoFileType)
Set(
CountsFileType,
QualityFileType,
ConditionBarcodeCountsSummaryFileType,
LogNormalizedCountsFileType,
BarcodeCountsFileType,
RunInfoFileType
)

final def main(args: Array[String]): Unit =
PoolQConfig.parse(args) match {
@@ -169,7 +177,14 @@ object PoolQ {
config.reportsDialect
)
_ = log.info(s"Writing quality file ${config.output.qualityFile}")
_ <- QualityWriter.write(config.output.qualityFile, state, rowReference, colReference, config.isPairedEnd)
_ <- QualityWriter.write(
config.output.qualityFile,
config.output.conditionBarcodeCountsSummaryFile,
state,
rowReference,
colReference,
config.isPairedEnd
)
_ <- umiInfo.fold(().pure[Try])(_ => UmiQualityWriter.write(config.output.umiQualityFile, state))
_ = log.info(s"Writing log-normalized counts file ${config.output.normalizedCountsFile}")
normalizedCounts = LogNormalizedCountsWriter.logNormalizedCounts(counts, rowReference, colReference)

@@ -74,6 +74,7 @@ final case class PoolQOutput(
normalizedCountsFile: Path = Paths.get("lognormalized-counts.txt"),
barcodeCountsFile: Path = Paths.get("barcode-counts.txt"),
qualityFile: Path = Paths.get("quality.txt"),
conditionBarcodeCountsSummaryFile: Path = Paths.get("condition-barcode-counts-summary.txt"),
correlationFile: Path = Paths.get("correlation.txt"),
unexpectedSequencesFile: Path = Paths.get("unexpected-sequences.txt"),
umiQualityFile: Path = Paths.get("umi-quality.txt"),
@@ -253,6 +254,11 @@ object PoolQConfig {
val _ =
opt[Path]("quality").valueName("<file>").action((f, c) => c.copy(output = c.output.copy(qualityFile = f)))

val _ =
opt[Path]("condition-barcode-counts-summary")
.valueName("<file>")
.action((f, c) => c.copy(output = c.output.copy(conditionBarcodeCountsSummaryFile = f)))

val _ = opt[Path]("counts").valueName("<file>").action((f, c) => c.copy(output = c.output.copy(countsFile = f)))

val _ = opt[Path]("normalized-counts").valueName("<file>").action { (f, c) =>

@@ -16,17 +16,39 @@ import org.broadinstitute.gpp.poolq3.reference.Reference

object QualityWriter {

class TeeWriter(w1: PrintWriter, w2: PrintWriter) {

def print(s: String): Unit = {
w1.print(s)
w2.print(s)
}

def println(s: String): Unit = {
w1.println(s)
w2.println(s)
}

def println(): Unit = {
w1.println()
w2.println()
}

}

def write(
file: Path,
qualityFile: Path,
conditionBarcodeCountsSummaryFile: Path,
state: State,
rowReference: Reference,
colReference: Reference,
isPairedEnd: Boolean
): Try[Unit] =
Using(new PrintWriter(file.toFile)) { writer =>
val barcodeLocationStats =
if (isPairedEnd) {
s"""Reads with no construct barcode: ${state.rowBarcodeNotFound + state.revRowBarcodeNotFound - state.neitherRowBarcodeFound}
Try {
Using.resources(new PrintWriter(qualityFile.toFile), new PrintWriter(conditionBarcodeCountsSummaryFile.toFile)) {
case (qualityWriter, cbcsWriter) =>
val barcodeLocationStats =
if (isPairedEnd) {
s"""Reads with no construct barcode: ${state.rowBarcodeNotFound + state.revRowBarcodeNotFound - state.neitherRowBarcodeFound}
|
|Reads with no forward construct barcode: ${state.rowBarcodeNotFound}
|Max forward construct barcode index: ${state.rowBarcodeStats.maxPosStr}
@@ -38,15 +60,15 @@ object QualityWriter {
|Min reverse construct barcode index: ${state.revRowBarcodeStats.minPosStr}
|Avg reverse construct barcode index: ${decOptFmt(state.revRowBarcodeStats.avg)}""".stripMargin

} else {
s"""Reads with no construct barcode: ${state.rowBarcodeNotFound}
} else {
s"""Reads with no construct barcode: ${state.rowBarcodeNotFound}
|Max construct barcode index: ${state.rowBarcodeStats.maxPosStr}
|Min construct barcode index: ${state.rowBarcodeStats.minPosStr}
|Avg construct barcode index: ${decOptFmt(state.rowBarcodeStats.avg)}""".stripMargin
}
}

val header =
s"""Total reads: ${state.reads}
val header =
s"""Total reads: ${state.reads}
|Matching reads: ${state.matches}
|1-base mismatch reads: ${state.matches - state.exactMatches}
|
@@ -55,25 +77,29 @@ object QualityWriter {
|$barcodeLocationStats
|""".stripMargin

writer.println(header)
qualityWriter.println(header)

writer.println(s"Read counts for sample barcodes with associated conditions:")
writer.println(
s"Barcode\tCondition\tMatched (Construct+Sample Barcode)\tMatched Sample Barcode\t% Match\tNormalized Match"
)
colReference.allBarcodes.foreach { colBarcode =>
val data = perBarcodeQualityData(state, rowReference, colReference, colBarcode)
writer.println(data.mkString("\t"))
}
qualityWriter.println(s"Read counts for sample barcodes with associated conditions:")

// use a TeeWriter for the next section of the report
val tw = new TeeWriter(qualityWriter, cbcsWriter)
tw.println(
s"Barcode\tCondition\tMatched (Construct+Sample Barcode)\tMatched Sample Barcode\t% Match\tNormalized Match"
)
colReference.allBarcodes.foreach { colBarcode =>
val data = perBarcodeQualityData(state, rowReference, colReference, colBarcode)
tw.println(data.mkString("\t"))
}

writer.println()
writer.println("Read counts for most common sample barcodes without associated conditions:")
val unepectedBarcodeFrequencies =
state.unknownCol.keys.map(barcode => BarcodeFrequency(barcode, state.unknownCol.count(barcode))).toSeq
topN(unepectedBarcodeFrequencies, 100).foreach { case BarcodeFrequency(barcode, count) =>
writer.println(barcode + "\t" + count.toString)
qualityWriter.println()
qualityWriter.println("Read counts for most common sample barcodes without associated conditions:")
val unepectedBarcodeFrequencies =
state.unknownCol.keys.map(barcode => BarcodeFrequency(barcode, state.unknownCol.count(barcode))).toSeq
topN(unepectedBarcodeFrequencies, 100).foreach { case BarcodeFrequency(barcode, count) =>
qualityWriter.println(barcode + "\t" + count.toString)
}
qualityWriter.println()
}
writer.println()
}

private[this] def decOptFmt(d: Option[Double]): String = d.map(Decimal00Format.format).getOrElse("N/A")
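
The tee approach in the hunk above can be exercised on its own. This is a minimal sketch, not the PR's code: it assumes in-memory `StringWriter` buffers in place of the two output files, a shortened two-column header, and a hypothetical `TeeWriterDemo.render` helper. It shows that only the lines sent through the tee reach the summary output, while the quality output also keeps its surrounding prose lines.

```scala
import java.io.{PrintWriter, StringWriter}

object TeeWriterDemo {

  // Same shape as the TeeWriter in the diff: each call is forwarded to both writers.
  final class TeeWriter(w1: PrintWriter, w2: PrintWriter) {
    def println(s: String): Unit = {
      w1.println(s)
      w2.println(s)
    }
  }

  // Returns (quality report text, summary text) for the given table rows.
  def render(rows: Seq[Seq[String]]): (String, String) = {
    val qualityBuf = new StringWriter
    val summaryBuf = new StringWriter
    val quality = new PrintWriter(qualityBuf)
    val summary = new PrintWriter(summaryBuf)

    // Human-readable section title: quality report only.
    quality.println("Read counts for sample barcodes with associated conditions:")

    // Table header and rows: written to both outputs through the tee.
    val tee = new TeeWriter(quality, summary)
    tee.println("Barcode\tCondition") // shortened header, for illustration only
    rows.foreach(row => tee.println(row.mkString("\t")))

    // Trailing blank line: quality report only.
    quality.println()

    quality.flush()
    summary.flush()
    (qualityBuf.toString, summaryBuf.toString)
  }

}
```

Calling `TeeWriterDemo.render(Seq(Seq("ACGT", "Sample 1")))` yields a summary containing just the header and the one row, which is what keeps that file machine-parseable.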

@@ -8,6 +8,7 @@ package org.broadinstitute.gpp.poolq3.types
trait OutputFileType extends Product with Serializable
case object CountsFileType extends OutputFileType
case object QualityFileType extends OutputFileType
case object ConditionBarcodeCountsSummaryFileType extends OutputFileType
case object LogNormalizedCountsFileType extends OutputFileType
case object BarcodeCountsFileType extends OutputFileType
case object CorrelationFileType extends OutputFileType

@@ -5,4 +5,4 @@
*/
package org.broadinstitute.gpp.poolq3.types

case class PoolQSummary(runSummary: PoolQRunSummary, outputFiles: Set[OutputFileType])
final case class PoolQSummary(runSummary: PoolQRunSummary, outputFiles: Set[OutputFileType])

@@ -23,6 +23,7 @@ class UnlabeledConditionsTest extends CatsEffectSuite with TestResources {
barcodeCountsFile <- tempFile[IO]("barcode-counts", ".txt")
normalizedCountsFile <- tempFile[IO]("normcounts", ".txt")
qualityFile <- tempFile[IO]("quality", ".txt")
conditionBarcodeCountsSummaryFile <- tempFile[IO]("condition-barcode-counts-summary", ".txt")
correlationFile <- tempFile[IO]("correlation", ".txt")
unexpectedSequencesFile <- tempFile[IO]("unexpected", ".txt")
runInfoFile <- tempFile[IO]("runinfo", ".txt")
@@ -32,6 +33,7 @@ class UnlabeledConditionsTest extends CatsEffectSuite with TestResources {
normalizedCountsFile = normalizedCountsFile,
barcodeCountsFile = barcodeCountsFile,
qualityFile = qualityFile,
conditionBarcodeCountsSummaryFile = conditionBarcodeCountsSummaryFile,
correlationFile = correlationFile,
unexpectedSequencesFile = unexpectedSequencesFile,
runInfoFile = runInfoFile