-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filtering step runs and stores result #40
Conversation
…using multiple filters
filtered_data = list( | ||
filter_step_run = FALSE, #adding a way to know if the filter step was run since it's optional | ||
metadata_pg_ids = NULL, | ||
pg_metadata = NULL, | ||
transformed_log2_cpm = NULL | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (a)
@@ -36,7 +36,12 @@ setup_data <- function(counts = NULL, | |||
pg_metadata = NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was still NULL after I ran through setup, qc, and filtering steps. Unnecessary because the info that would normally be associated with it is handled in the annotate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah we could probably remove it. I don't think it ends up being used???
filter_plasmid_target_col = filter_plasmid_target_col, | ||
filter_replicates_target_col = filter_replicates_target_col |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corresponds to description (b)
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$plasmid_filter | ||
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$filter | ||
} else if (filter_type == "zero_count_only"){ | ||
zc_filter <- qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)$filter | ||
} else if(filter_type == "low_plasmid_cpm_only"){ | ||
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$plasmid_filter | ||
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (c)
#' @return a named list with the filter `plasmid_filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `plasmid_filter_report` for the number and percent of pgRNA which have a low plasmid log2 CPM | ||
#' @return a named list with the filter `filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `reportdf` for the number and percent of pgRNA which have a low plasmid log2 CPM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (c)
plasmid_filter = plasmid_cpm_filter, | ||
plasmid_filter_report = plasmid_filter_df | ||
filter = plasmid_cpm_filter, | ||
reportdf = plasmid_filter_df |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (c)
inst/rmd/gimapQCTemplate.Rmd
Outdated
@@ -65,7 +66,7 @@ qc_variance_hist(gimap_dataset, filter_replicates_target_col = filter_replicates | |||
## Plasmid expression (log 2 cpm values for Day 0 sample/time point) | |||
|
|||
```{r} | |||
qc_plasmid_histogram(gimap_dataset) | |||
qc_plasmid_histogram(gimap_dataset, filter_plasmid_target_col = filter_plasmid_target_col) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (d)
@@ -71,10 +75,13 @@ gimap_filter <- function(.data = NULL, | |||
#then it finds the row sum (how many are filters flagged each construct e.g., number of TRUE in each row), | |||
#and finally compares the row sum to the `min_n_filters` parameter to report TRUEs and FALSEs according to whether each construct is flagged by the minimum number of required filters | |||
#TRUE means it should be filtered, FALSE means it shouldn't be filtered | |||
combined_filter <- rowSums(reduce(possible_filters, cbind)) >= min_n_filters | |||
|
|||
combined_filter <- rowSums(reduce(possible_filters, cbind)) >= min_n_filters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corresponds to description (e)
gimap_dataset$filtered <- NULL #TODO: Filtered version of the data can be stored here | ||
gimap_dataset$filtered_data$filter_step_run <- TRUE #adding a way to know if the filter step was run since it's optional | ||
gimap_dataset$filtered_data$metadata_pg_ids <- gimap_dataset$metadata$pg_ids[!combined_filter,] | ||
gimap_dataset$filtered_data$pg_metadata <- gimap_dataset$metadata$pg_metadata[!combined_filter,] | ||
gimap_dataset$filtered_data$transformed_log2_cpm <- gimap_dataset$transformed_data$log2_cpm[!combined_filter,] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corresponds to description (f)
qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)$reportdf | ||
potentialFilter1 <- qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col) | ||
|
||
potentialFilter1$reportdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (g)
inst/rmd/gimapQCTemplate.Rmd
Outdated
qc_filter_plasmid(gimap_dataset)$plasmid_filter_report | ||
potentialFilter2 <- qc_filter_plasmid(gimap_dataset, filter_plasmid_target_col = filter_plasmid_target_col) | ||
|
||
potentialFilter2$reportdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (g)
### If both filters are applied | ||
|
||
```{r} | ||
combined_filters <- reduce(list(potentialFilter1$filter, potentialFilter2$filter), cbind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is part of description (g) "but then combined the potential filters themselves later on"
| Which Filter(s) | Number of pgRNAs flagged for removal | Percent of total pgRNA constructs | | ||
|:---------------|:------------------------------------|:----------| | ||
| Zero count, but not low plasmid CPM | `r sum(combined_filters[,1] == TRUE & combined_filters[,2] == FALSE)` | `r round(sum(combined_filters[,1] == TRUE & combined_filters[,2] == FALSE) / nrow(combined_filters) * 100, 2)`| | ||
| Low plasmid CPM, but not zero count | `r sum(combined_filters[,2] == TRUE & combined_filters[,1] == FALSE)` | `r round(sum(combined_filters[,2] == TRUE & combined_filters[,1] == FALSE) / nrow(combined_filters) * 100, 2)` | | ||
| Either Zero count or Low plasmid CPM or both | `r sum(rowSums(combined_filters) >= 1)`| `r round(sum(rowSums(combined_filters) >= 1) / nrow(combined_filters) * 100, 2)` | | ||
| Both Zero count and Low plasmid CPM | `r sum(rowSums(combined_filters) == 2)` | `r round(sum(rowSums(combined_filters) == 2) / nrow(combined_filters) * 100, 2)` | | ||
| Remaining pgRNAs flagged by no filters | `r sum(rowSums(combined_filters) == 0)` | `r round(sum(rowSums(combined_filters) == 0) / nrow(combined_filters) * 100, 2)` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corresponds to description (h)
```{r} | ||
gimap_dataset <- gimap_dataset %>% | ||
gimap_filter() | ||
|
||
nrow(gimap_dataset$filtered_data$transformed_log2_cpm) | ||
|
||
``` | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corresponds to description (i)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this all is sensible. I'll just need to update #26 to use the filtered data
``` | ||
|
||
| Which Filter(s) | Number of pgRNAs flagged for removal | Percent of total pgRNA constructs | | ||
|:---------------|:------------------------------------|:----------| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love this table.
|
||
gimap_dataset$filtered <- NULL #TODO: Filtered version of the data can be stored here | ||
gimap_dataset$filtered_data$filter_step_run <- TRUE #adding a way to know if the filter step was run since it's optional | ||
gimap_dataset$filtered_data$metadata_pg_ids <- gimap_dataset$metadata$pg_ids[!combined_filter,] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder to myself, I'm going to have to go change the next steps of this code (the normalization and calc_crispr steps to use the filtered data instead.)
As of this PR: The filtering steps can now be run all together and filtering stores results of filtering; also includes another df in report about using multiple filters
To make these changes, I've:
(a) I've added a list for
gimap_dataset$filtered_data
withinR/00-setup_data.R
so that we store the filtered data, metadata, etc. I'll note that in this section, I've included-- a boolean since the filtering step is optional so later steps can see if the filtering step was run/should use that data.
-- the transformed log2 CPM values (filtered)
-- the pgRNA IDs metadata (filtered)
-- the pgRNA general metadata (filtered) Note: noticed that this value was NULL, and wondering if this variable isn't needed because it's not really referenced in the code before this, it's NULL in the example gimap_dataset we load from gimap, and the annotation step seems to be grabbing its own metadata?
(b) Fixed a bug in
R/01-qc.R
where I passed the wrong parameter names to the.Rmd
template.(c) Renamed the output in the list from the
qc_filter_plasmid()
function to match the other filter function (filter
andreportdf
) and changed how I call them inR/02-filter.R
and the template QC.Rmd
(d) Reverified/added target column selection parameters within the QC
.Rmd
as necessary(e) Included finding a consensus filter of the possible filters (that works no matter how many filters the user wants)
(f) Stores the results following the consensus filter in the appropriate
gimap_dataset$filtered_data
list items (referenced in (a))(g) Changed how the filter report df's are shown in the QC Template Rmd because wanted to be able to report number that would be filtered if combination of filters were used. So stored the full filtering output for each and called the reportdf immediately to display it, but then combined the potential filters themselves later on.
(h) Included a table with inline code to report number and percent or overlap of pgRNA constructs that would be filtered given various combinations of the filters (to better help the user make informed decisions on which filters they want to use)
(i) finally, added the filter step to the vignette and rendered it locally to make sure that the steps ran without error and I had the expected number of pgRNA constructs left after filtering.
Requested Review:
The requested review for this is:
(1) a verification that it works for you too
(2) is the documentation sufficient outside of the vignette? I know documentation needs to be added to the vignette, but want to wait until we finalize this code/these steps
(3) Does the setup/returned product work with next steps? (e.g., Do we need other data to be filtered/stored (like count_norm) or just log2_cpm? I looked at the old code and it seems like log2_cpm is the main workhorse).
(4) Where/when should I output a report of pgRNA constructs which have a raw count of 0 for all replicates? I'm thinking that's just good information to have whether those constructs are filtered or not, and so I can output that to another file when the bar plot is plotted in the QC template.
(5) Good next step to output a list of the pgRNA IDs that are filtered out? If so, where should that file live?
(6) Any other spots that I haven't notated throughout these stacked PRs that could use checking user input for validation?
Thank you! And please let me know if you have any questions or I can make something clearer. I noticed in a PR that Howard left inline comments explaining changes and that was really helpful for me, so I'll do that too connecting them to my lettered list above