Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filtering step runs and stores result #40

Merged
merged 2 commits into from
Jul 3, 2024
Merged

Filtering step runs and stores result #40

merged 2 commits into from
Jul 3, 2024

Conversation

kweav
Copy link
Collaborator

@kweav kweav commented Jun 27, 2024

⚠️ Please look at PR #41 first since this is based on that ⚠️

As of this PR: The filtering steps can now be run all together and filtering stores results of filtering; also includes another df in report about using multiple filters

To make these changes, I've:

(a) I've added a list for gimap_dataset$filtered_data within R/00-setup_data.R so that we store the filtered data, metadata, etc. I'll note that in this section, I've included
-- a boolean since the filtering step is optional so later steps can see if the filtering step was run/should use that data.
-- the transformed log2 CPM values (filtered)
-- the pgRNA IDs metadata (filtered)
-- the pgRNA general metadata (filtered) Note: noticed that this value was NULL, and wondering if this variable isn't needed because it's not really referenced in the code before this, it's NULL in the example gimap_dataset we load from gimap, and the annotation step seems to be grabbing its own metadata?

(b) Fixed a bug in R/01-qc.R where I passed the wrong parameter names to the .Rmd template.

(c) Renamed the output in the list from the qc_filter_plasmid() function to match the other filter function (filter and reportdf) and changed how I call them in R/02-filter.R and the template QC .Rmd

(d) Reverified/added target column selection parameters within the QC .Rmd as necessary

(e) Included finding a consensus filter of the possible filters (that works no matter how many filters the user wants)

(f) Stores the results following the consensus filter in the appropriate gimap_dataset$filtered_data list items (referenced in (a))

(g) Changed how the filter report df's are shown in the QC Template Rmd because wanted to be able to report number that would be filtered if combination of filters were used. So stored the full filtering output for each and called the reportdf immediately to display it, but then combined the potential filters themselves later on.

(h) Included a table with inline code to report number and percent or overlap of pgRNA constructs that would be filtered given various combinations of the filters (to better help the user make informed decisions on which filters they want to use)

(i) finally, added the filter step to the vignette and rendered it locally to make sure that the steps ran without error and I had the expected number of pgRNA constructs left after filtering.

Requested Review:

The requested review for this is:

(1) a verification that it works for you too
(2) is the documentation sufficient outside of the vignette? I know documentation needs to be added to the vignette, but want to wait until we finalize this code/these steps
(3) Does the setup/returned product work with next steps? (e.g., Do we need other data to be filtered/stored (like count_norm) or just log2_cpm? I looked at the old code and it seems like log2_cpm is the main workhorse).
(4) Where/when should I output a report of pgRNA constructs which have a raw count of 0 for all replicates? I'm thinking that's just good information to have whether those constructs are filtered or not, and so I can output that to another file when the bar plot is plotted in the QC template.
(5) Good next step to output a list of the pgRNA IDs that are filtered out? If so, where should that file live?
(6) Any other spots that I haven't notated throughout these stacked PRs that could use checking user input for validation?

Thank you! And please let me know if you have any questions or I can make something clearer. I noticed in a PR that Howard left inline comments explaining changes and that was really helpful for me, so I'll do that too connecting them to my lettered list above

Comment on lines +39 to +44
filtered_data = list(
filter_step_run = FALSE, #adding a way to know if the filter step was run since it's optional
metadata_pg_ids = NULL,
pg_metadata = NULL,
transformed_log2_cpm = NULL
),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (a)

@@ -36,7 +36,12 @@ setup_data <- function(counts = NULL,
pg_metadata = NULL,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was still NULL after I ran through setup, qc, and filtering steps. Unnecessary because the info that would normally be associated with it is handled in the annotate function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we could probably remove it. I don't think it ends up being used???

Comment on lines +67 to +68
filter_plasmid_target_col = filter_plasmid_target_col,
filter_replicates_target_col = filter_replicates_target_col
Copy link
Collaborator Author

@kweav kweav Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corresponds to description (b)

Comment on lines -59 to +67
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$plasmid_filter
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$filter
} else if (filter_type == "zero_count_only"){
zc_filter <- qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)$filter
} else if(filter_type == "low_plasmid_cpm_only"){
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$plasmid_filter
p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$filter
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (c)

Comment on lines -126 to +134
#' @return a named list with the filter `plasmid_filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `plasmid_filter_report` for the number and percent of pgRNA which have a low plasmid log2 CPM
#' @return a named list with the filter `filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `reportdf` for the number and percent of pgRNA which have a low plasmid log2 CPM
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (c)

Comment on lines -187 to +196
plasmid_filter = plasmid_cpm_filter,
plasmid_filter_report = plasmid_filter_df
filter = plasmid_cpm_filter,
reportdf = plasmid_filter_df
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (c)

@@ -65,7 +66,7 @@ qc_variance_hist(gimap_dataset, filter_replicates_target_col = filter_replicates
## Plasmid expression (log 2 cpm values for Day 0 sample/time point)

```{r}
qc_plasmid_histogram(gimap_dataset)
qc_plasmid_histogram(gimap_dataset, filter_plasmid_target_col = filter_plasmid_target_col)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (d)

@@ -71,10 +75,13 @@ gimap_filter <- function(.data = NULL,
#then it finds the row sum (how many are filters flagged each construct e.g., number of TRUE in each row),
#and finally compares the row sum to the `min_n_filters` parameter to report TRUEs and FALSEs according to whether each construct is flagged by the minimum number of required filters
#TRUE means it should be filtered, FALSE means it shouldn't be filtered
combined_filter <- rowSums(reduce(possible_filters, cbind)) >= min_n_filters

combined_filter <- rowSums(reduce(possible_filters, cbind)) >= min_n_filters
Copy link
Collaborator Author

@kweav kweav Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corresponds to description (e)

Comment on lines -77 to +84
gimap_dataset$filtered <- NULL #TODO: Filtered version of the data can be stored here
gimap_dataset$filtered_data$filter_step_run <- TRUE #adding a way to know if the filter step was run since it's optional
gimap_dataset$filtered_data$metadata_pg_ids <- gimap_dataset$metadata$pg_ids[!combined_filter,]
gimap_dataset$filtered_data$pg_metadata <- gimap_dataset$metadata$pg_metadata[!combined_filter,]
gimap_dataset$filtered_data$transformed_log2_cpm <- gimap_dataset$transformed_data$log2_cpm[!combined_filter,]
Copy link
Collaborator Author

@kweav kweav Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corresponds to description (f)

Comment on lines -84 to +87
qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)$reportdf
potentialFilter1 <- qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)

potentialFilter1$reportdf
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (g)

Comment on lines 92 to 97
qc_filter_plasmid(gimap_dataset)$plasmid_filter_report
potentialFilter2 <- qc_filter_plasmid(gimap_dataset, filter_plasmid_target_col = filter_plasmid_target_col)

potentialFilter2$reportdf
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (g)

### If both filters are applied

```{r}
combined_filters <- reduce(list(potentialFilter1$filter, potentialFilter2$filter), cbind)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of description (g) "but then combined the potential filters themselves later on"

Comment on lines +106 to +112
| Which Filter(s) | Number of pgRNAs flagged for removal | Percent of total pgRNA constructs |
|:---------------|:------------------------------------|:----------|
| Zero count, but not low plasmid CPM | `r sum(combined_filters[,1] == TRUE & combined_filters[,2] == FALSE)` | `r round(sum(combined_filters[,1] == TRUE & combined_filters[,2] == FALSE) / nrow(combined_filters) * 100, 2)`|
| Low plasmid CPM, but not zero count | `r sum(combined_filters[,2] == TRUE & combined_filters[,1] == FALSE)` | `r round(sum(combined_filters[,2] == TRUE & combined_filters[,1] == FALSE) / nrow(combined_filters) * 100, 2)` |
| Either Zero count or Low plasmid CPM or both | `r sum(rowSums(combined_filters) >= 1)`| `r round(sum(rowSums(combined_filters) >= 1) / nrow(combined_filters) * 100, 2)` |
| Both Zero count and Low plasmid CPM | `r sum(rowSums(combined_filters) == 2)` | `r round(sum(rowSums(combined_filters) == 2) / nrow(combined_filters) * 100, 2)` |
| Remaining pgRNAs flagged by no filters | `r sum(rowSums(combined_filters) == 0)` | `r round(sum(rowSums(combined_filters) == 0) / nrow(combined_filters) * 100, 2)` |
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corresponds to description (h)

Comment on lines +161 to +168
```{r}
gimap_dataset <- gimap_dataset %>%
gimap_filter()

nrow(gimap_dataset$filtered_data$transformed_log2_cpm)

```

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corresponds to description (i)

@kweav kweav requested a review from cansavvy June 27, 2024 19:41
Base automatically changed from filter_ki to main July 3, 2024 18:06
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this all is sensible. I'll just need to update #26 to use the filtered data

```

| Which Filter(s) | Number of pgRNAs flagged for removal | Percent of total pgRNA constructs |
|:---------------|:------------------------------------|:----------|
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this table.


gimap_dataset$filtered <- NULL #TODO: Filtered version of the data can be stored here
gimap_dataset$filtered_data$filter_step_run <- TRUE #adding a way to know if the filter step was run since it's optional
gimap_dataset$filtered_data$metadata_pg_ids <- gimap_dataset$metadata$pg_ids[!combined_filter,]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to myself, I'm going to have to go change the next steps of this code (the normalization and calc_crispr steps to use the filtered data instead.)

@cansavvy cansavvy merged commit aec7b20 into main Jul 3, 2024
6 of 7 checks passed
@cansavvy cansavvy deleted the filter_ki2 branch July 3, 2024 18:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants