Filtering step runs and stores result #40

kweav · 2024-06-27T19:12:17Z

⚠️ Please look at PR #41 first since this is based on that ⚠️

As of this PR: The filtering steps can now be run all together and filtering stores results of filtering; also includes another df in report about using multiple filters

To make these changes, I've:

(a) I've added a list for gimap_dataset$filtered_data within R/00-setup_data.R so that we store the filtered data, metadata, etc. I'll note that in this section, I've included
-- a boolean since the filtering step is optional so later steps can see if the filtering step was run/should use that data.
-- the transformed log2 CPM values (filtered)
-- the pgRNA IDs metadata (filtered)
-- the pgRNA general metadata (filtered) Note: noticed that this value was NULL, and wondering if this variable isn't needed because it's not really referenced in the code before this, it's NULL in the example gimap_dataset we load from gimap, and the annotation step seems to be grabbing its own metadata?

(b) Fixed a bug in R/01-qc.R where I passed the wrong parameter names to the .Rmd template.

(c) Renamed the output in the list from the qc_filter_plasmid() function to match the other filter function (filter and reportdf) and changed how I call them in R/02-filter.R and the template QC .Rmd

(d) Reverified/added target column selection parameters within the QC .Rmd as necessary

(e) Included finding a consensus filter of the possible filters (that works no matter how many filters the user wants)

(f) Stores the results following the consensus filter in the appropriate gimap_dataset$filtered_data list items (referenced in (a))

(g) Changed how the filter report df's are shown in the QC Template Rmd because wanted to be able to report number that would be filtered if combination of filters were used. So stored the full filtering output for each and called the reportdf immediately to display it, but then combined the potential filters themselves later on.

(h) Included a table with inline code to report number and percent or overlap of pgRNA constructs that would be filtered given various combinations of the filters (to better help the user make informed decisions on which filters they want to use)

(i) finally, added the filter step to the vignette and rendered it locally to make sure that the steps ran without error and I had the expected number of pgRNA constructs left after filtering.

Requested Review:

The requested review for this is:

(1) a verification that it works for you too
(2) is the documentation sufficient outside of the vignette? I know documentation needs to be added to the vignette, but want to wait until we finalize this code/these steps
(3) Does the setup/returned product work with next steps? (e.g., Do we need other data to be filtered/stored (like count_norm) or just log2_cpm? I looked at the old code and it seems like log2_cpm is the main workhorse).
(4) Where/when should I output a report of pgRNA constructs which have a raw count of 0 for all replicates? I'm thinking that's just good information to have whether those constructs are filtered or not, and so I can output that to another file when the bar plot is plotted in the QC template.
(5) Good next step to output a list of the pgRNA IDs that are filtered out? If so, where should that file live?
(6) Any other spots that I haven't notated throughout these stacked PRs that could use checking user input for validation?

Thank you! And please let me know if you have any questions or I can make something clearer. I noticed in a PR that Howard left inline comments explaining changes and that was really helpful for me, so I'll do that too connecting them to my lettered list above

…using multiple filters

kweav · 2024-06-27T19:20:30Z

R/00-setup_data.R

+    filtered_data = list(
+      filter_step_run = FALSE, #adding a way to know if the filter step was run since it's optional
+      metadata_pg_ids = NULL,
+      pg_metadata = NULL,
+      transformed_log2_cpm = NULL
+    ),


This is part of description (a)

kweav · 2024-06-27T19:21:48Z

R/00-setup_data.R

@@ -36,7 +36,12 @@ setup_data <- function(counts = NULL,
      pg_metadata = NULL,


This was still NULL after I ran through setup, qc, and filtering steps. Unnecessary because the info that would normally be associated with it is handled in the annotate function?

Yeah we could probably remove it. I don't think it ends up being used???

kweav · 2024-06-27T19:22:05Z

R/01-qc.R

+      filter_plasmid_target_col = filter_plasmid_target_col,
+      filter_replicates_target_col = filter_replicates_target_col


This corresponds to description (b)

kweav · 2024-06-27T19:22:22Z

R/02-filter.R

-    p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$plasmid_filter
+    p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$filter
  } else if (filter_type == "zero_count_only"){
    zc_filter <- qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)$filter
  } else if(filter_type == "low_plasmid_cpm_only"){
-    p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$plasmid_filter
+    p_filter <- qc_filter_plasmid(gimap_dataset, cutoff = cutoff, filter_plasmid_target_col = filter_plasmid_target_col)$filter


This is part of description (c)

kweav · 2024-06-27T19:22:38Z

R/02-filter.R

-#' @return a named list with the filter `plasmid_filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `plasmid_filter_report` for the number and percent of pgRNA which have a low plasmid log2 CPM
+#' @return a named list with the filter `filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `reportdf` for the number and percent of pgRNA which have a low plasmid log2 CPM


This is part of description (c)

kweav · 2024-06-27T19:22:49Z

R/02-filter.R

-    plasmid_filter = plasmid_cpm_filter,
-    plasmid_filter_report = plasmid_filter_df
+    filter = plasmid_cpm_filter,
+    reportdf = plasmid_filter_df


This is part of description (c)

kweav · 2024-06-27T19:23:03Z

inst/rmd/gimapQCTemplate.Rmd

@@ -65,7 +66,7 @@ qc_variance_hist(gimap_dataset, filter_replicates_target_col = filter_replicates
 ## Plasmid expression (log 2 cpm values for Day 0 sample/time point)

 ```{r}
-qc_plasmid_histogram(gimap_dataset)
+qc_plasmid_histogram(gimap_dataset, filter_plasmid_target_col = filter_plasmid_target_col)


This is part of description (d)

kweav · 2024-06-27T19:24:03Z

R/02-filter.R

@@ -71,10 +75,13 @@ gimap_filter <- function(.data = NULL,
  #then it finds the row sum (how many are filters flagged each construct e.g., number of TRUE in each row), 
  #and finally compares the row sum to the `min_n_filters` parameter to report TRUEs and FALSEs according to whether each construct is flagged by the minimum number of required filters
  #TRUE means it should be filtered, FALSE means it shouldn't be filtered
-  combined_filter <- rowSums(reduce(possible_filters, cbind)) >= min_n_filters 
-
+  combined_filter <- rowSums(reduce(possible_filters, cbind)) >= min_n_filters


This corresponds to description (e)

kweav · 2024-06-27T19:24:14Z

R/02-filter.R

-  gimap_dataset$filtered <- NULL #TODO: Filtered version of the data can be stored here
+  gimap_dataset$filtered_data$filter_step_run <- TRUE #adding a way to know if the filter step was run since it's optional
+  gimap_dataset$filtered_data$metadata_pg_ids <- gimap_dataset$metadata$pg_ids[!combined_filter,]
+  gimap_dataset$filtered_data$pg_metadata <- gimap_dataset$metadata$pg_metadata[!combined_filter,]
+  gimap_dataset$filtered_data$transformed_log2_cpm <- gimap_dataset$transformed_data$log2_cpm[!combined_filter,]


This corresponds to description (f)

kweav · 2024-06-27T19:24:40Z

inst/rmd/gimapQCTemplate.Rmd

-qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)$reportdf
+potentialFilter1 <- qc_filter_zerocounts(gimap_dataset, filter_zerocount_target_col = filter_zerocount_target_col)
+
+potentialFilter1$reportdf


This is part of description (g)

kweav · 2024-06-27T19:24:54Z

inst/rmd/gimapQCTemplate.Rmd

-qc_filter_plasmid(gimap_dataset)$plasmid_filter_report
+potentialFilter2 <- qc_filter_plasmid(gimap_dataset, filter_plasmid_target_col = filter_plasmid_target_col)
+
+potentialFilter2$reportdf


This is part of description (g)

kweav · 2024-06-27T19:25:19Z

inst/rmd/gimapQCTemplate.Rmd

+### If both filters are applied
+
+```{r}
+combined_filters <- reduce(list(potentialFilter1$filter, potentialFilter2$filter), cbind)


This is part of description (g) "but then combined the potential filters themselves later on"

kweav · 2024-06-27T19:25:43Z

inst/rmd/gimapQCTemplate.Rmd

+| Which Filter(s) | Number of pgRNAs flagged for removal | Percent of total pgRNA constructs |
+|:---------------|:------------------------------------|:----------|
+| Zero count, but not low plasmid CPM | `r sum(combined_filters[,1] == TRUE & combined_filters[,2] == FALSE)` | `r round(sum(combined_filters[,1] == TRUE & combined_filters[,2] == FALSE) / nrow(combined_filters) * 100, 2)`|
+| Low plasmid CPM, but not zero count | `r sum(combined_filters[,2] == TRUE & combined_filters[,1] == FALSE)` | `r round(sum(combined_filters[,2] == TRUE & combined_filters[,1] == FALSE) / nrow(combined_filters) * 100, 2)` |
+| Either Zero count or Low plasmid CPM or both | `r sum(rowSums(combined_filters) >= 1)`| `r round(sum(rowSums(combined_filters) >= 1) / nrow(combined_filters) * 100, 2)`  |
+| Both Zero count and Low plasmid CPM | `r sum(rowSums(combined_filters) == 2)` | `r round(sum(rowSums(combined_filters) == 2) / nrow(combined_filters) * 100, 2)` |
+| Remaining pgRNAs flagged by no filters | `r sum(rowSums(combined_filters) == 0)` |  `r round(sum(rowSums(combined_filters) == 0) / nrow(combined_filters) * 100, 2)` |


This corresponds to description (h)

kweav · 2024-06-27T19:27:19Z

vignettes/getting_started.Rmd

+```{r}
+gimap_dataset <- gimap_dataset %>% 
+  gimap_filter()
+
+nrow(gimap_dataset$filtered_data$transformed_log2_cpm)
+
+```
+


This corresponds to description (i)

cansavvy

I think this all is sensible. I'll just need to update #26 to use the filtered data

cansavvy · 2024-07-03T18:14:25Z

inst/rmd/gimapQCTemplate.Rmd

+```
+
+| Which Filter(s) | Number of pgRNAs flagged for removal | Percent of total pgRNA constructs |
+|:---------------|:------------------------------------|:----------|


Love this table.

cansavvy · 2024-07-03T18:15:45Z

R/02-filter.R


-  gimap_dataset$filtered <- NULL #TODO: Filtered version of the data can be stored here
+  gimap_dataset$filtered_data$filter_step_run <- TRUE #adding a way to know if the filter step was run since it's optional
+  gimap_dataset$filtered_data$metadata_pg_ids <- gimap_dataset$metadata$pg_ids[!combined_filter,]


Reminder to myself, I'm going to have to go change the next steps of this code (the normalization and calc_crispr steps to use the filtered data instead.)

filter runs and stores things and include another df in report about …

d3277e2

…using multiple filters

kweav commented Jun 27, 2024

View reviewed changes

kweav mentioned this pull request Jun 27, 2024

Beginning of making filtering steps all work together #41

Merged

kweav requested a review from cansavvy June 27, 2024 19:41

Base automatically changed from filter_ki to main July 3, 2024 18:06

Merge branch 'main' into filter_ki2

984f087

cansavvy approved these changes Jul 3, 2024

View reviewed changes

cansavvy merged commit aec7b20 into main Jul 3, 2024
6 of 7 checks passed

cansavvy deleted the filter_ki2 branch July 3, 2024 18:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filtering step runs and stores result #40

Filtering step runs and stores result #40

kweav commented Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024

kweav Jun 27, 2024

cansavvy Jul 1, 2024

kweav Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024

kweav Jun 27, 2024

kweav Jun 27, 2024

kweav Jun 27, 2024

kweav Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024

kweav Jun 27, 2024

kweav Jun 27, 2024

kweav Jun 27, 2024

kweav Jun 27, 2024

cansavvy left a comment

cansavvy Jul 3, 2024

cansavvy Jul 3, 2024

		@@ -36,7 +36,12 @@ setup_data <- function(counts = NULL,
		pg_metadata = NULL,

		filter_plasmid_target_col = filter_plasmid_target_col,
		filter_replicates_target_col = filter_replicates_target_col

		#' @return a named list with the filter `plasmid_filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `plasmid_filter_report` for the number and percent of pgRNA which have a low plasmid log2 CPM
		#' @return a named list with the filter `filter` specifying which pgRNAs have low plasmid log2 CPM (column of interest is `plasmid_cpm_filter`) and a report df `reportdf` for the number and percent of pgRNA which have a low plasmid log2 CPM

Filtering step runs and stores result #40

Filtering step runs and stores result #40

Conversation

kweav commented Jun 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kweav Jun 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kweav Jun 27, 2024 • edited Loading

Choose a reason for hiding this comment

kweav Jun 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cansavvy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kweav commented Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024 •

edited

Loading

kweav Jun 27, 2024 •

edited

Loading