The de novo method is developed within a causal inference framework, in the context of matched observational studies. The denovo R package implements a novel statistical method that discovers subgroups in which the causal effect of the variable of interest (e.g., the effect of air pollution on mortality) differs statistically significantly from the population average.
The method relies on splitting the sample into two sub-samples. In the first sub-sample, we let the data discover "promising" subgroups whose air pollution effects differ from the population mean. In this step, machine learning approaches (e.g., classification and regression trees (CART) and Causal Tree) are used to discover the promising subgroups. In the second sub-sample, we use randomization-based hypothesis tests to confirm whether there is evidence that the exposure effects for the newly discovered subgroups are statistically significantly different from the population average causal effect.
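The discovery step can be illustrated with a minimal sketch. It is not the denovo implementation: it assumes synthetic matched pairs, hypothetical covariates (age, low_income), a hypothetical effect that is larger for older pairs, and it uses the rpart package to fit a regression tree to the treated-minus-control differences in the first sub-sample.

```r
# Illustrative only: synthetic matched pairs, not the package's data or API.
library(rpart)

set.seed(1)
n_pairs <- 400
covars <- data.frame(
  age = sample(65:90, n_pairs, replace = TRUE),   # hypothetical covariate
  low_income = rbinom(n_pairs, 1, 0.4)            # hypothetical covariate
)

# Hypothetical effect modification: larger effect for pairs aged 80 and over.
effect <- ifelse(covars$age >= 80, 2, 0.5)
cr <- rnorm(n_pairs)                  # control outcomes
tr <- cr + effect + rnorm(n_pairs)    # treated outcomes

# Randomly split the pairs into a discovery half and a confirmation half.
idx <- sample(n_pairs, n_pairs / 2)
first <- data.frame(diff = (tr - cr)[idx], covars[idx, ])
second <- data.frame(diff = (tr - cr)[-idx], covars[-idx, ])

# Fit a shallow regression tree to the pair differences in the first half.
fit <- rpart(diff ~ age + low_income, data = first, method = "anova",
             control = rpart.control(maxdepth = 2, minbucket = 20))

# Each leaf is a candidate subgroup; compare leaf means with the overall mean.
leaf_means <- tapply(first$diff, fit$where, mean)
print(leaf_means - mean(first$diff))
```

The second half (second) is left untouched here, so that any subgroup suggested by the tree can be tested on data that played no role in its discovery.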
Use the following instructions to install the denovo package from source:
install.packages("devtools")
library(devtools)
install_github("fasrc_denovo/master")
Two R packages (causalTree and Gurobi) cannot be installed from CRAN, so users need to install them manually. To install the causalTree package, use the following instructions (see the causalTree repository for more details):
install.packages("devtools")
library(devtools)
install_github("susanathey/causalTree")
The denovo package uses the Gurobi optimizer in sensitivity analyses. For academic use, you can download and install it from the Gurobi website. For the R wrapper, please see the Gurobi installation instructions.
The denovo functions can be used for both binary and continuous outcomes. The discover_subgroups function takes the first sub-sample and generates a classification and regression tree.
discovered_tree <- discover_subgroups(tr_1, cr_1, covars_1)
In this function, tr is the vector of treated outcomes, cr is the vector of control outcomes, and covars is a data.frame of covariates. The output is the discovered tree, or classification and regression prediction model. The estimate_subgroups_sig function receives the second sub-sample as well as the prediction model, and estimates the significance of each subgroup.
analysis <- estimate_subgroups_sig(tr_2, cr_2, covars_sig_2,
tree = discovered_tree$tree,
significance = total_significance,
gamma = gamma)
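To give a sense of what a randomization-based test on matched pairs looks like, here is a simplified sketch. It is not the procedure implemented in estimate_subgroups_sig and it ignores the sensitivity parameter gamma. The function name sign_flip_test and its arguments are hypothetical. It tests the sharp null hypothesis that every pair in a subgroup has effect tau0 (here set to the population average, treated as fixed) by randomly flipping the signs of the centered pair differences.

```r
# Simplified sign-flip randomization test for matched-pair differences.
# diffs:  treated-minus-control differences for one subgroup
# tau0:   hypothesized effect under the sharp null (assumed fixed)
# n_perm: number of random sign assignments
sign_flip_test <- function(diffs, tau0 = 0, n_perm = 5000) {
  centered <- diffs - tau0
  observed <- abs(mean(centered))
  perm_stats <- replicate(n_perm, {
    # Under the null, swapping treatment within a pair flips the sign.
    signs <- sample(c(-1, 1), length(centered), replace = TRUE)
    abs(mean(signs * centered))
  })
  # Add-one correction keeps the p-value strictly positive.
  (1 + sum(perm_stats >= observed)) / (1 + n_perm)
}

set.seed(2)
pop_diffs <- rnorm(300, mean = 0.5)   # synthetic population differences
sub_diffs <- rnorm(80, mean = 2)      # synthetic subgroup differences
p_value <- sign_flip_test(sub_diffs, tau0 = mean(pop_diffs))
print(p_value)
```

A small p-value suggests that the subgroup effect differs from the hypothesized population average. When several subgroups are tested, the overall significance level has to be shared across them.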
The estimate_exposure_eff function uses the functions above to discover effect modification under the assumption of no unmeasured confounders. See the following section for more details.
We provide analyses on synthetic data to address the following research question: which population subgroups have causal effects of air pollution on mortality that are statistically significantly different from the population average? The actual study was conducted on Medicare data; however, those data are not publicly available, so we repeat the process with synthetic data. These analyses are further discussed in Lee et al. (2021). Please refer to the following references for more details.
- Lee, K., Small, D.S. and Dominici, F., 2021. Discovering Heterogeneous Exposure Effects Using Randomization Inference in Air Pollution Studies. Journal of the American Statistical Association, pp.1-12.
- Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J., 1984. Classification and Regression Trees. New York: Chapman & Hall/CRC.
- Athey, S. and Imbens, G., 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), pp.7353-7360.