---
title: "Using `cobalt` with Other Preprocessing Packages"
author: "Noah Greifer"
date: "`r Sys.Date()`"
output: 
  html_vignette:
    df_print: kable
    toc: true
vignette: >
  %\VignetteIndexEntry{Using `cobalt` with Other Preprocessing Packages}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
link-citations: true
---

```{r, include = FALSE}
knitr::opts_chunk$set(message = FALSE)
library("cobalt")
```

This is an appendix to the main vignette, "Covariate Balance Tables and Plots: A Guide to the cobalt Package". It contains descriptions and demonstrations of several utility functions in `cobalt` and the use of `bal.tab()` with `twang`, `Matching`, `optmatch`, `CBPS`, `ebal`, `designmatch`, `sbw`, `MatchThem`, and `cem`. Note that `MatchIt` can perform most of the functions that `Matching`, `optmatch`, and `cem` can, and `WeightIt` can perform most of the functions that `twang`, `CBPS`, `ebal`, and `sbw` can. Because `cobalt` has been optimized to work with `MatchIt` and `WeightIt`, it is recommended to use those packages to simplify preprocessing and balance assessment, but we recognize users may prefer to use the packages described in this vignette.

## Utilities

In addition to its main balance assessment functions, `cobalt` contains several utility functions. These are meant to reduce the typing and programming burden that often accompany the use of R with a diverse set of packages.

### `splitfactor()` and `unsplitfactor()`

Some functions (outside of `cobalt`) are not friendly to factor or character variables and require numeric variables to operate correctly. For example, some regression-style functions, such as `ebalance()` in `ebal`, can only take in full-rank numeric matrices. Other functions will process factor variables but will return output in terms of dummy-coded versions of the factors. For example, `lm()` will create dummy variables out of a factor and drop the reference category when estimating regression coefficients.

To prepare data sets for use in functions that do not allow factors or to mimic the output of functions that split factor variables, users can use `splitfactor()`, which takes in a data set and the names of variables to split, and outputs a new data set with newly created dummy variables. Below is an example splitting the `race` variable in the Lalonde data set into dummies, eliminating the reference category (`"black"`):

```{r}
head(lalonde)

lalonde.split <- splitfactor(lalonde, "race")
head(lalonde.split)
```

It is possible to undo the action of `splitfactor()` with `unsplitfactor()`, which takes in a data set with dummy variables formed from `splitfactor()` or otherwise and recreates the original factor variable. If the reference category was dropped, its value needs to be supplied.

```{r}
lalonde.unsplit <- unsplitfactor(lalonde.split, "race",
                                 dropped.level = "black")
head(lalonde.unsplit)
```

Notice that the original data set and the unsplit data set look identical. If the input to `unsplitfactor()` is the output of a call to `splitfactor()` (as it was here), you don't need to tell `unsplitfactor()` the name of the split variable or the value of the dropped level; they were specified here for illustration purposes.

### `get.w()`

`get.w()` allows users to extract weights from the output of a call to a preprocessing function in one of the supported packages. Because each package stores weights in different ways, it can be helpful to have a single function that applies equally to all outputs. `twang` has a function called `get.weights()` that performs the same function with slightly finer control for the output of a call to `ps()`.
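For example, below is a minimal sketch using `get.w()` on the output of `weightit()` from `WeightIt` (assuming `WeightIt` is installed); the same call works on output objects from the other supported packages, such as the `ps` and `CBPS` objects created later in this vignette.

```{r, eval = requireNamespace("WeightIt", quietly = TRUE)}
#A sketch of get.w(): extract the estimated weights from a weightit object
data("lalonde", package = "cobalt") #If not yet loaded
W <- WeightIt::weightit(treat ~ age + educ + race + re74 + re75,
                        data = lalonde, method = "glm", estimand = "ATT")

w <- get.w(W)
str(w) #a numeric vector with one weight per unit
```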
## `bal.tab()`

The next sections describe the use of `bal.tab()` with packages other than those described in the main vignette. Even if you are using `bal.tab()` with one of these packages, it may be useful to read the main vignette to understand `bal.tab()`'s main options, which are not detailed here.

### Using `bal.tab()` with `twang`

Generalized boosted modeling (GBM), as implemented in `twang`, can be an effective way to generate propensity scores and weights for use in propensity score weighting. `bal.tab()` functions similarly to `twang`'s `bal.table()` and `summary()` functions when used with GBM. Below is a simple example of its use:

```{r, warning = FALSE, eval = requireNamespace("twang", quietly = TRUE)}
#GBM PS weighting for the ATT
data("lalonde", package = "cobalt") #If not yet loaded
covs0 <- subset(lalonde, select = -c(treat, re78))

f <- reformulate(names(covs0), "treat")

ps.out <- twang::ps(f, data = lalonde,
                    stop.method = c("es.mean", "es.max"),
                    estimand = "ATT", n.trees = 1000,
                    verbose = FALSE)

bal.tab(ps.out, stop.method = "es.mean")
```

The output looks a bit different from `twang`'s `bal.table()` output. First is the original call to `ps()`. Next is the balance table containing mean differences for the covariates included in the input to `ps()`. Last is a table displaying sample size information, similar to what would be generated using `twang`'s `summary()` function. The "effective" sample size is displayed when weighting is used; it is calculated as is done in `twang`. See the `twang` documentation, `?bal.tab`, or "Details on Calculations" in the main vignette for details on this calculation.

When using `bal.tab()` with `twang`, the user must specify the `ps` object, the output of a call to `ps()`, as the first argument. The second argument, `stop.method`, is the name of the stop method(s) for which balance is to be assessed, since a `ps` object may contain results for more than one if so specified. `bal.tab()` can display balance for more than one stop method at a time by specifying a vector of stop method names (see the short example at the end of this section). If this argument is left empty or if the argument to `stop.method` does not correspond to any of the stop methods in the `ps` object, `bal.tab()` will default to displaying balance for all available stop methods. Abbreviations are allowed for the stop method names, which are not case sensitive.

The other arguments to `bal.tab()` when using it with `twang` have the same form and function as those given when using it without a conditioning package, except for `s.d.denom`. If the estimand of the stop method used is the ATT, `s.d.denom` will default to `"treated"` if not specified, and if the estimand is the ATE, `s.d.denom` will default to `"pooled"`, mimicking the behavior of `twang`. The user can specify their own argument to `s.d.denom`, but using the defaults is advisable. If sampling weights are used in the call to `ps()`, they will be automatically incorporated into the `bal.tab()` calculations for both the adjusted and unadjusted samples, just as `twang` does.

`mnps` objects resulting from fitting models in `twang` with multi-category treatments are also compatible with `cobalt`. See the section "Using `cobalt` with multi-category treatments" in the main vignette. `iptw` objects resulting from fitting models in `twang` with longitudinal treatments are also compatible with `cobalt`. See the Appendix 3 vignette. `ps.cont` objects resulting from using `ps.cont()` in `twangContinuous`, which implements GBM for continuous treatments, are also compatible. See the section "Using `cobalt` with continuous treatments" in the main vignette.
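As noted above, balance can be displayed for several stop methods at once by supplying a vector of their names:

```{r, eval = FALSE}
#Balance for both stop methods estimated in the call to ps() above
bal.tab(ps.out, stop.method = c("es.mean", "es.max"))
```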
### Using `bal.tab()` with `Matching`

The `Matching` package is used for propensity score matching, and it was also the first package to implement genetic matching. `MatchIt` calls `Matching` to perform genetic matching and can accomplish many of the matching methods `Matching` can, but `Matching` is still a widely used package with its own strengths. `bal.tab()` functions similarly to `Matching`'s `MatchBalance()` command, which yields a thorough presentation of balance. Below is a simple example of the use of `bal.tab()` with `Matching`:

```{r, eval = requireNamespace("Matching", quietly = TRUE)}
#1:1 NN PS matching w/ replacement
data("lalonde", package = "cobalt") #If not yet loaded
covs0 <- subset(lalonde, select = -c(treat, re78))
f <- reformulate(names(covs0), "treat")

fit <- glm(f, data = lalonde, family = binomial)
p.score <- fit$fitted.values
match.out <- Matching::Match(Tr = lalonde$treat, X = p.score,
                             estimand = "ATT")

bal.tab(match.out, formula = f, data = lalonde,
        distance = ~ p.score)
```

The output looks quite different from `Matching`'s `MatchBalance()` output. Rather than being stacked vertically, balance statistics are arranged horizontally in a table format, allowing for quick balance checking. Below the balance table is a summary of the sample size before and after matching, similar to what `Matching`'s `summary()` command would display. The sample size can include an "ESS" and an "unweighted" value; the "ESS" value is the effective sample size resulting from the matching weights, while the "unweighted" value is the count of units with nonzero matching weights.

The input to `bal.tab()` is similar to that given to `MatchBalance()`: the `Match` object resulting from the call to `Match()`, a formula relating treatment to the covariates for which balance is to be assessed, and the original data set. This is not the only way to call `bal.tab()`: instead of a formula and a data set, one can also input a data frame of covariates and a vector of treatment status indicators, just as when using `bal.tab()` without a conditioning package. For example, the code below will yield the same results as the call to `bal.tab()` above:

```{r, eval = FALSE}
bal.tab(match.out, treat = lalonde$treat, covs = covs0,
        distance = ~ p.score)
```

The other arguments to `bal.tab()` when using it with `Matching` have the same form and function as those given when using it without a conditioning package, except for `s.d.denom`. If the estimand of the original call to `Match()` is the ATT, `s.d.denom` will default to `"treated"` if not specified; if the estimand is the ATE, `s.d.denom` will default to `"pooled"`; and if the estimand is the ATC, `s.d.denom` will default to `"control"`. The user can specify their own argument to `s.d.denom`, but using the defaults is advisable. In addition, the use of the `addl` argument is usually unnecessary because the covariates are entered manually as arguments, so all covariates for which balance is to be assessed can be entered through the `formula` or `covs` argument. If the covariates are stored in two separate data frames, it may be useful to include one in `formula` or `covs` and the other in `addl`.
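Below is a sketch of that usage, where `extra.covs` is a hypothetical second data frame of covariates not included in `f`:

```{r, eval = FALSE}
#Hypothetical: extra.covs is a second data frame of covariates
#not included in f
bal.tab(match.out, formula = f, data = lalonde,
        addl = extra.covs, distance = ~ p.score)
```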
### Using `bal.tab()` with `optmatch`

The `optmatch` package is useful for performing optimal pairwise or full matching. Most functions in `optmatch` are subsumed in `MatchIt`, but `optmatch` sees use from those who want finer control of the matching process than `MatchIt` allows. The output of calls to functions in `optmatch` is an `optmatch` object, which contains matching stratum membership for each unit in the given data set. Units that are matched with each other are assigned the same matching stratum.

The user guide for `optmatch` recommends using the `RItools` package for balance assessment, but below is an example of how to use `bal.tab()` for the same purpose. Note that some results will differ between `cobalt` and `RItools` because of differences in how balance is calculated in each.

```{r, eval = requireNamespace("optmatch", quietly = TRUE)}
#Optimal full matching on the propensity score
data("lalonde", package = "cobalt") #If not yet loaded
covs0 <- subset(lalonde, select = -c(treat, re78))
f <- reformulate(names(covs0), "treat")

fit <- glm(f, data = lalonde, family = binomial)
p.score <- fit$fitted.values #get the propensity score
fm <- optmatch::fullmatch(treat ~ p.score, data = lalonde)

bal.tab(fm, covs = covs0, distance = ~ p.score)
```

Most details for the use of `bal.tab()` with `optmatch` are similar to those when using `bal.tab()` with `Matching`. Users can enter either a formula and a data set or a vector of treatment status and a set of covariates. Unlike with `Matching`, entering the treatment variable is optional because it is already stored in the `optmatch` object. `bal.tab()` is compatible with both `pairmatch()` and `fullmatch()` output.

### Using `bal.tab()` with `CBPS`

The `CBPS` (Covariate Balancing Propensity Score) package estimates covariate balancing propensity scores, a class of propensity scores that are quite effective at balancing covariates among groups. `CBPS` includes functions for estimating propensity scores for binary, multi-category, and continuous treatments. `bal.tab()` functions similarly to `CBPS`'s `balance()` command. Below is a simple example of its use with a binary treatment:

```{r, eval = requireNamespace("CBPS", quietly = TRUE)}
#CBPS weighting
data("lalonde", package = "cobalt") #If not yet loaded
covs0 <- subset(lalonde, select = -c(treat, re78))
f <- reformulate(names(covs0), "treat")

#Generating covariate balancing propensity score weights for ATT
cbps.out <- CBPS::CBPS(f, data = lalonde)

bal.tab(cbps.out)
```

First is the original call to `CBPS()`. Next is the balance table containing mean differences for the covariates included in the input to `CBPS()`. Last is a table displaying sample size information. The "effective" sample size is displayed when weighting (rather than matching or subclassification) is used; it is calculated as is done in `twang`. See the `twang` documentation, `?bal.tab`, or "Details on Calculations" in the main vignette for details on this calculation.

The other arguments to `bal.tab()` when using it with `CBPS` have the same form and function as those given when using it without a conditioning package, except for `s.d.denom`. If the estimand of the original call to `CBPS()` is the ATT, `s.d.denom` will default to `"treated"` if not specified, and if the estimand is the ATE, `s.d.denom` will default to `"pooled"`. The user can specify their own argument to `s.d.denom`, but using the defaults is advisable.
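For example, the pooled standard deviation could be requested as the standardization factor regardless of the estimand (though, as noted, using the default is advisable):

```{r, eval = FALSE}
#Override the default standardization factor
bal.tab(cbps.out, s.d.denom = "pooled")
```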
`CBPSContinuous` objects resulting from fitting models in `CBPS` with continuous treatments are also compatible with `cobalt`. See the section "Using `cobalt` with continuous treatments" in the main vignette. `CBPS` objects resulting from fitting models in `CBPS` with multi-category treatments are also compatible with `cobalt`. See the section "Using `cobalt` with multi-category treatments" in the main vignette. `CBMSM` objects resulting from fitting models in `CBPS` with longitudinal treatments are also compatible with `cobalt`. See the Appendix 3 vignette.

### Using `bal.tab()` with `ebal`

The `ebal` package implements entropy balancing, a method of weighting for the ATT that yields perfect balance on all desired moments of the covariate distributions between groups. Rather than estimate a propensity score, entropy balancing generates weights directly that satisfy a user-defined moment condition specifying which moments are to be balanced. Note that all the functionality of `ebal` is contained within `WeightIt`. `ebal` does not have its own balance assessment function; its documentation instructs users to program balance checks themselves, which `cobalt` makes unnecessary. Below is a simple example of using `bal.tab()` with `ebal`:

```{r, eval = requireNamespace("ebal", quietly = TRUE)}
#Entropy balancing
data("lalonde", package = "cobalt") #If not yet loaded
covs0 <- subset(lalonde, select = -c(treat, re78, race))

#Generating entropy balancing weights
e.out <- ebal::ebalance(lalonde$treat, covs0)

bal.tab(e.out, treat = lalonde$treat, covs = covs0)
```

First is the balance table containing mean differences for the covariates included in the original call to `ebalance()`. In general, these will all be very close to 0. Next is a table displaying effective sample size information. See `?bal.tab` or "Details on Calculations" in the main vignette for details on this calculation. A common issue when using entropy balancing is a small effective sample size, which can yield low precision in effect estimation when using weighted regression, so it is important that users pay attention to this measure.

The input is similar to that for using `bal.tab()` with `Matching` or `optmatch`. In addition to the `ebalance` object, one must specify either both a formula and a data set or both a treatment vector and a data frame of covariates.

### Using `bal.tab()` with `designmatch`

The `designmatch` package implements various matching methods that use optimization to find matches that satisfy certain balance constraints. `bal.tab()` functions similarly to `designmatch`'s `meantab()` command but provides additional flexibility and convenience. Below is a simple example of using `bal.tab()` with `designmatch`:

```{r, eval = requireNamespace("designmatch", quietly = TRUE)}
#Mixed integer programming matching
library("designmatch")
data("lalonde", package = "cobalt") #If not yet loaded
covs0 <- subset(lalonde, select = -c(treat, re78, race))

#Matching for balance on covariates
dmout <- bmatch(lalonde$treat,
                dist_mat = NULL,
                subset_weight = NULL,
                mom = list(covs = covs0,
                           tols = absstddif(covs0, lalonde$treat, .05)),
                n_controls = 1,
                total_groups = 185)

bal.tab(dmout, treat = lalonde$treat, covs = covs0)
```

The input is similar to that for using `bal.tab()` with `Matching` or `optmatch`. In addition to the `designmatch` output object, one must specify either both a formula and a data set or both a treatment vector and a data frame of covariates. The output is similar to that of `optmatch`.
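Equivalently, a formula and a data set could have been supplied instead of the treatment vector and covariate data frame:

```{r, eval = FALSE}
#The same balance assessment using the formula interface
bal.tab(dmout, formula = treat ~ age + educ + married + nodegree +
          re74 + re75, data = lalonde)
```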
### Using `bal.tab()` with `sbw`

The `sbw` package implements optimization-based weighting to estimate weights that satisfy certain balance constraints and have minimal variance. `bal.tab()` functions similarly to `sbw`'s `summarize()` function but provides additional flexibility and convenience. Below is a simple example of using `bal.tab()` with `sbw`:

```{r, eval = requireNamespace("sbw", quietly = TRUE)}
#Optimization-based weighting
data("lalonde", package = "cobalt") #If not yet loaded
lalonde_split <- splitfactor(lalonde, drop.first = "if2")
cov.names <- setdiff(names(lalonde_split), c("treat", "re78"))

#Estimating balancing weights for the ATT
sbw.out <- sbw::sbw(lalonde_split,
                    ind = "treat",
                    bal = list(bal_cov = cov.names,
                               bal_alg = FALSE,
                               bal_tol = .001),
                    par = list(par_est = "att"))

bal.tab(sbw.out, un = TRUE, disp.means = TRUE)
```

The output is similar to the output of a call to `summarize()`. Rather than stacking several balance tables vertically, each with its own balance summary, `bal.tab()` displays them horizontally. Note that due to differences in how `sbw` and `cobalt` compute the standardization factor in the standardized mean difference, values may not be identical between `bal.tab()` and `summarize()`. Also note that `bal.tab()`'s default is to display raw rather than standardized mean differences for binary variables.

### Using `bal.tab()` with `MatchThem`

The `MatchThem` package is essentially a wrapper for `matchit()` from `MatchIt` and `weightit()` from `WeightIt` but for use with multiply imputed data. Using `bal.tab()` on `mimids` or `wimids` objects from `MatchThem` activates the features that accompany multiply imputed data; balance is assessed within each imputed data set and aggregated across imputations. See `?bal.tab.imp` or `vignette("segmented-data")` for more information about using `cobalt` with multiply imputed data. Below is a simple example of using `bal.tab()` with `MatchThem`:

```{r, eval = requireNamespace("MatchThem", quietly = TRUE) && requireNamespace("mice", quietly = TRUE)}
#PS weighting on multiply imputed data
data("lalonde_mis", package = "cobalt")

#Generate imputed data sets
m <- 10 #number of imputed data sets
imp.out <- mice::mice(lalonde_mis, m = m, print = FALSE)

#Matching for balance on covariates
mt.out <- MatchThem::matchthem(treat ~ age + educ + married + race +
                                 re74 + re75,
                               datasets = imp.out,
                               approach = "within",
                               method = "nearest",
                               estimand = "ATT")
bal.tab(mt.out)

#Weighting for balance on covariates
wt.out <- MatchThem::weightthem(treat ~ age + educ + married + race +
                                  re74 + re75,
                                datasets = imp.out,
                                approach = "within",
                                method = "glm",
                                estimand = "ATE")
bal.tab(wt.out)
```

The input is similar to that for using `bal.tab()` with `MatchIt` or `WeightIt`.

### Using `bal.tab()` with `cem`

The `cem` package implements coarsened exact matching for binary and multi-category treatments. `bal.tab()` functions similarly to `cem`'s `imbalance()`. Below is a simple example of using `bal.tab()` with `cem`:

```{r, eval = FALSE && requireNamespace("cem", quietly = TRUE)}
#Coarsened exact matching
data("lalonde", package = "cobalt") #If not yet loaded

#Matching for balance on covariates
cem.out <- cem::cem("treat", data = lalonde, drop = "re78")

bal.tab(cem.out, data = lalonde, stats = c("m", "ks"))
```

The input is similar to that for using `bal.tab()` with `Matching` or `optmatch`. In addition to the `cem()` output object, one must specify either both a formula and a data set or both a treatment vector and a data frame of covariates.
Unlike with `Matching`, entering the treatment variable is optional because it is already stored in the output object. The output is similar to that of `optmatch`.

When using `cem()` with multiply imputed data (i.e., by supplying a list of data frames to the `datalist` argument of `cem()`), an argument to `imp` should be specified to `bal.tab()`, or a `mids` object from the `mice` package should be given as the argument to `data`. See `?bal.tab.imp` or `vignette("segmented-data")` for more information about using `cobalt` with multiply imputed data. Below is an example of using `cem` with multiply imputed data from `mice`:

```{r, message = FALSE, eval = FALSE && all(sapply(c("cem", "mice"), requireNamespace, quietly = TRUE))}
#Coarsened exact matching on multiply imputed data
data("lalonde_mis", package = "cobalt")

#Generate imputed data sets
m <- 10 #number of imputed data sets
imp.out <- mice::mice(lalonde_mis, m = m, print = FALSE)
imp.data.list <- mice::complete(imp.out, "all")

#Match within each imputed dataset
cem.out.imp <- cem::cem("treat", datalist = imp.data.list, drop = "re78")

bal.tab(cem.out.imp, data = imp.out)
```

### Using `bal.tab()` with other packages

It is possible to use `bal.tab()` with objects that don't come from these packages using the `default` method. If an object that doesn't correspond to the output of one of the specifically supported packages is passed as the first argument to `bal.tab()`, `bal.tab()` will do its best to process that object as if it did come from a supported package. It will search through the components of the object for items with names like `"treat"`, `"covs"`, `"data"`, `"weights"`, etc., that have the correct object types. Any additional arguments can be specified by the user.

The goal of the `default` method is to allow package authors to rely on `cobalt` as a substitute for any balancing function they might otherwise write. By ensuring compatibility with the `default` method, package authors can have their users simply supply the output of a compatible function to `cobalt` functions without a specific method having to be written in `cobalt`. A package author would need to make sure the output of their package contained enough information with correctly named components; if so, `cobalt` functions can be used as conveniently with that output as they are with output from specifically supported packages.

Below, we demonstrate this capability with the output of `optweight`, which performs a version of propensity score weighting using optimization, similar to `sbw`. No `bal.tab()` method has been written with `optweight` output in mind; rather, `optweight` was written to have output compatible with the `default` method of `bal.tab()`.

```{r, eval = requireNamespace("optweight", quietly = TRUE)}
#Optimization-based weighting
data("lalonde", package = "cobalt")

#Estimate the weights using optimization
ow.out <- optweight::optweight(treat ~ age + educ + married + race +
                                 re74 + re75,
                               data = lalonde, estimand = "ATE",
                               tols = .01)

#Note the contents of the output object:
names(ow.out)

#Use bal.tab() directly on the output
bal.tab(ow.out)
```

The output object is processed as if it came from a specifically supported package. See `?bal.tab.default` for more details and another example.
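As a rough, hypothetical sketch of the kind of object the `default` method is designed to process, consider a hand-constructed list with a custom class (here, `"myFit"`) whose components use names `bal.tab()` searches for:

```{r, eval = FALSE}
#Hypothetical output object from an unsupported package; the component
#names match those the default method looks for
my.out <- structure(list(treat = lalonde$treat,
                         covs = covs0,
                         weights = ow.out$weights),
                    class = "myFit")

bal.tab(my.out)
```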