Sometimes a statistical analysis is to be repeated in subgroups defined by one or more variables. This is particularly common in clinical trials, where one component of understanding the effect of the investigational treatment may be understanding whether the effect is substantially stronger in particular groups of patients.
Recently I needed to repeat an analysis in subgroups defined by each of about 12 variables. In R, I used nested for loops to accomplish the task, but I suspected there was a more concise way using tidyverse functions. Here’s what I found.
Setup
Here are some fake data I’ll use in the following code examples. Consider x1–x3 to be grouping variables, and y an outcome variable.
library(dplyr) # for tidy data manipulation
n <- 1000
tbl <-
tibble(x1 = sample(1:2, n, replace = TRUE),
x2 = sample(1:3, n, replace = TRUE),
x3 = sample(1:4, n, replace = TRUE),
y = rnorm(n)) %>%
mutate(across(x1:x3, factor))
head(tbl)
## # A tibble: 6 × 4
## x1 x2 x3 y
## <fct> <fct> <fct> <dbl>
## 1 2 2 1 0.240
## 2 2 1 2 0.889
## 3 1 2 3 -0.000947
## 4 1 3 1 -1.22
## 5 1 1 4 -0.300
## 6 1 2 4 1.20
One subgroup analysis
First, to illustrate the problem, suppose we want to repeat an analysis in subgroups defined by only one variable, x1. Using a for loop might look like this:
library(broom) # for tidy model results
# The analysis to be done in each subgroup.
# Returns, e.g., point and interval estimates
# for a parameter.
do_analysis <- function (data) {
lm(y ~ 1, data = data) %>%
tidy(conf.int = TRUE) %>%
select(estimate, conf.low, conf.high)
}
output <- NULL
for (lvl in levels(tbl$x1)) {
tmp <-
tbl %>%
filter(x1 %in% lvl)
output <-
output %>%
bind_rows(do_analysis(tmp) %>%
mutate(x1 = lvl,
.before = 1))
}
output
## # A tibble: 2 × 4
## x1 estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 1 0.00524 -0.0807 0.0912
## 2 2 -0.0716 -0.160 0.0165
The same task can be accomplished in a slightly more concise way, in one chain of “piped” operations, using functions from the tidyr and purrr packages:
library(purrr) # for mapping
library(tidyr) # for nesting
tbl %>%
# For each unique value of x1, collapse the
# remaining data into a data frame and put
# these data frames in a column named 'data'
nest(data = -x1) %>%
# Perform analysis on each data frame
mutate(est = map(data, do_analysis)) %>%
# Unpack the analysis results
unnest(cols = est) %>%
select(-data)
## # A tibble: 2 × 4
## x1 estimate conf.low conf.high
## <fct> <dbl> <dbl> <dbl>
## 1 2 -0.0716 -0.160 0.0165
## 2 1 0.00524 -0.0807 0.0912
Multiple subgroup analyses
Similarly, when analysis needs to be repeated across subgroups defined by multiple variables, one option is to use nested for loops:
output <- NULL
subgroup.vars <- c("x1", "x2", "x3")
for (var in subgroup.vars) {
for (lvl in levels(tbl[[var]])) {
tmp <- tbl[tbl[[var]] %in% lvl, ]
output <-
output %>%
bind_rows(do_analysis(tmp) %>%
mutate(var = var,
lvl = lvl,
.before = 1))
}
}
output
## # A tibble: 9 × 5
## var lvl estimate conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 x1 1 0.00524 -0.0807 0.0912
## 2 x1 2 -0.0716 -0.160 0.0165
## 3 x2 1 -0.0746 -0.179 0.0294
## 4 x2 2 0.0405 -0.0693 0.150
## 5 x2 3 -0.0548 -0.161 0.0518
## 6 x3 1 -0.0360 -0.164 0.0919
## 7 x3 2 -0.0950 -0.217 0.0272
## 8 x3 3 -0.0240 -0.151 0.103
## 9 x3 4 0.0253 -0.0902 0.141
Alternatively, one can use the nest and map functions, but this time it’s a little tricky. First, I create a table of the subgroup variables and their levels:
tmp <-
tibble(var = subgroup.vars) %>%
mutate(lvl = map(var, ~ levels(tbl[[.x]]))) %>%
unnest(cols = lvl)
tmp
## # A tibble: 9 × 2
## var lvl
## <chr> <chr>
## 1 x1 1
## 2 x1 2
## 3 x2 1
## 4 x2 2
## 5 x2 3
## 6 x3 1
## 7 x3 2
## 8 x3 3
## 9 x3 4
Then I add the data sets corresponding to each subgroup. In the resulting table, the original data set has been replicated (while also being split by subgroups). Each original observation appears times, where is the number of subgroup variables.
tmp %>%
mutate(data =
map2(var, lvl,
~ tbl[tbl[[.x]] %in% .y, ]))
## # A tibble: 9 × 3
## var lvl data
## <chr> <chr> <list>
## 1 x1 1 <tibble [521 × 4]>
## 2 x1 2 <tibble [479 × 4]>
## 3 x2 1 <tibble [342 × 4]>
## 4 x2 2 <tibble [315 × 4]>
## 5 x2 3 <tibble [343 × 4]>
## 6 x3 1 <tibble [232 × 4]>
## 7 x3 2 <tibble [243 × 4]>
## 8 x3 3 <tibble [272 × 4]>
## 9 x3 4 <tibble [253 × 4]>
Finally, we perform analysis on each data set and unpack the results. Altogether the operation looks like this:
tibble(var = subgroup.vars) %>%
mutate(lvl = map(var, ~ levels(tbl[[.x]]))) %>%
unnest(cols = lvl) %>%
mutate(data =
map2(var, lvl,
~ tbl[tbl[[.x]] %in% .y, ])) %>%
mutate(est = map(data, do_analysis)) %>%
unnest(cols = est) %>%
select(-data)
## # A tibble: 9 × 5
## var lvl estimate conf.low conf.high
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 x1 1 0.00524 -0.0807 0.0912
## 2 x1 2 -0.0716 -0.160 0.0165
## 3 x2 1 -0.0746 -0.179 0.0294
## 4 x2 2 0.0405 -0.0693 0.150
## 5 x2 3 -0.0548 -0.161 0.0518
## 6 x3 1 -0.0360 -0.164 0.0919
## 7 x3 2 -0.0950 -0.217 0.0272
## 8 x3 3 -0.0240 -0.151 0.103
## 9 x3 4 0.0253 -0.0902 0.141
Is this method really the more concise alternative that I was looking for? Not exactly; it’s similar in the amount of code, compared to the nested for loops. Although I like having one chain of operations — and no intermediate/temporary objects — in the style of the tidyverse, this method is less efficient in a way (by replicating the data), and the code might be more difficult to understand than the nested for loops.
Conclusion
For a single subgroup analysis, I prefer the more concise method using the tidyverse’s nest-map-unnest functions.
For multiple subgroup variables, it’s possible to use tidyverse functions in a single chain of operations, but it’s less clear whether this method is an improvement over for loops.