8 min read

Lessons in statistical programming: part 1

I really started learning how to use R for statistical programming in my first job after completing an MS in statistics. Graduate courses in statistics generally teach some R, but programming for coursework doesn’t have nearly the same demands as a full-time job. My statistician role at Mayo Clinic encouraged me to improve my programming for three reasons:

  1. For nearly all of the analysis and documents that I produced, the audience was a team of busy, clinically — not statistically — trained investigators and medical trainees. They were mainly hospital physicians/surgeons, epidemiologists, and medical fellows/residents. I needed to present the results of my analysis in a form that was more-or-less clear and digestible to this audience, either in a meeting or (with more difficulty) over email. Creating clear, concise analysis documents that were also reproducible — meaning, they could be reproduced exactly the same as the first time by running code rather than manually assembling a document — required learning new tools.
  2. Both the extent of each individual project and the number of projects going on simultaneously increased severalfold. More extensive projects meant that, unless I established a system for staying organized, the amount of analysis and code would grow beyond what I could manage without losing track and making mistakes. A greater number of projects made it all the more important to stay organized, and it meant that similar tasks got repeated many times; when I was accomplishing a task in an inefficient or mistake-prone way, it usually didn’t take long for that to become apparent.
  3. The stakes were somewhat higher when conducting analysis that might get published in a medical journal, compared to programming for coursework. Additionally, if there were frequently mistakes — even minor ones — in my work, it could undermine the trust that the investigators placed in me to handle their data and produce analysis that they were comfortable defending.

The following are some lessons I’ve learned as my statistical programming developed during five years of working in clinical research.

Learn to create readable, reproducible reports

In clinical research, as in nearly all applications of statistics (I imagine), the results of statistical analysis primarily need to be communicated to non-statisticians. If I wanted to show the results of a linear regression to another statistician, there would be no problem bringing up an R console and simply showing them console output:

fit <-
  lm(outcome ~ sex + age.10 + treatment,
     data = patients)
summary(fit)
## 
## Call:
## lm(formula = outcome ~ sex + age.10 + treatment, data = patients)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.325  -6.861   0.850   8.688  23.996 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.069      5.854   0.524   0.6014    
## sexMale        4.874      2.387   2.042   0.0439 *  
## age.10        -5.843      1.040  -5.618 1.92e-07 ***
## treatmentB     2.122      2.962   0.716   0.4756    
## treatmentC     4.999      2.769   1.805   0.0742 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.65 on 95 degrees of freedom
## Multiple R-squared:  0.2987, Adjusted R-squared:  0.2691 
## F-statistic: 10.11 on 4 and 95 DF,  p-value: 7.298e-07

A statistician would know where to find the pertinent information here, and wouldn’t be bothered much by everything else. Many non-statisticians familiar with linear regression could read this fine too. But there’s quite a bit of extraneous “ink” here: the Call and Residuals, Std. Error and t value, more digits than needed in the table, and some esoteric numbers and labels. Plus, there are no confidence intervals. To meet the needs of a research team whose members have a range of experience with statistics, this output needs to be distilled and formatted in a more user-friendly manner.

I’ve used a few different methods to make the output more presentable. When it comes to statistical models such as linear regression, one step often is to extract information into a data frame using the broom package’s tidy (and occasionally glance) functions.

library(broom)
tidy(fit, conf.int = TRUE)
## # A tibble: 5 × 7
##   term        estimate std.error statistic     p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>    <dbl>     <dbl>
## 1 (Intercept)     3.07      5.85     0.524 0.601         -8.55      14.7 
## 2 sexMale         4.87      2.39     2.04  0.0439         0.136      9.61
## 3 age.10         -5.84      1.04    -5.62  0.000000192   -7.91      -3.78
## 4 treatmentB      2.12      2.96     0.716 0.476         -3.76       8.00
## 5 treatmentC      5.00      2.77     1.81  0.0742        -0.499     10.5

Then I like to modify the table in the following ways:

  • Drop the (Intercept) row if not relevant for the analysis.
  • Create a single, formatted estimate of the form point (95% lower, upper), with an appropriate number of digits.
  • Re-format P-values to show two significant digits if \(\geq\) 0.10, one significant digit if < 0.10 but not smaller than, say, 0.0001, and “< 0.0001” otherwise.
  • Replace variable names with reader-friendly labels, in a manner similar to how labels are used in SAS. This is unnecessary for many variables (male and treatment are clear enough), but with continuous variables it’s helpful to include units, and complicated or uncommon variables sometimes warrant a more descriptive label.
  • For estimates associated with levels of a categorical variable, include a row indicating the reference level.
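A base-R sketch of the estimate and P-value formatting rules above (the function names are my own; in practice these would be applied to the columns of the tidied data frame):

```r
# Hypothetical helpers (names my own) implementing the formatting rules above

# Combine a point estimate and 95% CI into "point (lower, upper)"
format_est <- function(est, lo, hi, digits = 1) {
  f <- function(x) formatC(x, format = "f", digits = digits)
  paste0(f(est), " (", f(lo), ", ", f(hi), ")")
}

# P-values: two significant digits if >= 0.10, one significant digit if
# < 0.10, and the floor "<0.0001" below 0.0001
format_p <- function(p) {
  if (p < 1e-4) return("<0.0001")
  if (p < 0.10) return(format(signif(p, 1)))
  format(signif(p, 2))
}

format_est(4.874, 0.136, 9.61)   # "4.9 (0.1, 9.6)"
sapply(c(0.0439, 1.92e-07, 0.4756), format_p)
```

From there, swapping in variable labels and inserting reference-level rows is mostly a matter of joining against a small lookup table of labels.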

My final table typically looks like this:

## # A tibble: 6 × 4
##   Variable            Level  `Coefficient (95% CI)` `P-value`
##   <chr>               <chr>  <chr>                  <chr>    
## 1 "Sex"               Female ---                    ---      
## 2 ""                  Male   4.9 (0.1, 9.6)         0.04     
## 3 "Age, per 10 years" ---    -5.8 (-7.9, -3.8)      <0.0001  
## 4 "Treatment"         A      ---                    ---      
## 5 ""                  B      2.1 (-3.8, 8.0)        0.48     
## 6 ""                  C      5.0 (-0.5, 10.5)       0.07

Having created a table with more presentable content, what remains is to save it in a document. Textual R output can be written to a text file using the sink function, and figures can be saved in their own files, but for any non-trivial analysis, there will be a number of tables and figures that will need to be collected into a single document, ideally with explanatory text. The explanatory text, which would include table/figure captions and other notes describing the data or statistical methods, as well as section headers to help navigate larger documents, can be important for communicating and documenting the analysis. This type of document may need to be revised or re-compiled many times through the course of a project, so it’s best to be able to produce it entirely with code rather than manual assembly.
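For what it’s worth, sink works like this (sketched with a built-in dataset standing in for real patient data):

```r
# Redirect console output to a text file with sink(); a toy model fit to
# the built-in cars dataset stands in for a real analysis
fit_toy <- lm(dist ~ speed, data = cars)

sink("model-summary.txt")   # start capturing console output
print(summary(fit_toy))
sink()                      # stop capturing, restore the console
```

This captures the raw summary text, but it doesn't get you anywhere close to an assembled document with tables, figures, and explanatory text.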

This is where R Markdown is indispensable. I rely on it so much that I’ve hardly given thought to how one might put together a report based on R output without using R Markdown. (I didn’t know about the sink function until just now.) For anyone not familiar with it, R Markdown provides a means for interweaving text such as this paragraph with blocks of R code, the output of which is incorporated into a document (usually PDF or HTML). This post was written with R Markdown.

When rendered using the knitr package’s kable function, the table above looks like this:

|Variable          |Level  |Coefficient (95% CI) |P-value |
|:-----------------|:------|:--------------------|:-------|
|Sex               |Female |---                  |---     |
|                  |Male   |4.9 (0.1, 9.6)       |0.04    |
|Age, per 10 years |---    |-5.8 (-7.9, -3.8)    |<0.0001 |
|Treatment         |A      |---                  |---     |
|                  |B      |2.1 (-3.8, 8.0)      |0.48    |
|                  |C      |5.0 (-0.5, 10.5)     |0.07    |
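The kable call itself is simple: given the formatted data frame, it emits a markdown table for the document renderer. A sketch, with the values hard-coded from the model above for illustration:

```r
library(knitr)

# Formatted results from the model above, hard-coded here for illustration
tbl <- data.frame(
  Variable = c("Sex", "", "Age, per 10 years"),
  Level = c("Female", "Male", "---"),
  `Coefficient (95% CI)` = c("---", "4.9 (0.1, 9.6)", "-5.8 (-7.9, -3.8)"),
  `P-value` = c("---", "0.04", "<0.0001"),
  check.names = FALSE
)
kable(tbl)   # emits a pipe-delimited markdown table
```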

My intent with all this work creating a nicely formatted table is to reduce as much friction as possible in the transfer of information from the analysis output to the minds of the research team: eliminate extraneous information and symbols, and present the pertinent information in a familiar format — that is, in a format similar to what one would find in a medical journal article.

Another reason to take care with how I present my analysis results, even at an early stage, is that the investigators for whom I do the analysis have sometimes inserted what I considered a preliminary table or figure into a manuscript draft without giving me much opportunity to revise it. It seems best to get those tables and figures into shape from the start, so that fewer difficulties with revision come up later.

Besides tables summarizing regression models, as in the example above, I commonly need to create tables that simply summarize data, as in the typical Table 1 of a medical journal article. For that, I came to rely on the arsenal package developed and maintained at Mayo Clinic. The tableby function does the job:

library(arsenal)
tableby(treatment ~ sex + age,
        total = FALSE,
        test = FALSE,
        data = patients) %>%
  summary(digits = 1)
|                            |A (N=34)    |B (N=29)    |C (N=37)    |
|:---------------------------|:-----------|:-----------|:-----------|
|**Sex**                     |            |            |            |
|&nbsp;&nbsp;&nbsp;Female    |15 (44.1%)  |10 (34.5%)  |15 (40.5%)  |
|&nbsp;&nbsp;&nbsp;Male      |19 (55.9%)  |19 (65.5%)  |22 (59.5%)  |
|**Age, years**              |            |            |            |
|&nbsp;&nbsp;&nbsp;Mean (SD) |50.9 (10.2) |48.6 (14.4) |50.6 (9.5)  |
|&nbsp;&nbsp;&nbsp;Range     |31.9 - 69.7 |18.1 - 78.4 |24.3 - 69.9 |

When it comes to creating figures (graphs, plots), it was valuable to learn the ways of ggplot2. It wasn’t easy to learn; it took at least a couple of years before I felt some fluency with all of the geoms and so on. But it’s an investment that frequently pays dividends. Now that I’m comfortable with the syntax, the basic types of plot elements, and many of the ways they can be customized, I can create both routine and unconventional or complicated figures without excessive effort (“effort” meaning Googling). That means I can more easily experiment with different ways to present data and, when the time comes in a project, create more sophisticated publication-quality figures.
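As a small illustration of the kind of figure that becomes routine, a sketch with ggplot2 and a built-in dataset (assuming ggplot2 is installed; labels and theme are my own choices):

```r
library(ggplot2)

# Scatter plot with reader-friendly axis labels, using the built-in mtcars data
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(x = "Weight, 1000 lbs", y = "Fuel economy, mpg",
       colour = "Cylinders") +
  theme_minimal()

p  # in an R Markdown document, printing the plot embeds the figure
```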