Assigning after dplyr

James E. Pustejovsky

Hadley Wickham’s dplyr and tidyr packages completely changed the way I do data manipulation/munging in R. These packages make it possible to write shorter, faster, more legible, easier-to-intepret code to accomplish the sorts of manipulations that you have to do with practically any real-world data analysis. The legibility and interpretability benefits come from

using functions that are simple verbs that do exactly what they say (e.g., filter, summarize, group_by) and
chaining multiple operations together, through the pipe operator %>% from the magrittr package.

Chaining is particularly nice because it makes the code read like a story. For example, here’s the code to calculate sample means for the baseline covariates in a little experimental dataset I’ve been working with recently:

Code

library(dplyr)
dat <- read.csv("http://jepusto.com/data/Mineo_2009_data.csv")

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) ->
  baseline_means

Each line of the code is a different action: first group the data by Condition, then select the relevant variables, then summarise each of the variables with its sample mean in each group. The results are stored in a dataset called baseline_means.

As I’ve gotten familiar with dplyr, I’ve adopted the style of using the backwards assignment operator (->) to store the results of a chain of manipulations. This is perhaps a little bit odd—in all the rest of my code I stick with the forward assignment operator (<-) with the object name on the left—but the alternative is to break the “flow” of the story, effectively putting the punchline before the end of the joke. Consider:

Code

baseline_means <- dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean))

That’s just confusing to me. So backward assignment operator it is.

Assigning as a verb

My only problem with this convention is that, with complicated chains of manipulations, I often find that I need to tweak the order of the verbs in the chain. For example, I might want to summarize all of the variables, and only then select which ones to store:

Code

dat %>%
  group_by(Condition) %>%
  summarise_each(funs(mean)) %>%
  select(Age, starts_with("Baseline")) ->
  baseline_means

In revising the code, it’s necessary to change the symbols at the end of the second and third steps, which is a minor hassle. It’s possible to do it by very carefully cutting-and-pasting the end of the second step through everything but the -> after the third step, but that’s a delicate operation, prone to error if you’re programming after hours or after beer. Wouldn’t it be nice if every step in the chain ended with %>% so that you could move around whole lines of code without worrying about the bit at the end?

Here’s one crude way to end each link in the chain with a pipe:

Code

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  identity() -> baseline_means

But this is still pretty ugly—it’s got an extra function call that’s not a verb, and the name of the resulting object is tucked away in the middle of a line. What I need is a verb to take the results of a chain of operations and assign to an object. Base R has a suitable candidate here: the assign function. How about the following?

Code

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  assign("baseline_means_new", .)

exists("baseline_means_new")

[1] FALSE

This doesn’t work because of some subtlety with the environment into which baseline_means_new is assigned. A brute-force fix would be to specify that the assign should be into the global environment. This will probably work 90%+ of the time, but it’s still not terribly elegant.

Here’s a function that searches the call stack to find the most recent invocation of itself that does not involve non-standard evaluation, then assigns to its parent environment:

Code

put <- function(x, name, where = NULL) {
  if (is.null(where)) {
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.",sys_calls)
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name, value = x, pos = where)
}

Here are my quick tests that this function is assigning to the right environment:

Code

put(dat, "dat1")
dat %>% put("dat2")

f <- function(dat, name) {
  put(dat, "dat3")
  dat %>% put("dat4")
  put(dat, name)
  c(exists("dat3"), exists("dat4"), exists(name))
}

f(dat,"dat5")

[1] TRUE TRUE TRUE

Code

grep("dat",ls(), value = TRUE)

[1] "dat"  "dat1" "dat2"

This appears to work even if you’ve got multiple nested calls to put:

Code

put(f(dat, "dat6"), "dat7")
grep("dat",ls(), value = TRUE)

[1] "dat"  "dat1" "dat2" "dat7"

Code

dat7

[1] TRUE TRUE TRUE

Code

f(dat, "dat8") %>% put("dat9")
grep("dat",ls(), value = TRUE)

[1] "dat"  "dat1" "dat2" "dat7" "dat9"

Code

dat9

[1] TRUE TRUE TRUE

It works! (I think…)

To be consistent with the style of dplyr, let me also tweak the function to allow name to be the unquoted object name:

Code

put <- function(x, name, where = NULL) {
  name_string <- deparse(substitute(name))
  if (is.null(where)) {
    sys_calls <- sys.calls()
    put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.",sys_calls)
    where <- sys.frame(max(which(put_calls)) - 1)
  }
  assign(name_string, value = x, pos = where)
}

Returning to my original chain of manipulations, here’s how it looks with the new function:

Code

dat %>%
  group_by(Condition) %>%
  select(Age, starts_with("Baseline")) %>%
  summarise_each(funs(mean)) %>%
  put(baseline_means_new)

print(baseline_means_new)

# A tibble: 3 × 4
  Condition   Age Baseline.Gaze Baseline.Vocalizations
  <chr>     <dbl>         <dbl>                  <dbl>
1 OtherVR    122.          91.9                   2.86
2 SelfVR     139.          95.5                   1.43
3 SelfVid    121.         102.                    1.86

If you’ve been following along, let me know what you think of this. Is it a good idea, or is it dangerous? Are there cases where this will break? Can you think of a better name?

--- title: Assigning after dplyr date: '2016-05-13' categories: - Rstats - programming code-tools: true --- Hadley Wickham's [dplyr](https://github.com/hadley/dplyr) and [tidyr](https://github.com/hadley/tidyr) packages completely changed the way I do data manipulation/munging in R. These packages make it possible to write shorter, faster, more legible, easier-to-intepret code to accomplish the sorts of manipulations that you have to do with practically any real-world data analysis. The legibility and interpretability benefits come from * using functions that are simple verbs that do exactly what they say (e.g., `filter`, `summarize`, `group_by`) and * chaining multiple operations together, through the pipe operator `%>%` from the [magrittr](https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html) package. Chaining is particularly nice because it makes the code read like a story. For example, here's the code to calculate sample means for the baseline covariates in a little experimental dataset I've been working with recently: ```{r, message = FALSE} library(dplyr) dat <- read.csv("http://jepusto.com/data/Mineo_2009_data.csv") dat %>% group_by(Condition) %>% select(Age, starts_with("Baseline")) %>% summarise_each(funs(mean)) -> baseline_means ``` Each line of the code is a different action: first group the data by `Condition`, then select the relevant variables, then summarise each of the variables with its sample mean in each group. The results are stored in a dataset called `baseline_means`. As I've gotten familiar with `dplyr`, I've adopted the style of using the backwards assignment operator (`->`) to store the results of a chain of manipulations. This is perhaps a little bit odd---in all the rest of my code I stick with the forward assignment operator (`<-`) with the object name on the left---but the alternative is to break the "flow" of the story, effectively putting the punchline before the end of the joke. Consider: ```{r} baseline_means <- dat %>% group_by(Condition) %>% select(Age, starts_with("Baseline")) %>% summarise_each(funs(mean)) ``` That's just confusing to me. So backward assignment operator it is. ### Assigning as a verb My only problem with this convention is that, with complicated chains of manipulations, I often find that I need to tweak the order of the verbs in the chain. For example, I might want to summarize _all_ of the variables, and only then select which ones to store: ```{r} dat %>% group_by(Condition) %>% summarise_each(funs(mean)) %>% select(Age, starts_with("Baseline")) -> baseline_means ``` In revising the code, it's necessary to change the symbols at the end of the second and third steps, which is a minor hassle. It's possible to do it by very carefully cutting-and-pasting the end of the second step through everything but the `->` after the third step, but that's a delicate operation, prone to error if you're programming after hours or after beer. Wouldn't it be nice if every step in the chain ended with `%>%` so that you could move around whole lines of code without worrying about the bit at the end? Here's one crude way to end each link in the chain with a pipe: ```{r} dat %>% group_by(Condition) %>% select(Age, starts_with("Baseline")) %>% summarise_each(funs(mean)) %>% identity() -> baseline_means ``` But this is still pretty ugly---it's got an extra function call that's not a verb, and the name of the resulting object is tucked away in the middle of a line. What I need is a verb to take the results of a chain of operations and assign to an object. Base R has a suitable candidate here: the `assign` function. How about the following? ```{r} dat %>% group_by(Condition) %>% select(Age, starts_with("Baseline")) %>% summarise_each(funs(mean)) %>% assign("baseline_means_new", .) exists("baseline_means_new") ``` This doesn't work because of some subtlety with the environment into which `baseline_means_new` is assigned. A brute-force fix would be to specify that the assign should be into the global environment. This will probably work 90%+ of the time, but it's still not terribly elegant. Here's a function that searches the call stack to find the most recent invocation of itself that does not involve non-standard evaluation, then assigns to its parent environment: ```{r} put <- function(x, name, where = NULL) { if (is.null(where)) { sys_calls <- sys.calls() put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.",sys_calls) where <- sys.frame(max(which(put_calls)) - 1) } assign(name, value = x, pos = where) } ``` Here are my quick tests that this function is assigning to the right environment: ```{r} put(dat, "dat1") dat %>% put("dat2") f <- function(dat, name) { put(dat, "dat3") dat %>% put("dat4") put(dat, name) c(exists("dat3"), exists("dat4"), exists(name)) } f(dat,"dat5") grep("dat",ls(), value = TRUE) ``` This appears to work even if you've got multiple nested calls to `put`: ```{r} put(f(dat, "dat6"), "dat7") grep("dat",ls(), value = TRUE) dat7 f(dat, "dat8") %>% put("dat9") grep("dat",ls(), value = TRUE) dat9 ``` ### It works! (I think...) To be consistent with the style of dplyr, let me also tweak the function to allow `name` to be the unquoted object name: ```{r} put <- function(x, name, where = NULL) { name_string <- deparse(substitute(name)) if (is.null(where)) { sys_calls <- sys.calls() put_calls <- grepl("\\<put\\(", sys_calls) & !grepl("\\<put\\(\\.",sys_calls) where <- sys.frame(max(which(put_calls)) - 1) } assign(name_string, value = x, pos = where) } ``` Returning to my original chain of manipulations, here's how it looks with the new function: ```{r} dat %>% group_by(Condition) %>% select(Age, starts_with("Baseline")) %>% summarise_each(funs(mean)) %>% put(baseline_means_new) print(baseline_means_new) ``` If you've been following along, let me know what you think of this. Is it a good idea, or is it dangerous? Are there cases where this will break? Can you think of a better name?