purrr Like a Kitten till the Lake Pipes RoaR

I really should make a minimal effort to resist opening a data analysis blog post with Beach Boys’ lyrics, but this time the combination is too apt. We use the purrr package to show how to let your pipes roar in R.

The tidyverse GitHub site contains a simple example illustrating how well pipes and purrr work together. For more learning, try Jenny Bryan’s purrr tutorial.

First, load the tidyverse and the purrr package.

library(tidyverse)
library(purrr)
#devtools::install_github("jennybc/repurrrsive")
#library(repurrrsive)


If you want to be more adventurous, you can download Jenny’s repurrrsive package from GitHub. The code is hashed out in the chunk above.

You Don’t Know What I Got

The great thing about using pipes is you can tie together a series of steps to produce a single output of exactly what you want. In this case, we are going to start with the base R dataset airquality and apply a number of functions to come up with the adjusted R-squared value for the Ozone data for each month between May and September. Along the way we will run a linear regression on the data to generate the adjusted R-squared values.

Here is what the full process looks like, from beginning to end.

airquality %>% 
         split(.$Month) %>% 
         map(~ lm(Ozone ~ Temp, data = .)) %>% 
         map(summary) %>% 
         map_dbl('adj.r.squared')

     5      6      7      8      9 
0.2781 0.3676 0.5024 0.3307 0.6742

The problem, of course, is the output generated by the intermediate steps stays behind the scenes. For a beginner, this can be a bit confusing because it isn’t clear what is going on. So let’s break the full chunk into its constituent pieces so we can see what purrr is doing and how pipes tie the whole thing together.

Start at the beginning and take things step-by-step.

Run the airquality data set to see what we are dealing with. We can see the data is collected daily for five months. Ozone looks to be the target, so it is natural to wonder if there is any relationship between it and the other variables.

airquality

    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
2      36     118  8.0   72     5   2
3      12     149 12.6   74     5   3
4      18     313 11.5   62     5   4
5      NA      NA 14.3   56     5   5
6      28      NA 14.9   66     5   6
7      23     299  8.6   65     5   7
8      19      99 13.8   59     5   8
9       8      19 20.1   61     5   9
10     NA     194  8.6   69     5  10
11      7      NA  6.9   74     5  11
12     16     256  9.7   69     5  12
13     11     290  9.2   66     5  13
14     14     274 10.9   68     5  14
15     18      65 13.2   58     5  15
16     14     334 11.5   64     5  16
17     34     307 12.0   66     5  17
18      6      78 18.4   57     5  18
19     30     322 11.5   68     5  19
20     11      44  9.7   62     5  20
21      1       8  9.7   59     5  21
22     11     320 16.6   73     5  22
23      4      25  9.7   61     5  23
24     32      92 12.0   61     5  24
25     NA      66 16.6   57     5  25
26     NA     266 14.9   58     5  26
27     NA      NA  8.0   57     5  27
28     23      13 12.0   67     5  28
29     45     252 14.9   81     5  29
30    115     223  5.7   79     5  30
31     37     279  7.4   76     5  31
32     NA     286  8.6   78     6   1
33     NA     287  9.7   74     6   2
34     NA     242 16.1   67     6   3
35     NA     186  9.2   84     6   4
36     NA     220  8.6   85     6   5
37     NA     264 14.3   79     6   6
38     29     127  9.7   82     6   7
39     NA     273  6.9   87     6   8
40     71     291 13.8   90     6   9
41     39     323 11.5   87     6  10
42     NA     259 10.9   93     6  11
43     NA     250  9.2   92     6  12
44     23     148  8.0   82     6  13
45     NA     332 13.8   80     6  14
46     NA     322 11.5   79     6  15
47     21     191 14.9   77     6  16
48     37     284 20.7   72     6  17
49     20      37  9.2   65     6  18
50     12     120 11.5   73     6  19
51     13     137 10.3   76     6  20
52     NA     150  6.3   77     6  21
53     NA      59  1.7   76     6  22
54     NA      91  4.6   76     6  23
55     NA     250  6.3   76     6  24
56     NA     135  8.0   75     6  25
57     NA     127  8.0   78     6  26
58     NA      47 10.3   73     6  27
59     NA      98 11.5   80     6  28
60     NA      31 14.9   77     6  29
61     NA     138  8.0   83     6  30
62    135     269  4.1   84     7   1
63     49     248  9.2   85     7   2
64     32     236  9.2   81     7   3
65     NA     101 10.9   84     7   4
66     64     175  4.6   83     7   5
67     40     314 10.9   83     7   6
68     77     276  5.1   88     7   7
69     97     267  6.3   92     7   8
70     97     272  5.7   92     7   9
71     85     175  7.4   89     7  10
72     NA     139  8.6   82     7  11
73     10     264 14.3   73     7  12
74     27     175 14.9   81     7  13
75     NA     291 14.9   91     7  14
76      7      48 14.3   80     7  15
77     48     260  6.9   81     7  16
78     35     274 10.3   82     7  17
79     61     285  6.3   84     7  18
80     79     187  5.1   87     7  19
81     63     220 11.5   85     7  20
82     16       7  6.9   74     7  21
83     NA     258  9.7   81     7  22
84     NA     295 11.5   82     7  23
85     80     294  8.6   86     7  24
86    108     223  8.0   85     7  25
87     20      81  8.6   82     7  26
88     52      82 12.0   86     7  27
89     82     213  7.4   88     7  28
90     50     275  7.4   86     7  29
91     64     253  7.4   83     7  30
92     59     254  9.2   81     7  31
93     39      83  6.9   81     8   1
94      9      24 13.8   81     8   2
95     16      77  7.4   82     8   3
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
98     66      NA  4.6   87     8   6
99    122     255  4.0   89     8   7
100    89     229 10.3   90     8   8
101   110     207  8.0   90     8   9
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
104    44     192 11.5   86     8  12
105    28     273 11.5   82     8  13
106    65     157  9.7   80     8  14
107    NA      64 11.5   79     8  15
108    22      71 10.3   77     8  16
109    59      51  6.3   79     8  17
110    23     115  7.4   76     8  18
111    31     244 10.9   78     8  19
112    44     190 10.3   78     8  20
113    21     259 15.5   77     8  21
114     9      36 14.3   72     8  22
115    NA     255 12.6   75     8  23
116    45     212  9.7   79     8  24
117   168     238  3.4   81     8  25
118    73     215  8.0   86     8  26
119    NA     153  5.7   88     8  27
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
128    47      95  7.4   87     9   5
129    32      92 15.5   84     9   6
130    20     252 10.9   80     9   7
131    23     220 10.3   78     9   8
132    21     230 10.9   75     9   9
133    24     259  9.7   73     9  10
134    44     236 14.9   81     9  11
135    21     259 15.5   76     9  12
136    28     238  6.3   77     9  13
137     9      24 10.9   71     9  14
138    13     112 11.5   71     9  15
139    46     237  6.9   78     9  16
140    18     224 13.8   67     9  17
141    13      27 10.3   76     9  18
142    24     238 10.3   68     9  19
143    16     201  8.0   82     9  20
144    13     238 12.6   64     9  21
145    23      14  9.2   71     9  22
146    36     139 10.3   81     9  23
147     7      49 10.3   69     9  24
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

The first step is to break the data up by month, so we make use of base R‘s split() function. Notice all the data is grouped by month.

airquality %>% 
    split(.$Month)

$`5`
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19
20    11      44  9.7   62     5  20
21     1       8  9.7   59     5  21
22    11     320 16.6   73     5  22
23     4      25  9.7   61     5  23
24    32      92 12.0   61     5  24
25    NA      66 16.6   57     5  25
26    NA     266 14.9   58     5  26
27    NA      NA  8.0   57     5  27
28    23      13 12.0   67     5  28
29    45     252 14.9   81     5  29
30   115     223  5.7   79     5  30
31    37     279  7.4   76     5  31

$`6`
   Ozone Solar.R Wind Temp Month Day
32    NA     286  8.6   78     6   1
33    NA     287  9.7   74     6   2
34    NA     242 16.1   67     6   3
35    NA     186  9.2   84     6   4
36    NA     220  8.6   85     6   5
37    NA     264 14.3   79     6   6
38    29     127  9.7   82     6   7
39    NA     273  6.9   87     6   8
40    71     291 13.8   90     6   9
41    39     323 11.5   87     6  10
42    NA     259 10.9   93     6  11
43    NA     250  9.2   92     6  12
44    23     148  8.0   82     6  13
45    NA     332 13.8   80     6  14
46    NA     322 11.5   79     6  15
47    21     191 14.9   77     6  16
48    37     284 20.7   72     6  17
49    20      37  9.2   65     6  18
50    12     120 11.5   73     6  19
51    13     137 10.3   76     6  20
52    NA     150  6.3   77     6  21
53    NA      59  1.7   76     6  22
54    NA      91  4.6   76     6  23
55    NA     250  6.3   76     6  24
56    NA     135  8.0   75     6  25
57    NA     127  8.0   78     6  26
58    NA      47 10.3   73     6  27
59    NA      98 11.5   80     6  28
60    NA      31 14.9   77     6  29
61    NA     138  8.0   83     6  30

$`7`
   Ozone Solar.R Wind Temp Month Day
62   135     269  4.1   84     7   1
63    49     248  9.2   85     7   2
64    32     236  9.2   81     7   3
65    NA     101 10.9   84     7   4
66    64     175  4.6   83     7   5
67    40     314 10.9   83     7   6
68    77     276  5.1   88     7   7
69    97     267  6.3   92     7   8
70    97     272  5.7   92     7   9
71    85     175  7.4   89     7  10
72    NA     139  8.6   82     7  11
73    10     264 14.3   73     7  12
74    27     175 14.9   81     7  13
75    NA     291 14.9   91     7  14
76     7      48 14.3   80     7  15
77    48     260  6.9   81     7  16
78    35     274 10.3   82     7  17
79    61     285  6.3   84     7  18
80    79     187  5.1   87     7  19
81    63     220 11.5   85     7  20
82    16       7  6.9   74     7  21
83    NA     258  9.7   81     7  22
84    NA     295 11.5   82     7  23
85    80     294  8.6   86     7  24
86   108     223  8.0   85     7  25
87    20      81  8.6   82     7  26
88    52      82 12.0   86     7  27
89    82     213  7.4   88     7  28
90    50     275  7.4   86     7  29
91    64     253  7.4   83     7  30
92    59     254  9.2   81     7  31

$`8`
    Ozone Solar.R Wind Temp Month Day
93     39      83  6.9   81     8   1
94      9      24 13.8   81     8   2
95     16      77  7.4   82     8   3
96     78      NA  6.9   86     8   4
97     35      NA  7.4   85     8   5
98     66      NA  4.6   87     8   6
99    122     255  4.0   89     8   7
100    89     229 10.3   90     8   8
101   110     207  8.0   90     8   9
102    NA     222  8.6   92     8  10
103    NA     137 11.5   86     8  11
104    44     192 11.5   86     8  12
105    28     273 11.5   82     8  13
106    65     157  9.7   80     8  14
107    NA      64 11.5   79     8  15
108    22      71 10.3   77     8  16
109    59      51  6.3   79     8  17
110    23     115  7.4   76     8  18
111    31     244 10.9   78     8  19
112    44     190 10.3   78     8  20
113    21     259 15.5   77     8  21
114     9      36 14.3   72     8  22
115    NA     255 12.6   75     8  23
116    45     212  9.7   79     8  24
117   168     238  3.4   81     8  25
118    73     215  8.0   86     8  26
119    NA     153  5.7   88     8  27
120    76     203  9.7   97     8  28
121   118     225  2.3   94     8  29
122    84     237  6.3   96     8  30
123    85     188  6.3   94     8  31

$`9`
    Ozone Solar.R Wind Temp Month Day
124    96     167  6.9   91     9   1
125    78     197  5.1   92     9   2
126    73     183  2.8   93     9   3
127    91     189  4.6   93     9   4
128    47      95  7.4   87     9   5
129    32      92 15.5   84     9   6
130    20     252 10.9   80     9   7
131    23     220 10.3   78     9   8
132    21     230 10.9   75     9   9
133    24     259  9.7   73     9  10
134    44     236 14.9   81     9  11
135    21     259 15.5   76     9  12
136    28     238  6.3   77     9  13
137     9      24 10.9   71     9  14
138    13     112 11.5   71     9  15
139    46     237  6.9   78     9  16
140    18     224 13.8   67     9  17
141    13      27 10.3   76     9  18
142    24     238 10.3   68     9  19
143    16     201  8.0   82     9  20
144    13     238 12.6   64     9  21
145    23      14  9.2   71     9  22
146    36     139 10.3   81     9  23
147     7      49 10.3   69     9  24
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30

In the second step, we apply the purrr map() function to the linear regression model we create with lm(). We want the adjusted R-squared value for each month for Ozone. I played around with the variables a bit to find which one illustrated the adjusted R-squared values best and settled on Temp, but you can choose any other besides Month and Day.

The map() command applies the lm() function to each monthy group and yields the typical output for each month. We now have five linear regression models, one for each month, but no adjusted R-squared values.

airquality %>% 
    split(.$Month) %>% 
    map(~ lm(Ozone ~ Temp, data = .))

$`5`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -102.16         1.88  


$`6`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
     -91.99         1.55  


$`7`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -372.92         5.15  


$`8`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -238.86         3.56  


$`9`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -149.35         2.35  

To generate the adjusted R-squared values, we need to map() the summary() command to each group. This is the third step. Again, the results are familiar. We have the typical summary of the linear model, one for each month.

airquality %>% 
    split(.$Month) %>% 
    map(~ lm(Ozone ~ Temp, data = .)) %>%
    map(summary)

$`5`

Call:
lm(formula = Ozone ~ Temp, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-30.32  -8.62  -2.41   5.32  68.26 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -102.159     38.750   -2.64   0.0145 * 
Temp           1.885      0.578    3.26   0.0033 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.9 on 24 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.307, Adjusted R-squared:  0.278 
F-statistic: 10.6 on 1 and 24 DF,  p-value: 0.00331


$`6`

Call:
lm(formula = Ozone ~ Temp, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-12.99  -9.34  -6.31  11.08  23.27 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -91.991     51.312   -1.79    0.116  
Temp           1.552      0.653    2.38    0.049 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.5 on 7 degrees of freedom
  (21 observations deleted due to missingness)
Multiple R-squared:  0.447, Adjusted R-squared:  0.368 
F-statistic: 5.65 on 1 and 7 DF,  p-value: 0.0491


$`7`

Call:
lm(formula = Ozone ~ Temp, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-32.11 -14.52  -1.16   7.58  75.29 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -372.92      84.45   -4.42  0.00018 ***
Temp            5.15       1.01    5.12    3e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.3 on 24 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.522, Adjusted R-squared:  0.502 
F-statistic: 26.2 on 1 and 24 DF,  p-value: 3.05e-05


$`8`

Call:
lm(formula = Ozone ~ Temp, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-40.42 -17.65  -8.07   9.97 118.58 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -238.861     82.023   -2.91   0.0076 **
Temp           3.559      0.974    3.65   0.0013 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 32.5 on 24 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.357, Adjusted R-squared:  0.331 
F-statistic: 13.4 on 1 and 24 DF,  p-value: 0.00126


$`9`

Call:
lm(formula = Ozone ~ Temp, data = .)

Residuals:
   Min     1Q Median     3Q    Max 
-27.45  -8.59  -3.69  11.04  31.39 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -149.347     23.688   -6.30  9.5e-07 ***
Temp           2.351      0.306    7.68  2.9e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13.8 on 27 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.686, Adjusted R-squared:  0.674 
F-statistic: 58.9 on 1 and 27 DF,  p-value: 2.95e-08

She Blows em Outta the Water Like You Never Seen

Now things get interesting, because we can put it all together.

Step 4 involves using the specialized map_dbl() command to pull the adjusted R-squared values from each month’s linear model summary and output them in a single line.

How do we know the adjusted R-squared value is a double? Well, it looks like one since it consists of a floating decimal. But we could guess, too. If we were to use the map_int() command we would get an error that tells us the value is a double. If we guessed character we could use mapZ_chr. That would work but we would have the output in character form, which isn’t what we want.

So we can recognize its data form or we can figure it out through trial and error. Either way we end up at the same place, back where we started.

airquality %>% 
    split(.$Month) %>% 
    map(~ lm(Ozone ~ Temp, data = .)) %>% 
    map(summary) %>% 
    map_dbl('adj.r.squared')

     5      6      7      8      9 
0.2781 0.3676 0.5024 0.3307 0.6742

And if That Ain’t Enough to Make You Flip Your Lid

That’s one more thing I’ve got to rethink, daddy.

UPDATE, 3 May 2018

Fellow R blogger Chuck Powell suggests this neat twist on the theme, which yields a more complete statistical table, changing only the last line. I like it. Thanks Chuck!

airquality %>% 
    split(.$Month) %>% 
    map(~ lm(Ozone ~ Temp, data = .)) %>% 
    map(summary) %>% 
    map_dfr(~ broom::glance(.), .id = "Month")

  Month r.squared adj.r.squared sigma statistic   p.value df
1     5    0.3070        0.2781 18.88    10.632 3.315e-03  2
2     6    0.4467        0.3676 14.48     5.651 4.909e-02  2
3     7    0.5223        0.5024 22.32    26.241 3.048e-05  2
4     8    0.3575        0.3307 32.46    13.353 1.256e-03  2
5     9    0.6858        0.6742 13.78    58.942 2.945e-08  2

Coming to $terms in R

A recent analysis I worked on involved building a log regression and some ensemble methods using a data set with about 25 features, in addition to the target. It was an analysis of customer churn in the telecom industry. If you are interested, you can find the problem statement here, the annotated code here, and the raw code in my GitHub repository.

The problem I ran into came up later, when I wanted to reproduce a step in this analysis. I forgot the R command I used! @DarrinLRogers came through with the reminder so I thought I should write it all down in a post. You know, for next time I forget.

Working with Regression Models

I was doing some feature selection and feature engineering, so I expected to reduce the number of features significantly. Certain features were highly correlated, and there was also some redundancy, both of which meant pruning back the number of features. The number increased to 32, by the way, because almost all features had to be replaced with dummy variables.

I took advantage of the neat shortcut in R that allows you to use the Y ~ ., argument to pass all the features into a regression function. I was using glm() but it is the same syntax for lm(). The code looked like this:

LogModel2 <- glm(Churn ~ ., family = binomial(link = 'logit'), data = training)

Running the summary() command on a glm or lm object gives an indication of the relative importance of the features in the data set, which are now terms in the regression equation. Importance is given by the number of asterisks to the right of the column of p-values. You can use these to determine which terms to prune from the equation.

A similar selection analysis can be done running the anova() command. The results won’t necessarily be the same as the summary() command’s so comparisons can be helpful.

About 12 terms had to be cut out of the regression model. They weren’t adding much to performance. Having a simpler model is best, so cutting out these features made sense. But with that many to remove, rewriting the glm() command gets kind of tedious. Fortunately, I came across an R tip I found very useful. It made it possible to output the full regression model with all terms expressed, not abbreviated in the Y ~ ., syntax.

All I had to do was copy the full equation and remove the terms I didn’t need. I ended up with this:

LogModel2 <- glm(Churn ~ TotalCharges + PhoneService_Yes + MultipleLines_Yes + InternetService_DSL + 
                         `InternetService_Fiber optic` + OnlineSecurity_Yes + 
                         TechSupport_Yes + StreamingTV_Yes + StreamingMovies_Yes + 
                         `Contract_Month-to-month` + `Contract_One year` + PaperlessBilling_Yes +
                         `PaymentMethod_Electronic check` + `tenure_group_0 - 6 Mos`,
                         family = binomial(link = 'logit'), 
                         data = training)

If there is another way of doing this (and I’m sure there is) I would like to hear about it.

The call that yields this result is, in this case LogModel2$terms. It produces a lot of information about the object but the output of the full equation with all terms was most useful in this analysis.

The Takeaway

Later, I wanted to do something similar with another project I was working on, but I forgot how I did it the first time! I drew a total blank. I spent hours googling and going through my history in R-Studio but I couldn’t find the code. It was terribly frustrating.

Finally, in desperation, I turned to the #rstats channel in Twitter and the R4ds Slack group. It was a mistake. I should have gone there first.

I got a lot of good responses and helpful ideas on both channels but none was exactly what I was looking for. I realized this object$terms call might not be well known in the community. So I vowed when I figured it out, or if someone pointed me in the right direction, I would pass along the knowledge in a blog post.

Fortunately, @DarrinLRogers came through with the solution I was looking for. Thanks to him and all the others who pitched in with good ideas.

You can learn more about the R4ds Slack channel from Jesse Meagan. Sign up for R4ds Slack here.

Mission accomplished.

To Loop or Not to Loop?

The question of whether to loop in R or not, or what are the appropriate circumstances which would lead someone to loop in R, has strong proponents on both sides. Some argue loops are useful in R and there is nothing wrong with using them. On the other side are those who refuse to use loops at all because the loop structure is contrary to R‘s vectorized approach.

This one, a tutorial on how to use loops in R, manages to be on both sides at once. It advises, “after you have gotten a clear understanding of loops, get rid of them.” It comes down on the side of the vector: “Put your effort into learning about vectorized alternatives. It pays off in terms of efficiency.”

In a 2014 blog post, “Vectorization in R: Why?”, Noam Ross wrote it sometimes “may make sense to use a for loop, especially if they are more intuitive or easier to read for you” (emphasis in the original). His post is a good explanation of how vectorization works in R and why.

Jenny Bryan of the University of British Columbia posted an excellent speaker deck on row-wise operations in R. One slide makes the plea, “Of course someone has to write loops. It doesn’t have to be you.”

With the Vector People

Full disclosure: my position is with the vector people. I learned R from those who believe loops aren’t necessary and that a vector solution is always preferable. It works pretty well for me and I have gone a long while without writing a loop.

Until now.

I decided to challenge myself recently when a small project landed on my desk. It involved creating a pivot table using data from some utility customers. One chunk of code would have been a good candidate for a loop, but I stuck to my principles and simply wrote it out in a straightforward, line by line manner. The code worked fine and it produced the desired pivot table.

Someone saw the code and commented that I really should have used a loop there. I said I thought about it, but I try to avoid loops at all costs so I went with a more copy-and-paste approach. His problem was I violated the DRY rule (Don’t Repeat Yourself) by copying the original command 23 times (24 lines in all) and adjusting the arguments slightly in each subsequent line. That’s true. Everything is a trade-off.

I was editing a character string. I had to delete all the characters but one in a four-letter code, then I had to convert the surviving letter to one of three words, depending on what the character was. So what started as a code ended up as a word. I had to do this for three data files, each with about 50,000 rows.

Putting Intuition to the Test

Frankly, I didn’t know offhand how to get R to delete in one shot the three characters I wanted gone and then convert the remaining character into the appropriate word. I knew I could do the job a step at a time with gsub(). There would be minimal Googling and searching StackOverflow and I would have a straight-line vector result. So not using a loop looked like the best use of my time, despite the DRY violation.

Yesterday, I decided to go back and test my intuition. So I recreated my effort to write the long code.

I took five minutes to Google a loop solution for the problem and went through StackOverflow looking for examples. I couldn’t find anything I could work with. I read the gsub() help page on R to make sure I got the order of arguments right. That took another five minutes. Then I sat down to produce the 24 lines of code, copy-and-paste style. That took another five minutes. It was tedious, sure, but it didn’t take long. Just 15 minutes and I was done.

Here is what the code looks like in the original form:

    FY15W$rate_code <- gsub('W', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('1', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('2', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('3', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('4', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('R', 'Residential', FY15W$rate_code)
    FY15W$rate_code <- gsub('C', 'Commercial', FY15W$rate_code)
    FY15W$rate_code <- gsub('M', 'Multi-Family', FY15W$rate_code)
    FY16W$rate_code <- gsub('W', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('1', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('2', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('3', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('4', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('R', 'Residential', FY16W$rate_code)
    FY16W$rate_code <- gsub('C', 'Commercial', FY16W$rate_code)
    FY16W$rate_code <- gsub('M', 'Multi-Family', FY16W$rate_code)
    FY17W$rate_code <- gsub('W', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('1', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('2', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('3', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('4', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('R', 'Residential', FY17W$rate_code)
    FY17W$rate_code <- gsub('C', 'Commercial', FY17W$rate_code)
    FY17W$rate_code <- gsub('M', 'Multi-Family', FY17W$rate_code)

It is pretty repetitive and was tedious to put together. But it works fine.

Since this was a recreation of my first effort, let’s double that 15 minutes and say it took me half an hour the first time. I had to check for errors, fix typos, and test the code to make sure it worked properly, so 30 minutes isn’t a bad estimate.

Building the Loop

Then I went to build a loop that would have the same functionality. Here is the code for the loop:

    for (i in seq_along(WaterData$Class)) {
            WaterData$Class <- gsub('W', '', WaterData$Class)
            WaterData$Class <- gsub('1', '', WaterData$Class)
            WaterData$Class <- gsub('2', '', WaterData$Class)
            WaterData$Class <- gsub('3', '', WaterData$Class)
            WaterData$Class <- gsub('4', '', WaterData$Class)
            WaterData$Class <- gsub('R', 'Residential', WaterData$Class)
            WaterData$Class <- gsub('C', 'Commercial', WaterData$Class)
            WaterData$Class <- gsub('M', 'Multi-Family', WaterData$Class)
            break
    }

My intuition was right. It took me about 1 hour and 20 minutes. The loop is ten lines of code, 60% fewer lines, but it took three times longer to create. So the loop was not more efficient in terms of time spent creating the code. I have to admit it is prettier and more readable, but this is throwaway code for a one-shot analysis so those weren’t high on my list of priorities.

Break Time

I had to spend a good part of that time debugging the loop because it didn’t work the way I expected right out of the box. It deleted the unwanted characters just fine but when it came to replacing a character with a word, the loop worked too well. Instead of replacing the remaining C with the word Commercial, our loop friend looped back to the C in “Commercial” to produced “Commercialommercial” and then back again to yield “Commercialommercialommercial” and so on until I had a string of clipped “Commercial” nine long!

I realized in a bit I had to insert a break to shut down the loop. Then it worked beautifully and quickly. But this illustrates really well the problem of using loops in a vector environment. They just don’t work how they should. And they can take more time to create than a simple, straightforward, non-loopy solution.

Update (18 Apr 18)

A much better, non-loop solution, was suggested by Chuck Powers, who writes an informative blog. Chuck points out that the stringr package, part of the tidyverse, includes a function called str_replace_all(). Sure enough, I used followed his adviced and reduced the 24-line chunk above to these six lines:

FY15W$rate_code <- str_replace_all(FY15W$rate_code, '[W1234]', '')
FY15W$rate_code <- str_replace_all(FY15W$rate_code, c('R'= 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
FY16W$rate_code <- str_replace_all(FY16W$rate_code, '[W1234]', '')
FY16W$rate_code <- str_replace_all(FY16W$rate_code, c('R'= 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
FY17W$rate_code <- str_replace_all(FY17W$rate_code, '[W1234]', '')
FY17W$rate_code <- str_replace_all(FY17W$rate_code, c('R'= 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))

All without looping! Thanks to Chuck for the tip.

Python’s Keras Library in R, Part 2

So I learned in the previous post that if an R user wants to load the Python keras library into R to run neural net models, it is necessary to load Python first. The keras package in R is an interface with Python, not a standalone package.

That’s fine, but it would have been nice to know beforehand. So I thought I should write it down for others.

Loaded Anaconda 3 Earlier

Fortunately, I loaded Anaconda 3 into my system earlier this year in preparation for a program at UCLA on data science. We have been using Jupyter Notebooks and, lately, Jupyter Lab to run both R and Python code, so I have much of the Python infrastructure set up. Anaconda 3 is a pretty easy installation, though it does take some time due to its size. It is a good place to start if you need a Python environment.

Anaconda Navigator is the main package in the Anaconda 3 suite, and it comes with a version of R Studio. I can’t say anything about that, but some users might find it convenient to have both Python and R Studio in the same software suite.

If you are working in Notebook or Lab, Docker is another useful program to have running on your system. You have to access and operate it through the shell. A thorough treatment of Docker and all its intricacies can be found in Joshua Cook’s book, Docker for Data Science: Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server, available from Apress.

One way or another, however you work with Python, setting it up is necessary for loading the keras package into R. And since keras works with TensorFlow you will need to load the R library tensorflow, as well, but that should not be too demanding.

Easiest to Work with Python in the Command Window

Once you have set up Anaconda 3, it is probably easiest to work with Python through the command window, Anaconda Prompt. The keras package does not come preloaded in Anaconda, so you have to install it. I found on Git the code making this possible, and if you are not familiar with Python it might be easiest just to follow this approach.

At the command line in Anaconda Prompt, you need to enter:

pip install keras.

Then import keras.

That’s it.

Though I don’t have personal experience using it, another method I understand works, which you should try only if the previous command does not help, is to enter:

sudo pip3 install keras.

A common mistake is to enter instead:

conda install keras

So try to avoid that.

It takes a while to load keras into Python, so be patient and enjoy watching the forward slash spin around while the program does its thing.

Good to Go

Once you have keras loaded, go back to your R environment and install and load the CRAN version of the library.

You should be good to go.

The Pyimagesearch blog has a good post on the keras installation in much more detail if you are interested. I didn’t find it necessary to carry out the steps described there for editing the keras.json config file, setting up GPU support, or accessing OpenCV bindings, but if you do this is a good reference.

We can talk about the R interface with TensorFlow some other time.

Loading Python’s Keras into R, Part 1

Late last year, Matt Dancho had a post on deep learning celebrating the arrival of the Python keras package for R. It is a very good tutorial on using artificial neural networks (ANN) to solve complicated business problems, well worth checking out.

Took More Doing Than I Thought

I started working with neural networks over a decade ago with Palisade Decision Tree software, which includes NeuralTools, a neural network add-in for Excel. It’s a quality program that works well, but it is subject to constraints imposed by Excel. So I looked forward to playing around with keras and getting a sense of how R works with neural nets.

What I didn’t know is that in order to use keras in R it is necessary to have the keras Python library loaded and ready to go. This took more doing than I thought it would.

Of course, R has native neural network and deep learning packages, such as nnet and RSNNS, among others. But the idea of R joining forces with Python to implement a keras package is a welcome addition and one I wanted to try. I went through the R-Studio cheat sheet on keras and decided to make a go.

Straight to GTS Mode

Things went smoothly until I got to actually building and running the keras model. I was immediately faced with a long list of warnings followed by the failure of the model to run. I ran the code a couple more times to see if I could figure out what was going on. Each time, the same warnings popped up.

In looking closely at the warnings I finally noticed, buried among them towards the bottom, this error message:

ModuleNotFoundError: No module named 'keras'

I checked to make sure the keras library was loaded in my environment and running. It was. A lesson from a long ago data science class came to mind and I went straight to GTS mode. All I could find were references to keras in Python. There was nothing about this error message in R.

GitHub was the most help. There I found a thread on “No module named keras: #4889”. But it was short and was closed down due to lack of use in late 2017.

That thread contained a few snippets of Python code that helped me figure out the problem. For keras to run in R you need to have keras loaded in Python. Which means you need to have Anaconda Prompt or JupyterLab loaded in your system, as well as R.

Lesson Learned

This was news to me. It’s not mentioned in the keras cheat sheet or in Matt’s blog post.

In fact, the keras cheat sheet mentions in the “Installation” section that “the keras R package uses the Python keras library. You can install all the prerequisites directly from R.”

That wasn’t the case for me. There’s a note that says “See ?keras_install for GPU instructions,” but when I run the command I get “No results found.”

I guess it is common knowledge, but somehow I did not get the memo. Many others are probably unaware. Hence, this post.

The lesson here is read the documentation. Keras in R is the interface to Python’s keras. No Python, no keras in R.

More on this is the next post.