To Loop or Not to Loop?

The question of whether to loop in R, and under what circumstances, has strong proponents on both sides. Some argue loops are useful in R and there is nothing wrong with using them. Others refuse to use loops at all because the loop structure runs contrary to R’s vectorized approach.

This one, a tutorial on how to use loops in R, manages to be on both sides at once. It advises, “after you have gotten a clear understanding of loops, get rid of them.” It comes down on the side of the vector: “Put your effort into learning about vectorized alternatives. It pays off in terms of efficiency.”

In a 2014 blog post, “Vectorization in R: Why?”, Noam Ross wrote that it sometimes “may make sense to use a for loop, especially if they are more intuitive or easier to read for you” (emphasis in the original). His post is a good explanation of how vectorization works in R and why.

Jenny Bryan of the University of British Columbia posted an excellent speaker deck on row-wise operations in R. One slide makes the plea, “Of course someone has to write loops. It doesn’t have to be you.”

With the Vector People

Full disclosure: my position is with the vector people. I learned R from those who believe loops aren’t necessary and that a vector solution is always preferable. It works pretty well for me and I have gone a long while without writing a loop.

Until now.

I decided to challenge myself recently when a small project landed on my desk. It involved creating a pivot table using data from some utility customers. One chunk of code would have been a good candidate for a loop, but I stuck to my principles and simply wrote it out in a straightforward, line-by-line manner. The code worked fine and it produced the desired pivot table.

Someone saw the code and commented that I really should have used a loop there. I said I had thought about it, but I try to avoid loops at all costs, so I went with a more copy-and-paste approach. His objection was that I had violated the DRY rule (Don’t Repeat Yourself) by copying the original command 23 times (24 lines in all) and adjusting the arguments slightly in each subsequent line. That’s true. Everything is a trade-off.

I was editing a character string. I had to delete all the characters but one in a four-letter code, then I had to convert the surviving letter to one of three words, depending on what the character was. So what started as a code ended up as a word. I had to do this for three data files, each with about 50,000 rows.

Putting Intuition to the Test

Frankly, I didn’t know offhand how to get R to delete in one shot the three characters I wanted gone and then convert the remaining character into the appropriate word. I knew I could do the job a step at a time with gsub(). There would be minimal Googling and searching StackOverflow and I would have a straight-line vector result. So not using a loop looked like the best use of my time, despite the DRY violation.
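As a rough illustration of that step-at-a-time idea, here is a sketch on a toy vector. The sample codes are invented for the example (the real rate codes are not reproduced in this post); only the characters being removed and the three surviving letters come from the actual job.

```r
# Hypothetical four-character rate codes, made up for illustration.
x <- c("WR12", "WC34", "WM21")

# Delete the unwanted characters one at a time.
x <- gsub("W", "", x)
x <- gsub("1", "", x)
x <- gsub("2", "", x)
x <- gsub("3", "", x)
x <- gsub("4", "", x)
stripped <- x   # one letter left per code

# Convert the surviving letter to a word.
x <- gsub("R", "Residential", x)
x <- gsub("C", "Commercial", x)
x <- gsub("M", "Multi-Family", x)

print(stripped)
print(x)
```

Each gsub() call works on the whole vector at once, which is why no loop is needed to reach every row.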

Yesterday, I decided to go back and test my intuition. So I recreated my effort to write the long code.

I took five minutes to Google a loop solution for the problem and went through StackOverflow looking for examples. I couldn’t find anything I could work with. I read the gsub() help page in R to make sure I got the order of arguments right. That took another five minutes. Then I sat down to produce the 24 lines of code, copy-and-paste style. That took another five minutes. It was tedious, sure, but it didn’t take long. Just 15 minutes and I was done.

Here is what the code looks like in the original form:

    FY15W$rate_code <- gsub('W', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('1', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('2', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('3', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('4', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('R', 'Residential', FY15W$rate_code)
    FY15W$rate_code <- gsub('C', 'Commercial', FY15W$rate_code)
    FY15W$rate_code <- gsub('M', 'Multi-Family', FY15W$rate_code)
    FY16W$rate_code <- gsub('W', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('1', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('2', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('3', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('4', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('R', 'Residential', FY16W$rate_code)
    FY16W$rate_code <- gsub('C', 'Commercial', FY16W$rate_code)
    FY16W$rate_code <- gsub('M', 'Multi-Family', FY16W$rate_code)
    FY17W$rate_code <- gsub('W', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('1', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('2', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('3', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('4', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('R', 'Residential', FY17W$rate_code)
    FY17W$rate_code <- gsub('C', 'Commercial', FY17W$rate_code)
    FY17W$rate_code <- gsub('M', 'Multi-Family', FY17W$rate_code)

It is pretty repetitive and was tedious to put together. But it works fine.

Since this was a recreation of my first effort, let’s double that 15 minutes and say it took me half an hour the first time. I had to check for errors, fix typos, and test the code to make sure it worked properly, so 30 minutes isn’t a bad estimate.

Building the Loop

Then I went to build a loop that would have the same functionality. Here is the code for the loop:

    for (i in seq_along(WaterData$Class)) {
        WaterData$Class <- gsub('W', '', WaterData$Class)
        WaterData$Class <- gsub('1', '', WaterData$Class)
        WaterData$Class <- gsub('2', '', WaterData$Class)
        WaterData$Class <- gsub('3', '', WaterData$Class)
        WaterData$Class <- gsub('4', '', WaterData$Class)
        WaterData$Class <- gsub('R', 'Residential', WaterData$Class)
        WaterData$Class <- gsub('C', 'Commercial', WaterData$Class)
        WaterData$Class <- gsub('M', 'Multi-Family', WaterData$Class)
        break
    }

My intuition was right. It took me about 1 hour and 20 minutes. The loop is eleven lines of code, about 54% fewer than the 24-line version, but it took nearly three times as long to create. So the loop was not more efficient in terms of time spent creating the code. I have to admit it is prettier and more readable, but this is throwaway code for a one-shot analysis, so those weren’t high on my list of priorities.

Break Time

I had to spend a good part of that time debugging the loop because it didn’t work the way I expected right out of the box. It deleted the unwanted characters just fine, but when it came to replacing a character with a word, the loop worked too well. Instead of replacing the remaining C with the word Commercial and stopping, our loop friend looped back to the C in “Commercial” to produce “Commercialommercial” and then back again to yield “Commercialommercialommercial” and so on until I had a string of clipped “Commercial” nine long!

After a while I realized I had to insert a break to shut down the loop. Then it worked beautifully and quickly. But this illustrates really well the problem of using loops in a vector environment. They just don’t work how they should. And they can take more time to create than a simple, straightforward, non-loopy solution.
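A minimal sketch of what seems to be happening, using toy values rather than the real data: gsub() already operates on the whole vector, so every extra trip through the loop reapplies each replacement to every row. The anchored pattern at the end is one possible guard and is an addition for illustration, not something from the original code.

```r
# Toy values standing in for codes that have already been stripped to one letter.
x <- c("C", "R", "C")

# One call rewrites every element of the vector.
once <- gsub("C", "Commercial", x)

# A second pass finds the capital C inside "Commercial" and replaces it again.
twice <- gsub("C", "Commercial", once)

print(once)
print(twice)

# Anchoring the pattern to the whole string makes the call safe to repeat:
# "Commercial" is not exactly "C", so nothing changes on a second pass.
anchored <- gsub("^C$", "Commercial", once)
print(anchored)
```

Seen this way, the break works because it stops the loop after the first pass, which is all the vectorized calls ever needed.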

Update (18 Apr 18)

A much better non-loop solution was suggested by Chuck Powers, who writes an informative blog. Chuck points out that the stringr package, part of the tidyverse, includes a function called str_replace_all(). Sure enough, I followed his advice and reduced the 24-line chunk above to these six lines (stringr has to be loaded first with library(stringr)):

    FY15W$rate_code <- str_replace_all(FY15W$rate_code, '[W1234]', '')
    FY15W$rate_code <- str_replace_all(FY15W$rate_code, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
    FY16W$rate_code <- str_replace_all(FY16W$rate_code, '[W1234]', '')
    FY16W$rate_code <- str_replace_all(FY16W$rate_code, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
    FY17W$rate_code <- str_replace_all(FY17W$rate_code, '[W1234]', '')
    FY17W$rate_code <- str_replace_all(FY17W$rate_code, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))

All without looping! Thanks to Chuck for the tip.
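For anyone still bothered by the repetition across the three data frames, one possible next step is to wrap the work in a small function. This is only a sketch under assumptions: recode_rate is a hypothetical helper name, the data frames below are toy stand-ins with invented codes, and it swaps in base R (gsub() plus a named lookup vector) in place of stringr so that it runs without extra packages.

```r
# Toy stand-ins for the three files; the codes are invented for illustration.
FY15W <- data.frame(rate_code = c("WR12", "WC34"), stringsAsFactors = FALSE)
FY16W <- data.frame(rate_code = c("WM21", "WR43"), stringsAsFactors = FALSE)
FY17W <- data.frame(rate_code = c("WC11", "WM22"), stringsAsFactors = FALSE)

# Hypothetical helper: strip W and the digits 1-4, then look up the
# surviving letter in a named vector.
recode_rate <- function(code) {
  classes <- c(R = "Residential", C = "Commercial", M = "Multi-Family")
  letter <- gsub("[W1234]", "", code)
  unname(classes[letter])
}

# One line per data frame, still no loop.
FY15W$rate_code <- recode_rate(FY15W$rate_code)
FY16W$rate_code <- recode_rate(FY16W$rate_code)
FY17W$rate_code <- recode_rate(FY17W$rate_code)

print(FY15W$rate_code)
print(FY16W$rate_code)
print(FY17W$rate_code)
```

One difference from the replacement approach is worth knowing: any code that does not reduce to R, C, or M comes back as NA from the lookup instead of passing through unchanged, which may or may not be what a given analysis wants.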