MungeX-3D: The Blog Posts

Coming to $terms in R

A recent analysis I worked on involved building a logistic regression model and some ensemble methods on a data set with about 25 features, plus the target. It was an analysis of customer churn in the telecom industry. If you are interested, you can find the problem statement here, the annotated code here, and the raw code in my GitHub repository.

The problem I ran into came up later, when I wanted to reproduce a step in this analysis. I forgot the R command I used! @DarrinLRogers came through with the reminder so I thought I should write it all down in a post. You know, for next time I forget.

Working with Regression Models

I was doing some feature selection and feature engineering, so I expected to reduce the number of features significantly. Certain features were highly correlated, and there was also some redundancy, both of which called for pruning back the number of features. (The feature count actually increased to 32 at first, by the way, because almost all the categorical features had to be replaced with dummy variables.)

I took advantage of the neat shortcut in R that allows you to use the formula Y ~ . to pass all the features into a regression function. I was using glm(), but the syntax is the same for lm(). The code looked like this:

LogModel2 <- glm(Churn ~ ., family = binomial(link = 'logit'), data = training)

Running the summary() command on a glm or lm object gives an indication of the statistical significance of the features in the data set, which are now terms in the regression equation. Significance is indicated by the number of asterisks to the right of the column of p-values. You can use these to decide which terms to prune from the equation.

A similar selection analysis can be done with the anova() command. Its results won't necessarily match summary()'s, since anova() tests terms sequentially, so comparing the two can be helpful.
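As a sketch of what that comparison looks like, here is a toy logistic regression on a built-in data set (mtcars standing in for the churn data, so the variable names here are not from the actual analysis):

```r
# Toy stand-in for the churn model: predict transmission type (am)
model <- glm(am ~ mpg + wt + hp, family = binomial(link = 'logit'), data = mtcars)

summary(model)               # Wald z-tests, with significance stars per term
anova(model, test = "Chisq") # sequential likelihood-ratio tests; term order matters
```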

About 12 terms had to be cut from the regression model. They weren't adding much to performance, and a simpler model is best, so cutting those features made sense. But with that many to remove, rewriting the glm() command by hand gets tedious. Fortunately, I came across an R tip I found very useful. It made it possible to output the full regression model with every term written out, rather than abbreviated in the Y ~ . syntax.

All I had to do was copy the full equation and remove the terms I didn’t need. I ended up with this:

LogModel2 <- glm(Churn ~ TotalCharges + PhoneService_Yes + MultipleLines_Yes + InternetService_DSL + 
                         `InternetService_Fiber optic` + OnlineSecurity_Yes + 
                         TechSupport_Yes + StreamingTV_Yes + StreamingMovies_Yes + 
                         `Contract_Month-to-month` + `Contract_One year` + PaperlessBilling_Yes +
                         `PaymentMethod_Electronic check` + `tenure_group_0 - 6 Mos`,
                         family = binomial(link = 'logit'), 
                         data = training)

If there is another way of doing this (and I’m sure there is) I would like to hear about it.

The call that yields this result is, in this case, LogModel2$terms. It produces a lot of information about the object, but the output of the full equation with all terms written out was what I found most useful in this analysis.
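Here is the idea on the same kind of toy model (mtcars standing in for the churn data). The reformulate() step is my own addition, not part of the original tip, for rebuilding a pruned formula without copy-and-paste:

```r
model <- glm(am ~ mpg + wt + hp, family = binomial(link = 'logit'), data = mtcars)

model$terms                       # the expanded formula, plus a lot of attributes
attr(model$terms, "term.labels")  # just the term names: "mpg" "wt" "hp"

# Drop a term and rebuild the formula programmatically:
keep <- setdiff(attr(model$terms, "term.labels"), "hp")
reformulate(keep, response = "am")  # am ~ mpg + wt
```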

The Takeaway

Later, I wanted to do something similar with another project I was working on, but I forgot how I did it the first time! I drew a total blank. I spent hours googling and going through my history in R-Studio but I couldn’t find the code. It was terribly frustrating.

Finally, in desperation, I turned to the #rstats channel on Twitter and the R4ds Slack group. Waiting that long was a mistake. I should have gone there first.

I got a lot of good responses and helpful ideas on both channels but none was exactly what I was looking for. I realized this object$terms call might not be well known in the community. So I vowed when I figured it out, or if someone pointed me in the right direction, I would pass along the knowledge in a blog post.

Fortunately, @DarrinLRogers came through with the solution I was looking for. Thanks to him and all the others who pitched in with good ideas.

You can learn more about the R4ds Slack channel from Jesse Meagan. Sign up for R4ds Slack here.

Mission accomplished.

To Loop or Not to Loop?

The question of whether to loop in R, and under what circumstances a loop is appropriate, has strong proponents on both sides. Some argue loops are useful in R and there is nothing wrong with using them. Others refuse to use loops at all because the loop structure runs contrary to R's vectorized approach.

This one, a tutorial on how to use loops in R, manages to be on both sides at once. It advises, “after you have gotten a clear understanding of loops, get rid of them.” It comes down on the side of the vector: “Put your effort into learning about vectorized alternatives. It pays off in terms of efficiency.”

In a 2014 blog post, “Vectorization in R: Why?”, Noam Ross wrote it sometimes “may make sense to use a for loop, especially if they are more intuitive or easier to read for you” (emphasis in the original). His post is a good explanation of how vectorization works in R and why.

Jenny Bryan of the University of British Columbia posted an excellent speaker deck on row-wise operations in R. One slide makes the plea, “Of course someone has to write loops. It doesn’t have to be you.”

With the Vector People

Full disclosure: my position is with the vector people. I learned R from those who believe loops aren’t necessary and that a vector solution is always preferable. It works pretty well for me and I have gone a long while without writing a loop.

Until now.

I decided to challenge myself recently when a small project landed on my desk. It involved creating a pivot table using data from some utility customers. One chunk of code would have been a good candidate for a loop, but I stuck to my principles and simply wrote it out in a straightforward, line by line manner. The code worked fine and it produced the desired pivot table.

Someone saw the code and commented that I really should have used a loop there. I said I had thought about it, but I try to avoid loops at all costs, so I went with a more copy-and-paste approach. His objection was that I violated the DRY rule (Don't Repeat Yourself) by copying the original command 23 times (24 lines in all) and adjusting the arguments slightly in each subsequent line. That's true. Everything is a trade-off.

I was editing a character string. I had to delete all the characters but one in a four-letter code, then convert the surviving letter to one of three words, depending on which character it was. So what started as a code ended up as a word. I had to do this for three data files, each with about 50,000 rows.

Putting Intuition to the Test

Frankly, I didn’t know offhand how to get R to delete in one shot the three characters I wanted gone and then convert the remaining character into the appropriate word. I knew I could do the job a step at a time with gsub(). There would be minimal Googling and searching StackOverflow and I would have a straight-line vector result. So not using a loop looked like the best use of my time, despite the DRY violation.

Yesterday, I decided to go back and test my intuition. So I recreated my effort to write the long code.

I took five minutes to Google a loop solution for the problem and went through StackOverflow looking for examples. I couldn’t find anything I could work with. I read the gsub() help page on R to make sure I got the order of arguments right. That took another five minutes. Then I sat down to produce the 24 lines of code, copy-and-paste style. That took another five minutes. It was tedious, sure, but it didn’t take long. Just 15 minutes and I was done.

Here is what the code looks like in the original form:

    FY15W$rate_code <- gsub('W', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('1', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('2', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('3', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('4', '', FY15W$rate_code)
    FY15W$rate_code <- gsub('R', 'Residential', FY15W$rate_code)
    FY15W$rate_code <- gsub('C', 'Commercial', FY15W$rate_code)
    FY15W$rate_code <- gsub('M', 'Multi-Family', FY15W$rate_code)
    FY16W$rate_code <- gsub('W', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('1', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('2', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('3', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('4', '', FY16W$rate_code)
    FY16W$rate_code <- gsub('R', 'Residential', FY16W$rate_code)
    FY16W$rate_code <- gsub('C', 'Commercial', FY16W$rate_code)
    FY16W$rate_code <- gsub('M', 'Multi-Family', FY16W$rate_code)
    FY17W$rate_code <- gsub('W', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('1', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('2', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('3', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('4', '', FY17W$rate_code)
    FY17W$rate_code <- gsub('R', 'Residential', FY17W$rate_code)
    FY17W$rate_code <- gsub('C', 'Commercial', FY17W$rate_code)
    FY17W$rate_code <- gsub('M', 'Multi-Family', FY17W$rate_code)

It is pretty repetitive and was tedious to put together. But it works fine.

Since this was a recreation of my first effort, let’s double that 15 minutes and say it took me half an hour the first time. I had to check for errors, fix typos, and test the code to make sure it worked properly, so 30 minutes isn’t a bad estimate.

Building the Loop

Then I went to build a loop that would have the same functionality. Here is the code for the loop:

    for (i in seq_along(WaterData$Class)) {
            WaterData$Class <- gsub('W', '', WaterData$Class)
            WaterData$Class <- gsub('1', '', WaterData$Class)
            WaterData$Class <- gsub('2', '', WaterData$Class)
            WaterData$Class <- gsub('3', '', WaterData$Class)
            WaterData$Class <- gsub('4', '', WaterData$Class)
            WaterData$Class <- gsub('R', 'Residential', WaterData$Class)
            WaterData$Class <- gsub('C', 'Commercial', WaterData$Class)
            WaterData$Class <- gsub('M', 'Multi-Family', WaterData$Class)
    }
My intuition was right. It took me about 1 hour and 20 minutes. The loop is ten lines of code, 60% fewer lines, but it took three times longer to create. So the loop was not more efficient in terms of time spent creating the code. I have to admit it is prettier and more readable, but this is throwaway code for a one-shot analysis so those weren’t high on my list of priorities.

Break Time

I had to spend a good part of that time debugging the loop because it didn’t work the way I expected right out of the box. It deleted the unwanted characters just fine, but when it came to replacing a character with a word, the loop worked too well. Instead of replacing the remaining C with the word Commercial, our loop friend looped back to the C in “Commercial” to produce “Commercialommercial”, then back again to yield “Commercialommercialommercial”, and so on until I had a string of clipped “Commercial”s nine long!

I realized after a bit that I had to insert a break to shut down the loop. Then it worked beautifully and quickly. But this illustrates really well the problem of using loops in a vector environment. They just don’t work the way you expect them to. And they can take more time to create than a simple, straightforward, non-loopy solution.
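As best I can reconstruct it, the fix amounted to breaking out after the first pass. Since gsub() is vectorized, one pass already handles every row, and further passes just re-replace the C in "Commercial":

```r
for (i in seq_along(WaterData$Class)) {
        WaterData$Class <- gsub('W', '', WaterData$Class)
        WaterData$Class <- gsub('1', '', WaterData$Class)
        WaterData$Class <- gsub('2', '', WaterData$Class)
        WaterData$Class <- gsub('3', '', WaterData$Class)
        WaterData$Class <- gsub('4', '', WaterData$Class)
        WaterData$Class <- gsub('R', 'Residential', WaterData$Class)
        WaterData$Class <- gsub('C', 'Commercial', WaterData$Class)
        WaterData$Class <- gsub('M', 'Multi-Family', WaterData$Class)
        break  # one pass is enough; looping again mangles "Commercial"
}
```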

Update (18 Apr 18)

A much better non-loop solution was suggested by Chuck Powers, who writes an informative blog. Chuck pointed out that the stringr package, part of the tidyverse, includes a function called str_replace_all(). Sure enough, I followed his advice and reduced the 24-line chunk above to these six lines:

library(stringr)

FY15W$rate_code <- str_replace_all(FY15W$rate_code, '[W1234]', '')
FY15W$rate_code <- str_replace_all(FY15W$rate_code, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
FY16W$rate_code <- str_replace_all(FY16W$rate_code, '[W1234]', '')
FY16W$rate_code <- str_replace_all(FY16W$rate_code, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
FY17W$rate_code <- str_replace_all(FY17W$rate_code, '[W1234]', '')
FY17W$rate_code <- str_replace_all(FY17W$rate_code, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))

All without looping! Thanks to Chuck for the tip.
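For what it's worth, those six lines could arguably be squeezed further with a small helper function, since the same two steps apply to each year's file. This is my own sketch, not part of Chuck's tip, and recode_rate is a made-up name:

```r
library(stringr)

# Strip the unwanted characters, then expand the surviving letter to a word
recode_rate <- function(x) {
  x <- str_replace_all(x, '[W1234]', '')
  str_replace_all(x, c('R' = 'Residential', 'C' = 'Commercial', 'M' = 'Multi-Family'))
}

FY15W$rate_code <- recode_rate(FY15W$rate_code)
FY16W$rate_code <- recode_rate(FY16W$rate_code)
FY17W$rate_code <- recode_rate(FY17W$rate_code)
```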

Python’s Keras Library in R, Part 2

So I learned in the previous post that if an R user wants to load the Python keras library into R to run neural net models, it is necessary to load Python first. The keras package in R is an interface to the Python library, not a standalone package.

That’s fine, but it would have been nice to know beforehand. So I thought I should write it down for others.

Loaded Anaconda 3 Earlier

Fortunately, I loaded Anaconda 3 into my system earlier this year in preparation for a program at UCLA on data science. We have been using Jupyter Notebooks and, lately, Jupyter Lab to run both R and Python code, so I have much of the Python infrastructure set up. Anaconda 3 is a pretty easy installation, though it does take some time due to its size. It is a good place to start if you need a Python environment.

Anaconda Navigator is the main package in the Anaconda 3 suite, and it comes with a version of R Studio. I can’t say anything about that, but some users might find it convenient to have both Python and R Studio in the same software suite.

If you are working in Notebook or Lab, Docker is another useful program to have running on your system. You have to access and operate it through the shell. A thorough treatment of Docker and all its intricacies can be found in Joshua Cook’s book, Docker for Data Science: Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server, available from Apress.

One way or another, however you work with Python, setting it up is necessary for loading the keras package into R. And since keras works with TensorFlow you will need to load the R library tensorflow, as well, but that should not be too demanding.

Easiest to Work with Python in the Command Window

Once you have set up Anaconda 3, it is probably easiest to work with Python through the command window, Anaconda Prompt. The keras package does not come preloaded in Anaconda, so you have to install it. I found code on GitHub making this possible, and if you are not familiar with Python it might be easiest just to follow this approach.

At the command line in Anaconda Prompt, you need to enter:

pip install keras

Then, in Python, run import keras.

That’s it.

Though I don’t have personal experience with it, another method I understand works, which you should try only if the previous command does not help, is to enter:

sudo pip3 install keras

A common mistake is to enter instead:

conda install keras

So try to avoid that.

It takes a while to load keras into Python, so be patient and enjoy watching the forward slash spin around while the program does its thing.

Good to Go

Once you have keras loaded, go back to your R environment and install and load the CRAN version of the library.
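In case it is useful, the R side of that, as I understand it, is just two commands. (The package's install_keras() helper can reportedly set up the Python side for you as well, though as described above I went the pip route.)

```r
install.packages("keras")  # the CRAN interface package
library(keras)
# keras::install_keras() can also install the Python backend for you
```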

You should be good to go.

The Pyimagesearch blog has a good post covering the keras installation in much more detail if you are interested. I didn’t find it necessary to carry out the steps described there for editing the keras.json config file, setting up GPU support, or accessing OpenCV bindings, but if you do, this is a good reference.

We can talk about the R interface with TensorFlow some other time.

Loading Python’s Keras into R, Part 1

Late last year, Matt Dancho had a post on deep learning celebrating the arrival of the Python keras package for R. It is a very good tutorial on using artificial neural networks (ANN) to solve complicated business problems, well worth checking out.

Took More Doing Than I Thought

I started working with neural networks over a decade ago with Palisade Decision Tree software, which includes NeuralTools, a neural network add-in for Excel. It’s a quality program that works well, but it is subject to constraints imposed by Excel. So I looked forward to playing around with keras and getting a sense of how R works with neural nets.

What I didn’t know is that in order to use keras in R it is necessary to have the keras Python library loaded and ready to go. This took more doing than I thought it would.

Of course, R has native neural network and deep learning packages, such as nnet and RSNNS, among others. But the idea of R joining forces with Python to implement a keras package is a welcome addition and one I wanted to try. I went through the R-Studio cheat sheet on keras and decided to make a go of it.

Straight to GTS Mode

Things went smoothly until I got to actually building and running the keras model. I was immediately faced with a long list of warnings followed by the failure of the model to run. I ran the code a couple more times to see if I could figure out what was going on. Each time, the same warnings popped up.

In looking closely at the warnings I finally noticed, buried among them towards the bottom, this error message:

ModuleNotFoundError: No module named 'keras'

I checked to make sure the keras library was loaded in my environment and running. It was. A lesson from a long ago data science class came to mind and I went straight to GTS mode. All I could find were references to keras in Python. There was nothing about this error message in R.

GitHub was the most help. There I found a thread on “No module named keras: #4889”. But it was short and was closed down due to lack of use in late 2017.

That thread contained a few snippets of Python code that helped me figure out the problem. For keras to run in R you need to have keras loaded in Python. Which means you need to have Anaconda Prompt or JupyterLab loaded in your system, as well as R.

Lesson Learned

This was news to me. It’s not mentioned in the keras cheat sheet or in Matt’s blog post.

In fact, the keras cheat sheet mentions in the “Installation” section that “the keras R package uses the Python keras library. You can install all the prerequisites directly from R.”

That wasn’t the case for me. There’s a note that says “See ?keras_install for GPU instructions,” but when I run the command I get “No results found.”

I guess it is common knowledge, but somehow I did not get the memo. Many others are probably unaware. Hence, this post.

The lesson here is read the documentation. Keras in R is the interface to Python’s keras. No Python, no keras in R.

More on this in the next post.

Displaying HTML Files in GitHub

Setting up an HTML page in GitHub is not difficult but it is a bit lengthy. Just follow these steps.

  1. Create the HTML from your RMarkdown document and save to your local directory.
  2. Create a new repo, or go to an existing repo, in your GitHub.
  3. In the repo, create a new file called docs/index.html.
  4. In the index.html file, type this code: <html><body><p>Hello World</p></body></html>
  5. Commit the file.
  6. In the repo, click on the Settings tab.
  7. Scroll down to GitHub Pages Section.
  8. Under “Source”, choose “master branch/docs folder” and save. If that doesn’t work, try the “master branch” and save.
  9. A message will appear that your site is ready to be published. Click on the link.
  10. When GitHub finds the index.html file in a docs/ folder, it creates a web address for the contents. The address takes the form: https://<username>.github.io/<repository>/
  11. Go to that address. You should see your “Hello World” greeting. That is your index.html file.
  12. Go back to the docs/ folder and open it.
  13. Upload the HTML file you created in RMarkdown and commit.
  14. Return to your site, but add the HTML file name to the address: https://<username>.github.io/<repository>/<yourfile>.html
  15. You should find your HTML posted there.
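To make steps 10 and 14 concrete, with a hypothetical username jdoe and repository my-analysis, the two addresses would look like:

```
https://jdoe.github.io/my-analysis/              (serves docs/index.html)
https://jdoe.github.io/my-analysis/report.html   (the uploaded RMarkdown HTML, here called report.html)
```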