Coming to $terms in R

A recent analysis I worked on involved building a logistic regression model and some ensemble methods using a data set with about 25 features, in addition to the target. It was an analysis of customer churn in the telecom industry. If you are interested, you can find the problem statement here, the annotated code here, and the raw code in my GitHub repository.

The problem I ran into came up later, when I wanted to reproduce a step in this analysis. I forgot the R command I used! @DarrinLRogers came through with the reminder, so I thought I should write it all down in a post. You know, for next time I forget.

Working with Regression Models

I was doing some feature selection and feature engineering, so I expected to reduce the number of features significantly. Certain features were highly correlated, and there was also some redundancy, both of which meant pruning back the feature set. The count actually rose to 32 at first, by the way, because almost all of the categorical features had to be converted to dummy variables.

I took advantage of the neat shortcut in R that lets you use the Y ~ . formula shorthand to pass all the features into a regression function. I was using glm(), but the syntax is the same for lm(). The code looked like this:

LogModel2 <- glm(Churn ~ ., family = binomial(link = 'logit'), data = training)

Running the summary() command on a glm or lm object gives an indication of the relative importance of the features in the data set, which are now terms in the regression equation. Significance is indicated by the asterisks printed to the right of the column of p-values. You can use these to decide which terms to prune from the equation.
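
To see how this looks in practice with the LogModel2 object fitted above:

summary(LogModel2)
# The coefficient table ends with R's significance-code legend:
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Terms with no asterisks are the obvious candidates for pruning.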

A similar selection analysis can be done by running the anova() command. The results won’t necessarily match the summary() output, since anova() tests the terms sequentially, so comparing the two can be helpful.
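
For a glm fit, I believe the chi-squared test is the usual choice for the analysis-of-deviance table:

anova(LogModel2, test = "Chisq")
# Sequential (Type I) analysis of deviance: each row tests a term
# added to those above it, so the order of the terms matters.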

About 12 terms had to be cut from the regression model. They weren’t adding much to performance, and a simpler model is generally better, so dropping them made sense. But with that many to remove, retyping the glm() call gets tedious. Fortunately, I came across a very useful R tip: it made it possible to print the full regression model with every term written out, rather than abbreviated in the Y ~ . syntax.

All I had to do was copy the full equation and remove the terms I didn’t need. I ended up with this:

LogModel2 <- glm(Churn ~ TotalCharges + PhoneService_Yes + MultipleLines_Yes + InternetService_DSL + 
                         `InternetService_Fiber optic` + OnlineSecurity_Yes + 
                         TechSupport_Yes + StreamingTV_Yes + StreamingMovies_Yes + 
                         `Contract_Month-to-month` + `Contract_One year` + PaperlessBilling_Yes +
                         `PaymentMethod_Electronic check` + `tenure_group_0 - 6 Mos`,
                         family = binomial(link = 'logit'), 
                         data = training)

If there is another way of doing this (and I’m sure there is) I would like to hear about it.

The call that yields this result is, in this case, LogModel2$terms. It produces a lot of information about the object, but the printout of the full equation with every term spelled out was what mattered for this analysis.
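
On the “another way” question: as far as I know, base R has a couple of related helpers that reach the same information. A rough sketch, using the model above (LogModel3 is just an illustrative name):

LogModel2$terms      # the terms object; printing it shows the fully expanded formula
formula(LogModel2)   # if memory serves, this also returns the formula with the dot expanded
# update() can drop terms without retyping the whole call, e.g.
LogModel3 <- update(LogModel2, . ~ . - StreamingTV_Yes)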

The Takeaway

Later, I wanted to do something similar in another project I was working on, but I forgot how I did it the first time! I drew a total blank. I spent hours googling and going through my history in RStudio, but I couldn’t find the code. It was terribly frustrating.

Finally, in desperation, I turned to the #rstats hashtag on Twitter and the R4ds Slack group. My mistake was waiting so long; I should have gone there first.

I got a lot of good responses and helpful ideas on both channels, but none was exactly what I was looking for. I realized the object$terms call might not be well known in the community, so I vowed that when I figured it out, or someone pointed me in the right direction, I would pass along the knowledge in a blog post.

Fortunately, @DarrinLRogers came through with the solution I was looking for. Thanks to him and all the others who pitched in with good ideas.

You can learn more about the R4ds Slack channel from Jesse Meagan. Sign up for R4ds Slack here.

Mission accomplished.

Python’s Keras Library in R, Part 2

So I learned in the previous post that if an R user wants to use the Python keras library in R to run neural net models, it is necessary to set up Python first. The keras package in R is an interface to Python’s keras, not a standalone package.

That’s fine, but it would have been nice to know beforehand. So I thought I should write it down for others.

Loaded Anaconda 3 Earlier

Fortunately, I loaded Anaconda 3 into my system earlier this year in preparation for a program at UCLA on data science. We have been using Jupyter Notebooks and, lately, Jupyter Lab to run both R and Python code, so I have much of the Python infrastructure set up. Anaconda 3 is a pretty easy installation, though it does take some time due to its size. It is a good place to start if you need a Python environment.

Anaconda Navigator is the graphical hub of the Anaconda 3 suite, and it comes with a version of RStudio. I can’t say anything about that version, but some users might find it convenient to have both Python and RStudio in the same software suite.

If you are working in Notebook or Lab, Docker is another useful program to have running on your system. You have to access and operate it through the shell. A thorough treatment of Docker and all its intricacies can be found in Joshua Cook’s book, Docker for Data Science: Building Scalable and Extensible Data Infrastructure Around the Jupyter Notebook Server, available from Apress.

One way or another, however you work with Python, setting it up is necessary before you can load the keras package into R. And since keras runs on top of TensorFlow, you will need the R tensorflow library as well, but that should not be too demanding.
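
For the R side of that, a minimal sketch, assuming you use the CRAN tensorflow package; install_tensorflow() is the helper that pulls in the Python TensorFlow backend, so it can take a while:

install.packages("tensorflow")
library(tensorflow)
install_tensorflow()   # installs the Python TensorFlow backend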

Easiest to Work with Python in the Command Window

Once you have set up Anaconda 3, it is probably easiest to work with Python through the command window, Anaconda Prompt. The keras package does not come preinstalled in Anaconda, so you have to install it yourself. I found the commands that make this possible on GitHub, and if you are not familiar with Python it might be easiest just to follow this approach.

At the command line in Anaconda Prompt, you need to enter:

pip install keras

Then, in a Python session, run import keras to confirm it loads.

That’s it.

Though I don’t have personal experience with it, another method I understand works, and one you should try only if the previous command does not help, is to enter (on macOS or Linux, where sudo is available):

sudo pip3 install keras

A common mistake is to enter instead:

conda install keras

So try to avoid that.

It takes a while to load keras into Python, so be patient and enjoy watching the forward slash spin around while the program does its thing.
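
Once the install finishes, a quick way to confirm it worked (assuming python is on your path, which it should be inside the Anaconda Prompt) is a one-liner like:

python -c "import keras; print(keras.__version__)"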

Good to Go

Once you have keras loaded, go back to your R environment and install and load the CRAN version of the library.
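
In case it helps, the R side is just the usual package install, something like:

install.packages("keras")   # the CRAN interface package
library(keras)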

You should be good to go.

The PyImageSearch blog has a good post covering the keras installation in much more detail if you are interested. I didn’t find it necessary to carry out the steps described there for editing the keras.json config file, setting up GPU support, or accessing OpenCV bindings, but if you do, it is a good reference.

We can talk about the R interface with TensorFlow some other time.

Loading Python’s Keras into R, Part 1

Late last year, Matt Dancho had a post on deep learning celebrating the arrival of the keras package for R, an interface to Python’s keras. It is a very good tutorial on using artificial neural networks (ANNs) to solve complicated business problems, well worth checking out.

Took More Doing Than I Thought

I started working with neural networks over a decade ago with Palisade’s DecisionTools software, which includes NeuralTools, a neural network add-in for Excel. It’s a quality program that works well, but it is subject to constraints imposed by Excel. So I looked forward to playing around with keras and getting a sense of how R works with neural nets.

What I didn’t know is that in order to use keras in R it is necessary to have the keras Python library loaded and ready to go. This took more doing than I thought it would.

Of course, R has native neural network and deep learning packages, such as nnet and RSNNS, among others. But the idea of R joining forces with Python to implement a keras package is a welcome addition and one I wanted to try. I went through the RStudio cheat sheet on keras and decided to make a go of it.

Straight to GTS Mode

Things went smoothly until I got to actually building and running the keras model. I was immediately faced with a long list of warnings followed by the failure of the model to run. I ran the code a couple more times to see if I could figure out what was going on. Each time, the same warnings popped up.

In looking closely at the warnings I finally noticed, buried among them towards the bottom, this error message:

ModuleNotFoundError: No module named 'keras'

I checked to make sure the keras library was loaded in my R environment and running. It was. A lesson from a long-ago data science class came to mind, and I went straight to GTS mode. All I could find were references to keras in Python; there was nothing about getting this error message in R.

GitHub was the most helpful. There I found a thread titled “No module named keras: #4889”. It was short, though, and had been closed for inactivity in late 2017.

That thread contained a few snippets of Python code that helped me figure out the problem. For keras to run in R, you need to have keras installed in Python, which means you need Anaconda Prompt or JupyterLab (that is, a working Python setup) on your system, as well as R.
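
For what it’s worth, one way to check this from inside R, assuming the reticulate package (which the R keras package uses under the hood) is installed:

library(reticulate)
py_config()                     # shows which Python installation R is talking to
py_module_available("keras")    # TRUE if Python's keras can be found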

Lesson Learned

This was news to me. It’s not mentioned in the keras cheat sheet or in Matt’s blog post.

In fact, the keras cheat sheet mentions in the “Installation” section that “the keras R package uses the Python keras library. You can install all the prerequisites directly from R.”

That wasn’t the case for me. There’s a note that says “See ?keras_install for GPU instructions,” but when I run the command I get “No results found.”

I guess it is common knowledge, but somehow I did not get the memo. Many others are probably unaware. Hence, this post.

The lesson here is to read the documentation. Keras in R is an interface to Python’s keras. No Python, no keras in R.

More on this in the next post.