# Email spam filtering: Text analysis in R

Friday, August 25, 2017

Email spam, also known as junk email, is a type of electronic spam in which unsolicited messages are sent by email.

Because of the huge number of spam emails sent across the Internet each day, most email providers offer a spam filter that automatically flags likely spam messages and separates them from the legitimate ones (the ham). Although these filters use a number of techniques, most rely heavily on analyzing the contents of an email via text analytics.

Let’s build and evaluate a spam filter using a publicly available dataset, emails.csv, together with its codebook.

Follow the standard steps to build and pre-process the corpus:
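The exact steps are not listed here, so the following is only a minimal sketch of a typical tm pipeline: lower-casing, removing punctuation, removing English stop words, and stemming. The toy data frame and its column names `text` and `spam` are assumptions standing in for emails.csv.

```r
library(tm)         # text-mining framework
library(SnowballC)  # stemmer used by stemDocument()

# Toy stand-in for emails.csv; the column names `text` and `spam` are assumed.
emails <- data.frame(
  text = c("Subject: WIN a FREE prize now!!!",
           "Subject: Enron meeting moved to Monday",
           "Subject: free money, click here",
           "Subject: notes from the Enron meeting"),
  spam = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(VectorSource(emails$text))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case
corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stemDocument)                       # reduce words to stems
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces

content(corpus[[1]])  # inspect the first cleaned email
```

Lower-casing comes before stop-word removal because the stop-word list is lower-case.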

We are now ready to extract the word frequencies to be used in our prediction problem.
Let’s create a DocumentTermMatrix where the rows correspond to documents (emails), and the columns correspond to words.
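As a rough illustration, here is how the matrix can be built with tm. The four short documents are hypothetical, already-cleaned stand-ins for the pre-processed emails.

```r
library(tm)

# Hypothetical, already-cleaned documents standing in for the emails.
docs <- c("free prize win free",
          "enron meet monday",
          "free money click",
          "enron meet note")
corpus <- VCorpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term counts.
dtm <- DocumentTermMatrix(corpus)

dim(dtm)             # number of documents, number of distinct terms
inspect(dtm[1:2, ])  # peek at the first two rows
```

Note that `DocumentTermMatrix()` by default ignores words shorter than three characters.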

To obtain a more reasonable number of terms, limit dtm to contain terms appearing in at least 5% of documents.
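One way to do this is tm's `removeSparseTerms()` with `sparse = 0.95`. The sketch below uses 40 made-up documents so that the effect is visible; the variable name `spdtm` is just an illustrative choice.

```r
library(tm)

# 40 toy documents: "meet" is in all of them, "enron" in half, "lottery" in one.
docs <- rep(c("enron meet", "meet"), times = 20)
docs[1] <- "enron meet lottery"
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# sparse = 0.95 drops terms that are absent from more than 95% of documents.
spdtm <- removeSparseTerms(dtm, 0.95)

Terms(dtm)    # all terms
Terms(spdtm)  # terms that survive the filter
```

Here "lottery" appears in only 1 of 40 documents (2.5%), so it is dropped, while the other two terms are kept.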

The lists of most common words are significantly different between the spam and ham emails.
A word stem like enron, which is extremely common in the ham emails but does not occur in any spam message, will help us correctly identify a large number of ham messages.
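A comparison like this can be made by summing the term counts separately for each class. The following sketch uses a tiny hypothetical dataset; the names `emailsSparse`, `ham_counts`, and `spam_counts` are illustrative, not taken from the original analysis.

```r
library(tm)

# Hypothetical cleaned emails with an assumed `spam` label (1 = spam, 0 = ham).
emails <- data.frame(
  text = c("free prize free money", "free money click",
           "enron meet monday", "enron meet note enron"),
  spam = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(emails$text)))
emailsSparse <- as.data.frame(as.matrix(dtm))
terms <- names(emailsSparse)           # remember the term columns
emailsSparse$spam <- emails$spam       # attach the label

# Total count of every term within each class.
ham_counts  <- colSums(emailsSparse[emailsSparse$spam == 0, terms])
spam_counts <- colSums(emailsSparse[emailsSparse$spam == 1, terms])

head(sort(ham_counts,  decreasing = TRUE), 3)  # most common stems in ham
head(sort(spam_counts, decreasing = TRUE), 3)  # most common stems in spam
```

In this toy data, as in the real one, enron dominates the ham list and never shows up in spam.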

Now, let’s build our machine learning models.

## Logistic Regression model

None of the variables is significant in our logistic regression model. Note that fitting the model produced the warnings "algorithm did not converge" and "fitted probabilities numerically 0 or 1 occurred". Both of these messages often indicate overfitting, and the first indicates particularly severe overfitting.
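To see where such warnings come from, here is a small sketch with made-up term counts in which the two classes are perfectly separable: `enron` occurs only in ham and `free` only in spam. The data frame and the warning-collecting wrapper are illustrative assumptions, not the original model.

```r
# Hypothetical term-frequency table: `enron` appears only in ham, `free` only in spam.
train <- data.frame(
  enron = c(3, 2, 1, 4, 0, 0, 0, 0),
  free  = c(0, 0, 0, 0, 2, 1, 3, 1),
  spam  = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Fit the model while collecting any warnings instead of printing them.
warns <- character(0)
spamLog <- withCallingHandlers(
  glm(spam ~ ., data = train, family = binomial),
  warning = function(w) {
    warns <<- c(warns, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warns                          # warnings raised by glm.fit
summary(spamLog)$coefficients  # note the huge standard errors
```

With perfectly separable data the coefficients grow without bound, so the standard errors explode and no variable looks significant, even though the model classifies the training data perfectly.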

## Prediction on training data

In terms of both accuracy and AUC on the training set, logistic regression is nearly perfect and outperforms the other two models, CART and random forest.
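For reference, both measures can be computed in a few lines of base R. The labels and probabilities below are hypothetical, and the AUC uses the rank-based (Mann-Whitney) formula rather than a dedicated package such as ROCR, purely to keep the sketch dependency-free.

```r
# Hypothetical true labels and predicted spam probabilities.
actual <- c(0, 0, 0, 1, 1, 0, 1, 1)
prob   <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.70, 0.30, 0.90)

# Accuracy at a threshold of 0.5, via the confusion matrix.
pred <- as.integer(prob > 0.5)
conf <- table(actual = actual, predicted = pred)
accuracy <- sum(diag(conf)) / sum(conf)

# AUC: probability that a random spam email scores higher than a random ham email.
auc <- function(actual, prob) {
  r <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

conf
accuracy           # 0.75 on this toy data
auc(actual, prob)  # 0.75 on this toy data
```

Accuracy depends on the chosen threshold, while AUC summarizes performance across all thresholds.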

## Prediction on testing data

The random forest outperformed logistic regression and CART in both measures, obtaining an impressive AUC of 0.997 on the test set.

The logistic regression model obtained nearly perfect accuracy and AUC on the training set but far-from-perfect performance on the testing set, which is an indicator of overfitting. A logistic regression model with a large number of variables is particularly at risk of overfitting.
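The gap between training and testing performance can be reproduced on synthetic data. In the sketch below every predictor is pure noise and the labels are random, so any training accuracy above chance is memorization. The sizes, the seed, and the 50/50 split are arbitrary assumptions and do not reflect the split used for the real data.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Hypothetical data: 120 "emails", 40 noise term counts, labels unrelated to them.
n <- 120
p <- 40
x <- as.data.frame(matrix(rpois(n * p, lambda = 1), nrow = n))
x$spam <- rbinom(n, 1, 0.5)

# Simple 50/50 split into training and testing sets.
idx   <- sample(n, n / 2)
train <- x[idx, ]
test  <- x[-idx, ]

# Many variables relative to the number of observations invites overfitting.
fit <- suppressWarnings(glm(spam ~ ., data = train, family = binomial))

accuracy <- function(model, data) {
  prob <- suppressWarnings(predict(model, newdata = data, type = "response"))
  mean((prob > 0.5) == (data$spam == 1))
}

c(train = accuracy(fit, train), test = accuracy(fit, test))
```

One would expect the training accuracy to come out well above the testing accuracy, which should hover around chance, since there is no real signal to learn.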

Most email providers move all of the emails flagged as spam to a separate “Junk Email” folder, meaning those emails are not displayed in the main inbox. Many users never check the spam folder, so they will never see emails delivered there.
A false negative (a spam email that reaches the inbox) is largely a nuisance, since the user only needs to delete the unsolicited email. A false positive (a legitimate email flagged as spam), however, can be very costly, since the user might completely miss an important email that was delivered to the spam folder. Therefore, the false positive is the more costly error.

Nevertheless, a user who is particularly annoyed by spam might assign a particularly high cost to a false negative, while a user who never checks the spam folder will miss any misfiled email, which makes a false positive particularly costly. Thus, a large-scale email provider needs to automatically collect information about how often each user accesses their Junk Email folder in order to infer these preferences. That’s what most email providers do.
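One simple way to act on such preferences is to pick the classification threshold that minimizes the total cost of the errors. The sketch below is only an illustration: the labels, probabilities, cost values, and the helper names `total_cost` and `best_threshold` are all hypothetical.

```r
# Hypothetical true labels and predicted spam probabilities.
actual <- c(0, 0, 0, 0, 1, 1, 1, 1)
prob   <- c(0.05, 0.22, 0.55, 0.31, 0.45, 0.62, 0.85, 0.95)

# Total cost of a threshold: each false positive costs cost_fp,
# each false negative costs cost_fn.
total_cost <- function(threshold, cost_fp, cost_fn) {
  pred <- prob > threshold
  fp <- sum(pred & actual == 0)   # ham sent to the Junk folder
  fn <- sum(!pred & actual == 1)  # spam left in the inbox
  cost_fp * fp + cost_fn * fn
}

# Pick the cheapest threshold on a coarse grid.
best_threshold <- function(cost_fp, cost_fn, grid = seq(0.1, 0.9, by = 0.1)) {
  costs <- sapply(grid, total_cost, cost_fp = cost_fp, cost_fn = cost_fn)
  grid[which.min(costs)]
}

best_threshold(cost_fp = 10, cost_fn = 1)  # never checks Junk: 0.6 on this toy data
best_threshold(cost_fp = 1, cost_fn = 10)  # mainly hates spam: 0.4 on this toy data
```

A user for whom false positives are expensive gets a higher threshold, so fewer emails are flagged, while a user who mostly dislikes spam gets a lower one.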

