Using natural language processing to build a spam filter for text messages

After watching the film Arrival (also my favorite movie of 2016), I developed a deep appreciation for the field of linguistics. Human language is arguably the most unstructured type of data, and yet we effortlessly parse and interpret it, and even generate our own. On the other hand, understanding everyday language is a significant challenge for machines; this is the focus of natural language processing (NLP)—the crossroads between linguistics and AI. In this post, we'll make use of some NLP concepts and combine them with machine learning to build a spam filter for SMS text messages.

In the 1950s, Alan Turing conjectured that if a machine can fool a human into believing they are speaking with another human, the machine exhibits intelligence—the iconic Turing test. Recently, a machine arguably passed the Turing test for the first time; the milestone was largely attributed to advances in NLP. If you've ever wondered how Siri or Google recognizes your voice, makes sense of a semantically nebulous question such as "How did the Celtics do?", and returns an accurate response, that's also NLP in action! The war on spam is yet another application.

Before Gmail implemented its incredible spam filter, I remember crafting an elaborate set of if/then rules for words typically found in spam. Not surprisingly, it was tedious and prone to mistakes. Nowadays, no one hand-codes a spam filter—we train machines to do the job! But email spam is different from SMS spam: we tend to speak more casually on our phones, using more slang and incomplete sentences, and even concocting new words and phrases. Let's see what machine learning can do for SMS message spam.

Table of contents

  1. Inspecting the dataset
  2. Text preprocessing
  3. Feature engineering
  4. Training and evaluating a model
  5. What terms are the top predictors of spam?
  6. How to improve the model

1. Inspecting the dataset

While browsing the UCI Machine Learning repository, I discovered a dataset chock-full of anonymous SMS text messages. Let's begin by loading it and taking a look around.
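A note on setup before diving in: the notebook's first cell isn't shown in this post, but the code that follows assumes imports roughly along these lines (a sketch, since the exact setup cell isn't included here):

import re

import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import svm, metrics
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import (train_test_split, learning_curve,
                                     StratifiedShuffleSplit, GridSearchCV,
                                     cross_val_score)

# The NLTK stop word corpus may need a one-time download:
# nltk.download('stopwords')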

In [2]:
df = pd.read_table('data/SMSSpamCollection.txt', header=None)
df.head()
Out[2]:
      0                                                  1
0   ham  Go until jurong point, crazy.. Available only ...
1   ham  Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
0    5572 non-null object
1    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB

We have a collection of text data known as a corpus. Specifically, there are 5,572 SMS messages written in English, serving as training examples. The first column is the target variable containing the class labels, which tells us if the message is spam or ham (aka not spam). The second column is the SMS message itself, stored as a string.

Since the target variable contains discrete values, this is a classification task. Let's start by placing the target variable in its own table and checking out how the two classes are distributed.

In [4]:
y = df[0]
y.value_counts()
Out[4]:
ham     4825
spam     747
Name: 0, dtype: int64

It looks like there are far fewer training examples for spam than ham—we'll take this imbalance into account in the analysis. In addition, we need to encode the class labels in the target variable as numbers to ensure compatibility with some models in Scikit-learn. Because we have binary classes, let's use LabelEncoder and set 'spam' = 1 and 'ham' = 0.

In [5]:
le = LabelEncoder()
y_enc = le.fit_transform(y)

Next, let's place the SMS message data into its own table.

In [6]:
raw_text = df[1]

We're sitting on a mountain of text data but it might as well be gibberish since a machine only understands numbers. How can we convert this corpus into useful numerical features so we can train a classifier? This is where NLP comes in.

2. Text preprocessing

There are many feature engineering strategies for transforming text data into features. Some involve assigning each unique word-like term to a feature and counting the number of occurrences per training example. However, if we were to perform this strategy right now, we'd end up with an absurd number of features, a result of the myriad possible terms. The classifier would take too long to train and likely overfit. As a result, each NLP problem requires a tailored approach to determine which terms are relevant and meaningful.

For the remainder of this section, I'll walk through my preprocessing strategy, which relies heavily on regular expressions.

Normalization

Let's begin by taking a step back and examining the terms of a hypothetical SMS message.

In [7]:
example = """  ***** CONGRATlations **** You won 2 tIckETs to Hamilton in 
NYC http://www.hamiltonbroadway.com/J?NaIOl/event   wORtH over $500.00...CALL 
555-477-8914 or send message to: hamilton@freetix.com to get ticket !! !  """

I'd definitely deem this spam. But clearly there's a lot going on here: phone numbers, emails, website URLs, money amounts, and gratuitous whitespace and punctuation. Some terms are randomly capitalized, others are in all-caps. Since these terms might show up in any one of the training examples in countless forms, we need a way to ensure each training example is on equal footing via a preprocessing step called normalization.

Instead of removing the following terms, for each training example, let's replace them with a specific string.

  • Replace email addresses with 'emailaddr'
  • Replace URLs with 'httpaddr'
  • Replace money symbols with 'moneysymb'
  • Replace phone numbers with 'phonenumbr'
  • Replace numbers with 'numbr'
In [8]:
processed = raw_text.str.replace(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b',
                                 'emailaddr')
processed = processed.str.replace(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)',
                                  'httpaddr')
processed = processed.str.replace(r'£|\$', 'moneysymb')    
processed = processed.str.replace(
    r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
    'phonenumbr')    
processed = processed.str.replace(r'\d+(\.\d+)?', 'numbr')

Next, we'll remove all punctuation since "today" and "today?" refer to the same word. In addition, let's collapse all whitespace (spaces, line breaks, tabs) into a single space. Furthermore, we'll eliminate any leading or trailing whitespace.

In [9]:
processed = processed.str.replace(r'[^\w\d\s]', ' ')
processed = processed.str.replace(r'\s+', ' ')
processed = processed.str.replace(r'^\s+|\s+?$', '')

We'll also need to treat the words "there", "There" and "ThERe" as the same word. Therefore, let's lowercase the entire corpus.

In [10]:
processed = processed.str.lower()

Removing stop words

Some words in the English language, while necessary, don't contribute much to the meaning of a phrase. These words, such as "when", "had", "those" or "before", are called stop words and should be filtered out. The Natural Language Toolkit (NLTK), a popular Python library for NLP, provides common stop words.

In [11]:
stop_words = nltk.corpus.stopwords.words('english')

NLTK returns these stop words in a plain Python list. If we convert the list to a Python set, checking whether a term is a stop word becomes a constant-time membership test, which shaves time off this preprocessing step.

In [12]:
stop_word_set = set(stop_words)  # build the set once, not inside the loop
processed = processed.apply(lambda x: ' '.join(
    term for term in x.split() if term not in stop_word_set)
)

Stemming

It's likely the corpus contains words with various suffixes such as "distribute", "distributing", "distributor" or "distribution". We can collapse such variants down to a common root like "distribut" via a preprocessing step called stemming. There are numerous stemming strategies, some more aggressive than others. Instead of painstakingly building a stemmer from scratch, let's use one from NLTK called the Porter stemmer.

In [13]:
porter = nltk.PorterStemmer()
processed = processed.apply(lambda x: ' '.join(
    porter.stem(term) for term in x.split())
)
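As a quick spot check (not part of the original notebook), we can feed the stemmer a few related words and inspect what it returns:

for word in ['distribute', 'distributing', 'distribution', 'distributor']:
    print(word, '->', porter.stem(word))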

Because we have so many preprocessing steps, let's combine them all into a handy function that takes in a string and cleans it up.

In [14]:
def preprocess_text(messy_string):
    assert isinstance(messy_string, str)
    # Normalize emails, URLs, money symbols, phone numbers and other numbers
    cleaned = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', messy_string)
    cleaned = re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr',
                     cleaned)
    cleaned = re.sub(r'£|\$', 'moneysymb', cleaned)
    cleaned = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr', cleaned)
    cleaned = re.sub(r'\d+(\.\d+)?', 'numbr', cleaned)
    # Strip punctuation, collapse whitespace and lowercase everything
    cleaned = re.sub(r'[^\w\d\s]', ' ', cleaned)
    cleaned = re.sub(r'\s+', ' ', cleaned)
    cleaned = re.sub(r'^\s+|\s+?$', '', cleaned.lower())
    # Remove stop words and stem what's left
    return ' '.join(
        porter.stem(term)
        for term in cleaned.split()
        if term not in stop_word_set
    )

As a sanity check, let's confirm this function yields the same results as the previous series of steps.

In [15]:
(processed == raw_text.apply(preprocess_text)).all()
Out[15]:
True

Additionally, let's test preprocess_text() on the hypothetical SMS message from earlier.

In [16]:
preprocess_text(example)
Out[16]:
'congratl numbr ticket hamilton nyc httpaddr worth moneysymbnumbr call phonenumbr send messag emailaddr get ticket'

What a change! It looks like the preprocessing strategy is working. A quick technical note before we move on: since this was the first time I encountered regular expressions, I learned the hard way that mindful usage of the lazy quantifier ? is crucial—incorporating it reduced the total preprocessing time from 1 hour to 2 seconds!

3. Feature engineering

Now that we've enriched the corpus for meaningful terms, we're ready to construct features. Let's begin by breaking apart the corpus into a vocabulary of unique terms—a process called tokenization. However, there are several ways to approach this step.

Tokenization

We can tokenize individual terms and generate what's called a bag of words model. You may notice this model has a glaring pitfall: it fails to capture the innate structure of human language. Under this model, the following sentences have the same feature vector although they convey dramatically different meanings.

  • Does steak taste delicious?
  • Steak does taste delicious.

Alternatively, we can tokenize every sequence of $n$ terms called $n$-grams. For example, tokenizing adjacent pairs of words yields bigrams. The $n$-gram model preserves word order and can potentially capture more information than the bag of words model.

To get the best of both worlds, let's tokenize unigrams and bigrams. As an example, unigrams and bigrams for "The quick brown fox" are "The", "quick", "brown", "fox", "The quick", "quick brown" and "brown fox".
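As a quick aside (not from the original notebook), we can see this tokenization in action with scikit-learn's CountVectorizer, which shares its tokenizer with the TfidfVectorizer used below:

from sklearn.feature_extraction.text import CountVectorizer

demo_vect = CountVectorizer(ngram_range=(1, 2))
demo_vect.fit(['The quick brown fox'])
demo_vect.get_feature_names()  # newer scikit-learn versions: get_feature_names_out()

This returns the seven tokens listed above, lowercased, since scikit-learn lowercases text by default.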

Implementing the tf-idf statistic

Having selected a tokenization strategy, the next step is to assign each $n$-gram to a feature and then compute the $n$-gram's frequency using some statistic. Again, we have options.

One statistic called term frequency (tf) tallies the occurrences of each $n$-gram for every training example. However, some $n$-grams show up in nearly every SMS message and carry little signal, while others appear rarely across the corpus as a whole but are concentrated in certain subsets of messages, such as spam. To emphasize the latter, more interesting set of $n$-grams, we'll downweight the term frequency with inverse document frequency (idf), which is calculated by logarithmically scaling the inverse of the fraction of training examples that contain a given term. Combining these two statistics yields the tf-idf statistic:

$$\textrm{tf-idf}(t, i) = \textrm{tf}(t, i) \times \textrm{idf}(t) \\ = \textrm{tf}(t, i) \times \log\bigg(\frac{M}{m_t}\bigg)$$

where $\textrm{tf}(t,i)$ is the term frequency for term $t$ in the $i$th training example, $M$ is the total number of training examples, and $m_t$ is the number of training examples that contain the term $t$.

Scikit-learn has an off-the-shelf tool called TfidfVectorizer that performs $n$-gram tokenization and also computes the tf-idf statistic. Two technical details regarding TfidfVectorizer: a) the tf-idf statistic is computed slightly differently to avoid division by zero, and b) the computed tf-idf values for each training example are subsequently normalized.
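For reference (this reflects scikit-learn's documented defaults, smooth_idf=True and norm='l2', rather than anything specific to the original analysis), the vectorizer actually computes

$$\textrm{idf}(t) = \log\bigg(\frac{1 + M}{1 + m_t}\bigg) + 1$$

so that the denominator is never zero, and then rescales each training example's tf-idf vector to unit length.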

Finally, we're equipped to transform a corpus of text data into a matrix of numbers with one row per training example and one column per $n$-gram.

In [17]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_ngrams = vectorizer.fit_transform(processed)

Let's take a look at the dimensions of the X_ngrams matrix.

In [18]:
X_ngrams.shape
Out[18]:
(5572, 36348)

Holy cow, that's one massive matrix! It looks like the tokenization process extracted 36,348 unigrams and bigrams from the corpus; each one defines a feature. Since each training example only makes use of a tiny fraction of the complete $n$-gram vocabulary, X_ngrams mostly consists of zeros and is called a sparse matrix. To perform linear algebra computations rapidly on such a large sparse matrix, it'd be more efficient to store only the non-zero values while maintaining the structure. Fortunately, TfidfVectorizer utilizes the SciPy library to do exactly this!
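As a quick sanity check (not in the original analysis), we can measure just how sparse the matrix is using the non-zero count that SciPy tracks:

sparsity = X_ngrams.nnz / float(X_ngrams.shape[0] * X_ngrams.shape[1])
print('Fraction of non-zero entries: {:.4%}'.format(sparsity))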

4. Training and evaluating a model

That was a lot of prep work and we haven't even done any machine learning yet, but all of it is crucial for building a robust classifier. Not only is a model only as good as the data it's trained on, but meticulous feature engineering can beat having a fancy model. We'll be training a state-of-the-art classifier called a support vector machine (SVM). In a nutshell, this model attempts to find the hyperplane that best separates the two classes.

In particular, I've elected to use an SVM with a linear kernel because text data contains a large number of features (we have over 30,000!), and a nonlinear kernel would be computationally expensive. In addition, text data is typically linearly separable to begin with. Finally, we'll use svm.LinearSVC() instead of svm.SVC(kernel='linear') because the former relies on a library (liblinear) whose training time scales roughly linearly with the number of training examples rather than quadratically or cubically, which makes it much faster.

Preliminary analysis

For a rudimentary understanding of how well SVM performs on the dataset, let's start with the hold-out method: an 80/20 training and test set split. Remember to never evaluate a model on the data used to train it! In addition, since the classes are imbalanced, we'll incorporate stratification and use the $F_1$ score as the performance metric.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X_ngrams,
    y_enc,
    test_size=0.2,
    random_state=42,
    stratify=y_enc
)

clf = svm.LinearSVC(loss='hinge')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
metrics.f1_score(y_test, y_pred)
Out[19]:
0.9285714285714286

Not a bad start, but we really need to run cross-validation to gauge whether this performance is consistent. While we're here, let's use a confusion matrix to take a peek at what types of mistakes the classifier is making.

In [20]:
pd.DataFrame(
    metrics.confusion_matrix(y_test, y_pred),
    index=[['actual', 'actual'], ['ham', 'spam']],
    columns=[['predicted', 'predicted'], ['ham', 'spam']]
)
Out[20]:
             predicted
             ham   spam
actual ham   965      1
       spam   19    130

It looks like when the classifier does make a mistake, it's typically a false negative: letting a spam message slip through as ham (19 of them here), while only one legitimate message was incorrectly flagged as spam. Not ideal, but arguably the less costly type of error for a spam filter to make—and then again, this is just a single training/test set split.

Diagnosing the model with learning curves

Because we have so many features, it's prudent to determine whether the classifier suffers from high variance and is overfitting. We can assess the situation by plotting learning curves. Here's the procedure:

  1. Split the entire dataset $k$ times into a training and validation set
  2. For each split, train the model on subsets of the training set, each with fewer training examples
  3. Evaluate the model on the validation set and each subset of the training set
  4. Average the model performance across the $k$ splits for both training and validation sets

Specifically, we'll use 10 stratified shuffle splits (80/20 each) and effectively disable regularization by setting the hyperparameter C to a very large value. For each split, we'll train the classifier on 10 different dataset sizes, starting with 500 training examples. Fortunately, Scikit-learn has a convenient tool that computes learning curves in one go.

In [21]:
sample_space = np.linspace(500, len(raw_text) * 0.8, 10, dtype='int')

train_sizes, train_scores, valid_scores = learning_curve(
    estimator=svm.LinearSVC(loss='hinge', C=1e10),
    X=X_ngrams,
    y=y_enc,
    train_sizes=sample_space,
    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=40),
    scoring='f1',
    n_jobs=-1
)

I love the Seaborn library for visualization but like other high-level plotting libraries, it requires data to be in "tidy" format where each variable forms a column and each observation forms a row. Trust me, tidy data make data analysis and plotting so much easier. Let's define a function that combines train_scores and valid_scores into a single tidy table.

In [22]:
def make_tidy(sample_space, train_scores, valid_scores):
    messy_format = pd.DataFrame(
        np.stack((sample_space, train_scores.mean(axis=1),
                  valid_scores.mean(axis=1)), axis=1),
        columns=['# of training examples', 'Training set', 'Validation set']
    )
    
    return pd.melt(
        messy_format,
        id_vars='# of training examples',
        value_vars=['Training set', 'Validation set'],
        var_name='Scores',
        value_name='F1 score'
    )

We're now ready to use Seaborn to plot learning curves for the classifier.

In [23]:
g = sns.FacetGrid(
    make_tidy(sample_space, train_scores, valid_scores), hue='Scores', size=5
)

g.map(plt.scatter, '# of training examples', 'F1 score')
g.map(plt.plot, '# of training examples', 'F1 score').add_legend();

I love learning curves—they're teeming with information! Notice the performance on the training set is near perfect regardless of dataset size, which makes sense because we're evaluating the classifier on the same data used to train it. At first, it looks like the classifier is suffering from high variance and is overfitting since the validation scores never reach the same level. However, taking a closer look at the scale of the $y$-axis makes me believe this issue isn't that pronounced.

Nevertheless, since the validation scores improve with more training examples, an obvious next step is to acquire more data points. In addition, we could try a smaller feature set, for instance by using only unigrams rather than unigrams plus bigrams. Finally, it's worth tuning the regularization hyperparameter.

Using nested cross-validation to minimize information leakage

To optimize hyperparameters, we can use Scikit-learn's GridSearchCV tool, which trains a series of candidate models using every combination of hyperparameters, evaluates each model using $k$-fold cross-validation, and then reports the "winning" model and its hyperparameter combination that yielded the best performance. However, we can't report this value as an unbiased estimate of the model's performance because we repeatedly reused the same data for cross-validation—we potentially "leaked" information across the candidate models!

What we can do is utilize nested cross-validation to alleviate this issue. In this procedure, we implement $k$-fold cross-validation to train $k$ models (the outer loop). Using the training set of each fold, we perform GridSearchCV to tune the hyperparameters and select a winning model (the inner loop). Then, using the validation set of each fold, we evaluate the performance of the winning model developed in the inner loop. Finally, by computing the mean of this performance value across the $k$ folds, we can report a robust estimate of the model's performance. This can be a bit confusing so I recommend taking a glance at this visual representation of what's going on.

Using nested cross-validation, let's test a range of 20 values for the regularization hyperparameter, using 10 stratified 80/20 shuffle splits in both the inner and outer loops to assess the classifier's performance.

In [24]:
param_grid = [{'C': np.logspace(-4, 4, 20)}]

grid_search = GridSearchCV(
    estimator=svm.LinearSVC(loss='hinge'),
    param_grid=param_grid,
    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=42),
    scoring='f1',
    n_jobs=-1
)

scores = cross_val_score(
    estimator=grid_search,
    X=X_ngrams,
    y=y_enc,
    cv=StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
    scoring='f1',
    n_jobs=-1
)

scores
Out[24]:
array([ 0.91636364,  0.94366197,  0.95104895,  0.93661972,  0.94736842,
        0.93286219,  0.91039427,  0.90510949,  0.9057971 ,  0.94699647])

The scores across the 10 folds demonstrate the classifier's performance is moderately consistent—it depends on what level of variation we're comfortable with. Finally, let's compute the mean score to report an estimate of the classifier's performance.

In [25]:
scores.mean()
Out[25]:
0.92962222115832238
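To put a rough number on the fold-to-fold variation mentioned above (a quick follow-up, not part of the original run), we could also compute the standard deviation across the splits:

scores.std()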

5. What terms are the top predictors of spam?

I'm incredibly curious about which $n$-grams are the most predictive of spam. But first, we need to use the optimal regularization hyperparameter to train the classifier on the whole dataset in order to provide it as much information as possible.

In [26]:
grid_search.fit(X_ngrams, y_enc)
final_clf = svm.LinearSVC(loss='hinge', C=grid_search.best_params_['C'])
final_clf.fit(X_ngrams, y_enc);

Now, let's take a look at the top 20 $n$-grams that are most predictive of spam.

In [27]:
pd.Series(
    final_clf.coef_.T.ravel(),
    index=vectorizer.get_feature_names()
).sort_values(ascending=False)[:20]
Out[27]:
phonenumbr         5.008632
numbrp             2.799188
txt                2.690817
moneysymbnumbr     2.557430
call phonenumbr    2.251018
rington            2.098571
servic             2.049272
mobil              2.036900
numbr              1.896237
tone               1.831285
repli              1.664237
text               1.603976
claim              1.590065
video              1.473553
free               1.359938
wap                1.336547
stop               1.310738
credit             1.278887
uk                 1.239140
order              1.227617
dtype: float64

There are a few obvious ones at the top: phonenumbr, txt, moneysymbnumbr (a money amount), call phonenumbr, claim, video, free, stop, credit and order. However, the data description points out that the majority of the spam in this dataset originated from a British website, while most of the ham came from Singaporean students. This is quite concerning because the lexicon in SMS messages varies dramatically across national, cultural and age demographics. This sampling bias may adversely affect the validity of the classifier. Remember what I said earlier about the data you use to train a model?

To finish up, let's write a function that decides whether a string is spam or not, and apply it to the hypothetical message from earlier.

In [28]:
def spam_filter(message):
    if final_clf.predict(vectorizer.transform([preprocess_text(message)])):
        return 'spam'
    else:
        return 'not spam'
In [29]:
spam_filter(example)
Out[29]:
'spam'

The classifier decided that message was indeed spam. Do you agree? Let's try a real SMS message that I just received (spoiler: it's not spam).

In [30]:
spam_filter('Ohhh, but those are the best kind of foods')
Out[30]:
'not spam'

Cool! Go ahead and try your own messages!

6. How to improve the model

We can attempt to boost the classifier's performance, but there are simply too many avenues to cover them all. Instead, I'd like to highlight some of the most promising ones.

  1. Text preprocessing
    • I handcrafted the regular expressions in this post (and I'm by no means an expert!), but I'm certain their matching performance and efficiency can be improved. It may also be worthwhile to use regex to normalize other terms such as dates, times and slang.
    • I selected the Porter stemmer here, but comparing its performance with the Lancaster stemmer would be interesting. However, stemming can be crude and chop off suffixes haphazardly—a better alternative is lemmatization, which maps each word to its dictionary form; for example, "worse" reduces to "bad".
    • We ignored the innate features of each training example, such as message length, average word length, distribution of the various forms of punctuation, and all-caps frequency. Each of these may enhance the classifier's predictive capacity (see the sketch after this list).
  2. Feature engineering
    • I discovered Scikit-learn computes term frequency by simply counting each term. Because SMS messages come in a variety of lengths, normalizing the term frequency to the message length is a good idea.
    • You may have noticed most of the top 20 terms listed above are unigrams. This hints that bigrams aren't as useful in this problem—a bag of words model may be sufficient.
  3. Machine learning
    • Instead of tuning just the classifier's hyperparameters, we can perform a more exhaustive GridSearchCV to compare the use of raw term frequencies instead of tf-idf, or perhaps only tokenizing terms that have a term frequency above/below a threshold. In fact, TfidfVectorizer includes many other knobs to play with. In addition, we can test other classifiers such as Naive Bayes, logistic regression or a neural net. All of these different options can be investigated simultaneously by combining GridSearchCV with Scikit-learn's Pipeline tool.
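As an example of the innate-features idea mentioned above, here's a minimal sketch that appends a scaled message-length column to the tf-idf matrix. It assumes the raw_text and X_ngrams objects from earlier; the scipy.sparse.hstack call and the scaling choice are my own additions, not part of the original analysis.

from scipy.sparse import csr_matrix, hstack

# Hypothetical extra feature: message length in characters, scaled to [0, 1]
# so it sits on a footing comparable to the l2-normalized tf-idf values.
msg_length = raw_text.str.len().values.reshape(-1, 1)
msg_length_scaled = msg_length / float(msg_length.max())

# Append the new column to the sparse n-gram matrix
X_extended = hstack([X_ngrams, csr_matrix(msg_length_scaled)], format='csr')
X_extended.shape

From there, the same cross-validation workflow as above would tell us whether the extra column actually helps.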

Of course, the biggest issue in this analysis stems from the dataset itself. We discovered the training examples weren't independently and identically distributed, which breaks an important assumption in machine learning. Therefore, to improve the classifier, it's crucial to not only acquire more training examples but ensure they all come from the same distribution.

You probably realized by now that each time you reported an email as spam on Gmail, you were providing Google one more training example to help improve their classifier! I hope this post presented a glimpse of what NLP offers and how it can be combined with machine learning to tackle a real world problem. I've only scratched the surface—can't wait to dive deeper into the linguistics!

If you'd like to play around with the code, here's the GitHub repo. As always, don't hesitate to leave your comments below.
