Think twice before dropping that first one-hot encoded column

Red Huq

2019-05-06 19:30

Many machine learning models demand that categorical features are converted to a format they can comprehend via a widely used feature engineering technique called one-hot encoding. Machines aren't that smart.

A common convention after one-hot encoding is to remove one of the one-hot encoded columns from each categorical feature. For example, the feature sex containing values of male and female is transformed into the columns sex_male and sex_female, each containing binary values. Because either of these columns alone provides sufficient information to determine a person's sex, we can drop one of them.

In this post, we dive deep into the circumstances where this convention is relevant, necessary, or even prudent.

Table of contents¶

Preparing the data
Creating a linear regression model with ordinary least-squares
Making the normal equation usable again
Regularizing improves predictions and then some
Don't bother dropping columns when regularizing
Skip dropping columns when using iterative numerical methods
Maybe just stop dropping columns altogether
Conclusions

Preparing the data¶

Let's generate a toy dataset with three variables; the third column serves as the target variable while the remaining two are categorical features. Because we're working with a continuous target variable, we'll create a linear regression model.

In [1]:

# Load packages
import numpy as np
import pandas as pd

# Create training set
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

training_set

Out[1]:

	var1	var2	var3
0	apple	dog	10
1	banana	cat	4
2	pear	fish	39
3	orange	dog	-12
4	apple	fish	21
5	pear	cat	53
6	apple	fish	-69

We can use the pandas function get_dummies to perform one-hot encoding and generate the feature matrix $\mathbf{X}$.

Let's also add a bias term to $\mathbf{X}$ as a new column so that any model we create isn't confined to passing through the origin.

In [2]:

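# One-hot encode categorical features
# (dtype=int keeps the columns numeric; newer pandas versions return booleans by default)
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=int)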

# Add bias column
X['bias'] = np.ones(X.shape[0])

# Display first three rows
X.head(3)

Out[2]:

	var1_apple	var1_banana	var1_orange	var1_pear	var2_cat	var2_dog	var2_fish	bias
0	1	0	0	0	0	1	0	1.0
1	0	1	0	0	1	0	0	1.0
2	0	0	0	1	0	0	1	1.0

Finally, let's identify the target variable $\mathbf{y}$.

In [3]:

# Extract target variable
y = training_set['var3']

Creating a linear regression model with ordinary least-squares¶

In a linear regression model, we express the target variable $\mathbf{y}$ as a linear function of the features $\mathbf{X}$ and some unknown set of parameters $\vec{\theta}$:

$$\mathbf{y} = \mathbf{X}\vec{\theta}$$

The simplest algorithm for finding this "line of best fit" is ordinary least-squares (OLS); it identifies $\vec\theta$ that minimizes the sum of the squared residuals. Therefore, the objective function for OLS is

$$J(\vec{\theta}) = {\left\lVert \mathbf{y} - \mathbf{X}\vec{\theta} \right\rVert_2}^2$$

Next, we set the gradient of the objective to zero, $\frac{\partial J}{\partial\vec{\theta}} = 0$. This yields a system of linear equations in $\vec{\theta}$, which conveniently has a closed-form solution called the normal equation:

$$\vec{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
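For readers who want the intermediate steps, here is a sketch of how the normal equation follows from the objective function, assuming $\mathbf{X}^T\mathbf{X}$ is invertible (an assumption we'll have to revisit shortly):

$$
\begin{aligned}
J(\vec{\theta}) &= (\mathbf{y} - \mathbf{X}\vec{\theta})^T(\mathbf{y} - \mathbf{X}\vec{\theta}) \\
\frac{\partial J}{\partial\vec{\theta}} &= -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\vec{\theta}) = 0 \\
\mathbf{X}^T\mathbf{X}\vec{\theta} &= \mathbf{X}^T\mathbf{y} \\
\vec{\theta} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}
\end{aligned}
$$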

Let's apply the normal equation to identify the parameters of the OLS model.

In [4]:

# Compute parameters of OLS model
OLS_theta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Label parameters with feature names
pd.Series(OLS_theta, index=X.columns)

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-4-d1b033489f2a> in <module>
      1 # Compute parameters of OLS model
----> 2 OLS_theta = np.linalg.inv(X.T @ X) @ (X.T @ y)
      3 
      4 # Label parameters with feature names
      5 pd.Series(OLS_theta, index=X.columns)

~/miniconda3/envs/phoenix/lib/python3.7/site-packages/numpy/linalg/linalg.py in inv(a)
    549     signature = 'D->D' if isComplexType(t) else 'd->d'
    550     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    552     return wrap(ainv.astype(result_t, copy=False))
    553 

~/miniconda3/envs/phoenix/lib/python3.7/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     95 
     96 def _raise_linalgerror_singular(err, flag):
---> 97     raise LinAlgError("Singular matrix")
     98 
     99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

NumPy got angry because we tried to invert a singular matrix. Specifically, $\mathbf{X}^T\mathbf{X}$ (the Gram matrix of $\mathbf{X}$) was found to be singular, meaning it doesn't have an inverse. In fact, the Gram matrix is invertible if and only if the columns of $\mathbf{X}$ are linearly independent.

Examining the columns of $\mathbf{X}$, we see that

var1_apple = 1 - (var1_orange + var1_pear + var1_banana)

var2_cat = 1 - (var2_dog + var2_fish)

For any categorical feature, the one-hot encoded columns sum to a column of ones, which is exactly the bias column. Each one-hot encoded column can therefore be expressed as a linear combination of the others and the bias (perfect multicollinearity). As a result, the columns of $\mathbf{X}$ are linearly dependent, which explains the error.
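If you'd rather not take the algebra on faith, a quick numerical check is possible. The sketch below rebuilds the toy dataset, confirms that each group of one-hot columns sums to the bias column, and compares the rank of $\mathbf{X}$ with its number of columns. Passing dtype=float to get_dummies is my assumption, made only so the matrix is purely numeric.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from the post
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# One-hot encode and add the bias column (dtype=float is an assumption)
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=float)
X['bias'] = 1.0

# Each group of one-hot columns adds up to the bias column
var1_sum = X.filter(like='var1_').sum(axis=1)
var2_sum = X.filter(like='var2_').sum(axis=1)
print((var1_sum == X['bias']).all(), (var2_sum == X['bias']).all())

# A rank smaller than the column count means the columns are linearly dependent
n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X.to_numpy())
print(n_columns, rank)
```

The rank falls short of the column count by two, one for each categorical feature, which matches the two dependencies written out above.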

Making the normal equation usable again¶

By dropping one of the one-hot encoded columns from each categorical feature, we remove the redundancy: the dropped value becomes the implicit "reference" level, and the remaining columns become linearly independent.

Let's verify this works by implementing it; get_dummies even has a dedicated parameter drop_first.

In [5]:

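# One-hot encode categorical features and drop first value column
# (dtype=int keeps the columns numeric; newer pandas versions return booleans by default)
X_dropped = pd.get_dummies(training_set[['var1', 'var2']], drop_first=True, dtype=int)

# Add bias column
X_dropped['bias'] = np.ones(X_dropped.shape[0])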

# Display first three rows
X_dropped.head(3)

Out[5]:

	var1_banana	var1_orange	var1_pear	var2_dog	var2_fish	bias
0	0	0	0	1	0	1.0
1	1	0	0	0	0	1.0
2	0	0	1	0	1	1.0

We see that var1_apple and var2_cat were dropped. Let's reattempt to use the normal equation to identify the parameters of the OLS model.

In [6]:

# Compute parameters of OLS model
OLS_theta = np.linalg.inv(X_dropped.T @ X_dropped) @ (X_dropped.T @ y)

# Label parameters with feature names
pd.Series(OLS_theta, index=X_dropped.columns)

Out[6]:

var1_banana    14.0
var1_orange   -22.0
var1_pear      63.0
var2_dog       20.0
var2_fish     -14.0
bias          -10.0
dtype: float64

Smooth sailing this time. Therefore, when using the normal equation to create an OLS model, you must drop one of the one-hot encoded columns from each categorical feature.
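As a sanity check on those parameters, here is a small side sketch that solves the same system two other ways: np.linalg.solve (which avoids forming the inverse explicitly) and np.linalg.lstsq. Neither routine is used elsewhere in this post, and dtype=float is my assumption to keep the matrix numeric.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from the post
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# One-hot encode with the first column of each feature dropped, then add bias
X_dropped = pd.get_dummies(
    training_set[['var1', 'var2']], drop_first=True, dtype=float
)
X_dropped['bias'] = 1.0

A = X_dropped.to_numpy()
y = training_set['var3'].to_numpy(dtype=float)

# Solve the normal equation as a linear system instead of inverting
theta_solve = np.linalg.solve(A.T @ A, A.T @ y)

# Solve the least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

# Show both solutions side by side, labelled with feature names
comparison = pd.DataFrame(
    {'solve': theta_solve, 'lstsq': theta_lstsq}, index=X_dropped.columns
)
print(comparison.round(6))
```

Both columns should agree with the parameters in Out[6].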

Regularizing improves predictions and then some¶

OLS models are handy when we'd like to summarize linear trends for data we already have. When the goal is prediction however, these models are seldom useful because of their numerous pitfalls. In particular, OLS models tend to generalize poorly to new data (aka overfitting).

To prevent overfitting, applying some form of regularization is a no-brainer. $\ell_2$ regularization involves adding a penalty term—square of the $\ell_2$ norm of $\vec{\theta}$—to the objective function. Applying $\ell_2$ regularization to the OLS objective function yields

$$J(\vec{\theta}) = {\left\lVert \mathbf{y} - \mathbf{X}\vec{\theta} \right\rVert_2}^2 + \alpha{\left\lVert \vec{\theta} \right\rVert_2}^2$$

where $\alpha$ is a positive scalar hyperparameter that controls the degree of regularization (higher = more regularization).

Setting the gradient to zero, $\frac{\partial J}{\partial\vec{\theta}} = 0$, again gives a system of linear equations; fortunately, it too has a closed-form solution

$$\vec{\theta} = (\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

where $\mathbf{I}$ is an identity matrix with the same dimensions as the Gram matrix. Let's identify the parameters of the $\ell_2$ regularized model using $\alpha = 1$.

In [7]:

def create_L2_reg_model(X, y, alpha):
    """
    Generate an L2 regularized linear regression model.
    
    This function uses the closed-form solution to compute the parameters of
    an L2 regularized linear regression model.
    
    Args:
        X (DataFrame): table containing features
        y (Series): table containing target variable
        alpha (float): positive scalar controlling regularization strength
            (higher = more regularization)
    
    Returns:
        theta (Series): table containing identified parameters of model

    """
    # Compute identity matrix 
    I = np.identity((X.T @ X).shape[0])

    # Compute parameters
    theta = np.linalg.inv(X.T @ X + alpha * I) @ (X.T @ y)

    # Label parameters with feature names
    theta = pd.Series(theta, index=X.columns)
    
    return theta

# Create L2 regularized model after dropping columns 
create_L2_reg_model(X_dropped, y, alpha=1)

Out[7]:

var1_banana     0.246537
var1_orange    -7.501385
var1_pear      32.678670
var2_dog       -0.504155
var2_fish     -13.049861
bias            3.506925
dtype: float64

A regularized model will generally perform better on new data than an OLS model. In practice however, we'd tune the value of $\alpha$ using cross-validation to maximize model performance.
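To give a flavor of what tuning $\alpha$ could look like, here is a minimal leave-one-out cross-validation sketch on the toy data. The helper names ridge_fit and loo_mse and the grid of candidate values are hypothetical choices of mine, and seven rows are far too few for the result to mean much; the point is only the mechanics.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from the post
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# Same features as X_dropped in the post (dtype=float is an assumption)
features = pd.get_dummies(
    training_set[['var1', 'var2']], drop_first=True, dtype=float
)
features['bias'] = 1.0
A = features.to_numpy()
y = training_set['var3'].to_numpy(dtype=float)


def ridge_fit(A, y, alpha):
    """Closed-form L2 regularized solution (hypothetical helper)."""
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)


def loo_mse(A, y, alpha):
    """Mean squared error over leave-one-out folds (hypothetical helper)."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out example i
        theta = ridge_fit(A[keep], y[keep], alpha)
        errors.append((y[i] - A[i] @ theta) ** 2)
    return float(np.mean(errors))


# Candidate regularization strengths (an arbitrary illustrative grid)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: loo_mse(A, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
print(scores)
print(best_alpha)
```

On a real dataset we'd use more examples per fold and likely a finer grid, but the loop has the same shape.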

Don't bother dropping columns when regularizing¶

Having understood the benefits of regularization, let's try to generate an $\ell_2$ regularized model with the closed-form solution but instead use the original one-hot encoded features prior to dropping any columns. We'll probably run into the singular matrix error again.

In [8]:

# Create L2 regularized model using original one-hot encoded features
create_L2_reg_model(X, y, alpha=1)

Out[8]:

var1_apple     -9.617518
var1_banana    -4.306569
var1_orange    -9.543066
var1_pear      27.564964
var2_cat        8.515328
var2_dog        2.988321
var2_fish      -7.405839
bias            4.097810
dtype: float64

Wait, shouldn't NumPy have gotten angry? How were we still able to create a model? The answer is because in the closed-form solution of the $\ell_2$ regularized model above, the matrix $(\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})$ is almost surely nonsingular. I'll prove it:

$(\mathbf{X}^T\mathbf{X})^T = \mathbf{X}^T (\mathbf{X}^T)^T = \mathbf{X}^T \mathbf{X}$. Therefore, $\mathbf{X}^T\mathbf{X}$ is an $n \times n$ real symmetric matrix with exactly $n$ real eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ (counted with multiplicity).
When $\alpha = -\lambda_i$ for some $i$, $\det(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I}) = 0$ and, therefore, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is singular.
When $\alpha \neq -\lambda_i$ for every $i$, the eigenvalues of $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ are $(\lambda_1 + \alpha), \dots, (\lambda_n + \alpha)$, all of which are nonzero and, therefore, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is nonsingular.
$(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is nonsingular $\forall\{\alpha \in \mathbb{R} \mid \alpha \neq -\lambda_i\}$. Because only finitely many values of $\alpha$ are excluded, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is almost surely nonsingular.

"Almost surely" is an expression from probability theory describing events that occur with probability $P = 1$. Here it means that if $\alpha$ were drawn from a continuous distribution, it would land on one of the at most $n$ excluded values with probability zero. We can actually say more: $\mathbf{X}^T\mathbf{X}$ is positive semidefinite because $\vec{v}^T\mathbf{X}^T\mathbf{X}\vec{v} = \lVert\mathbf{X}\vec{v}\rVert_2^2 \geq 0$, so every $\lambda_i \geq 0$ and every $\lambda_i + \alpha > 0$ whenever $\alpha > 0$. Any positive amount of $\ell_2$ regularization therefore makes the matrix nonsingular. Practically any perturbation of a singular matrix makes it nonsingular!

Consequently, if we apply the tiniest bit of $\ell_2$ regularization, we can handle features that are perfectly correlated without removing any columns. $\ell_1$ and elastic net penalties have no closed-form solution, so they are fit iteratively and never require inverting the Gram matrix in the first place. Regularization also innately addresses the effects of multicollinearity, which is pretty awesome.

So if you are regularizing, there's no need to drop one of the one-hot encoded columns from each categorical feature; math's got your back.
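The eigenvalue argument in this section can also be checked numerically. The sketch below computes the eigenvalues of the Gram matrix for the full one-hot encoded $\mathbf{X}$ (no columns dropped) before and after adding $\alpha\mathbf{I}$ with $\alpha = 1$. I'm using np.linalg.eigvalsh, which is meant for symmetric matrices, and dtype=float is again my assumption.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from the post
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# Full one-hot encoding plus bias, with no columns dropped
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=float)
X['bias'] = 1.0
A = X.to_numpy()

# Eigenvalues of the Gram matrix (returned in ascending order)
gram = A.T @ A
eigenvalues = np.linalg.eigvalsh(gram)

# Adding alpha * I shifts every eigenvalue up by alpha
alpha = 1.0
shifted = np.linalg.eigvalsh(gram + alpha * np.eye(gram.shape[0]))

print(eigenvalues.round(6))
print(shifted.round(6))
```

The smallest eigenvalues of the Gram matrix are zero up to floating point error, which is why the inverse failed earlier; after the shift, none of them is smaller than $\alpha$.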

Skip dropping columns when using iterative numerical methods¶

As elegant as they are, the closed-form solutions are seldom utilized in practice. That's because matrix inversion is stupidly expensive. The time complexity of inverting an $n \times n$ matrix is $O(n^3)$ when using Gaussian elimination; more optimized algorithms can bring it down to about $O(n^{2.4})$. Unless the matrix has no more than a few hundred columns (rarely the case with real world datasets), you shouldn't attempt to invert it.

Instead of relying on a closed-form solution, we machine learning practitioners estimate parameters via some efficient iterative numerical method such as gradient descent. Because iterative numerical methods—with or without regularization—don't involve matrix inversions, there's no reason to drop one of the one-hot encoded columns from each categorical feature when using them.
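To illustrate, here is a bare-bones batch gradient descent sketch run on the full one-hot encoded $\mathbf{X}$, with no columns dropped and no regularization. The learning rate and the number of iterations are hand-picked assumptions for this toy dataset, not recommendations; real implementations would use a convergence check and a smarter step size.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from the post
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# Full one-hot encoding plus bias, with no columns dropped
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=float)
X['bias'] = 1.0
A = X.to_numpy()
y = training_set['var3'].to_numpy(dtype=float)

# Gradient descent on J(theta) = ||y - A theta||^2
theta = np.zeros(A.shape[1])
learning_rate = 0.02      # assumed small enough for this toy problem
n_iterations = 50000      # assumed large enough to converge here

for _ in range(n_iterations):
    gradient = 2 * A.T @ (A @ theta - y)   # dJ/dtheta
    theta -= learning_rate * gradient

# Fitted values of the model trained without dropping any columns
fitted = A @ theta
print(pd.Series(theta, index=X.columns).round(4))
print(fitted.round(4))
```

No singular matrix error in sight. The parameters differ from those in Out[6] because the full encoding admits many equally good parameter vectors, but the fitted values are the same as those of the OLS model built after dropping columns.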

Maybe just stop dropping columns altogether¶

So far we've discussed a few situations where removing one of the one-hot encoded columns isn't mandatory. However, dropping these columns can also have unforeseen, deleterious consequences.

Did you notice that the parameters of the one-hot encoded features had different values depending on whether columns were removed or not? For example, when columns are dropped $\theta_{var1\_banana} = 0.247$ and $\theta_{var2\_dog} = -0.504$; otherwise, $\theta_{var1\_banana} = -4.307$ and $\theta_{var2\_dog} = 2.988$. If we were planning to use these parameters to get a sense of feature importance, dropping columns would tell a whole other story!

Because dropping one-hot encoded columns alters the parameters of a regularized model, it also changes the model's predictions. What's more alarming is that dropping a different column from each categorical feature yields an entirely new set of parameters.

For example, instead of var1_apple and var2_cat, let's drop var1_banana and var2_dog from the one-hot encoded features.

In [9]:

# Drop different one-hot encoded columns from each categorical feature
X_dropped = X.drop(['var1_banana', 'var2_dog'], axis=1)

# Create L2 regularized model after dropping different set of columns
create_L2_reg_model(X_dropped, y, alpha=1)

Out[9]:

var1_apple     -8.651452
var1_orange    -8.199170
var1_pear      28.286307
var2_cat        6.639004
var2_fish      -8.294606
bias            4.398340
dtype: float64

If we arrive at a different model depending on the particular set of columns removed, how do we pick the right model? There's no good answer here—removing columns isn't trivial. You're better off staying objective and leaving one-hot encoded features alone.
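To make the effect on predictions concrete, the sketch below fits the $\ell_2$ regularized model ($\alpha = 1$) three ways: with all columns, with var1_apple and var2_cat dropped, and with var1_banana and var2_dog dropped. It then compares the predictions on the training set. The helper ridge_predictions is a hypothetical name of mine that reuses the closed-form solution from earlier.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from the post
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# Full one-hot encoding plus bias (dtype=float is an assumption)
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=float)
X['bias'] = 1.0
y = training_set['var3'].to_numpy(dtype=float)


def ridge_predictions(features, y, alpha=1.0):
    """Fit the closed-form L2 model and predict on the same rows."""
    A = features.to_numpy()
    theta = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)
    return A @ theta


# Same model family, three different column choices
pred_full = ridge_predictions(X, y)
pred_drop_a = ridge_predictions(X.drop(columns=['var1_apple', 'var2_cat']), y)
pred_drop_b = ridge_predictions(X.drop(columns=['var1_banana', 'var2_dog']), y)

predictions = pd.DataFrame({
    'no_drop': pred_full,
    'drop_apple_cat': pred_drop_a,
    'drop_banana_dog': pred_drop_b,
})
print(predictions.round(3))
```

Each column of the printed table is a different set of predictions for the very same training examples.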

Conclusions¶

Feature engineering is the most important aspect of creating an effective model—you want to get it right. When dealing with categorical features, a common convention is to drop one of the one-hot encoded columns from each feature. Here we discovered this convention is only required when creating an OLS model with the normal equation.

However, a cornerstone of machine learning is to produce a highly predictive model; therefore, we rarely turn to OLS models and almost always apply regularization. Even if we were to create an $\ell_2$ regularized model with a closed-form solution, the gorgeous math behind regularization would lift the obligation of removing one-hot encoded columns.

Nevertheless, the normal equation and other closed-form solutions are seldom practical due to their computational cost. Instead, we machine learning practitioners prefer creating linear regression models using iterative numerical methods that don't demand dropping one-hot encoded columns.

Finally, we found that dropping one-hot encoded columns tampers with a linear regression model's parameters and predictions. We also end up with a distinct model depending on which set of columns we happened to drop.

In summary, we've uncovered one unlikely use case where removing one of the one-hot encoded columns from each categorical feature is crucial for creating a linear regression model, two common situations when it's unnecessary, and two reasons why it's perilous. I'll leave it to you.

What about logistic regression? The same reasons actually apply to generalized linear models. There's even less of a reason to drop one-hot encoded columns when using logistic regression because there is no known closed-form solution for identifying its parameters. We always rely on an iterative numerical method. That is, unless your training set has two examples.

Side note: I recommend avoiding pandas' get_dummies and switching to a more robust one-hot encoder, such as OneHotEncoder from scikit-learn—it's designed to handle these frequent scenarios:

A categorical feature containing values that appear in the test set but not the training set
A categorical feature in the test set containing a subset of the total possible values

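Notice that, at the time of writing, OneHotEncoder doesn't let us drop one-hot encoded columns (scikit-learn 0.21 and later add an optional drop parameter, which is off by default)...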
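Here's a minimal sketch of how OneHotEncoder copes with both scenarios, assuming scikit-learn is installed. The tiny train and test frames are made up for illustration, and handle_unknown='ignore' is one of several possible policies for unseen values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 'kiwi' never appears in the training set,
# and the test set contains only a subset of the training values
train = pd.DataFrame({'var1': ['apple', 'banana', 'pear']})
test = pd.DataFrame({'var1': ['banana', 'kiwi']})

# Learn the categories from the training set only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The test set is encoded with the training categories;
# values unseen during fitting become all-zero rows
encoded_test = encoder.transform(test).toarray()
categories = list(encoder.categories_[0])

print(categories)
print(encoded_test)
```

The encoded test set keeps one column per training category even though it only contains banana, and the unseen kiwi row is encoded as all zeros instead of raising an error. With get_dummies applied separately to each set, the columns wouldn't line up.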