Think twice before dropping that first one-hot encoded column

Many machine learning models demand that categorical features are converted to a format they can comprehend via a widely used feature engineering technique called one-hot encoding. Machines aren't that smart.

A common convention after one-hot encoding is to remove one of the one-hot encoded columns from each categorical feature. For example, the feature sex containing values of male and female is transformed into the columns sex_male and sex_female, each containing binary values. Because either of these columns provides sufficient information to determine a person's sex, we can drop one of them.
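To make the convention concrete, here is a minimal sketch using pandas' get_dummies on a made-up three-person sample. The people DataFrame is purely hypothetical, and dtype=int is passed only so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical three-person sample, used only for illustration
people = pd.DataFrame({'sex': ['male', 'female', 'female']})

# Full one-hot encoding: one binary column per category
full = pd.get_dummies(people, dtype=int)

# Conventional encoding: the first category (in sorted order) is dropped
reduced = pd.get_dummies(people, drop_first=True, dtype=int)

print(full)
print(reduced)
```

In the reduced table, a 0 in sex_male implies female, so no information is lost.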

In this post, we dive deep into the circumstances where this convention is relevant, necessary, or even prudent.

Table of contents

  1. Preparing the data
  2. Creating a linear regression model with ordinary least-squares
  3. Making the normal equation usable again
  4. Regularizing improves predictions and then some
  5. Don't bother dropping columns when regularizing
  6. Skip dropping columns when using iterative numerical methods
  7. Maybe just stop dropping columns altogether
  8. Conclusions

Preparing the data

Let's generate a toy dataset with three variables; the third column serves as the target variable while the other two are categorical features. Because we're working with a continuous target variable, we'll create a linear regression model.

In [1]:
# Load packages
import numpy as np
import pandas as pd

# Create training set
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)

# Display training set
training_set

var1 var2 var3
0 apple dog 10
1 banana cat 4
2 pear fish 39
3 orange dog -12
4 apple fish 21
5 pear cat 53
6 apple fish -69

We can use the pandas function get_dummies to perform one-hot encoding and generate the feature matrix $\mathbf{X}$.

Let's also add a bias term to $\mathbf{X}$ as a new column so that any model we create isn't confined to passing through the origin.

In [2]:
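# One-hot encode categorical features
# (dtype=int keeps the dummies numeric; newer pandas versions default to bool)
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=int)

# Add bias column
X['bias'] = np.ones(X.shape[0])

# Display first three rows
X.head(3)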
var1_apple var1_banana var1_orange var1_pear var2_cat var2_dog var2_fish bias
0 1 0 0 0 0 1 0 1.0
1 0 1 0 0 1 0 0 1.0
2 0 0 0 1 0 0 1 1.0

Finally, let's identify the target variable $\mathbf{y}$.

In [3]:
# Extract target variable
y = training_set['var3']

Creating a linear regression model with ordinary least-squares

In a linear regression model, we express the target variable $\mathbf{y}$ as a linear function of the features $\mathbf{X}$ and some unknown set of parameters $\vec{\theta}$:

$$\mathbf{y} = \mathbf{X}\vec{\theta}$$

The simplest algorithm for finding this "line of best fit" is ordinary least-squares (OLS); it identifies $\vec\theta$ that minimizes the sum of the squared residuals. Therefore, the objective function for OLS is

$$J(\vec{\theta}) = {\left\lVert \mathbf{y} - \mathbf{X}\vec{\theta} \right\rVert_2}^2$$

Next, we set the gradient of the objective function to zero and solve the resulting system of linear equations $\frac{\partial J}{\partial\vec{\theta}} = 0$, which conveniently has a closed-form solution called the normal equation:

$$\vec{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
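For readers who want the intermediate step, here is a brief sketch of how the normal equation follows from the objective function, assuming $\mathbf{X}^T\mathbf{X}$ is invertible. Expanding the squared norm and differentiating with respect to $\vec{\theta}$ gives

$$J(\vec{\theta}) = (\mathbf{y} - \mathbf{X}\vec{\theta})^T(\mathbf{y} - \mathbf{X}\vec{\theta}) = \mathbf{y}^T\mathbf{y} - 2\vec{\theta}^T\mathbf{X}^T\mathbf{y} + \vec{\theta}^T\mathbf{X}^T\mathbf{X}\vec{\theta}$$

$$\frac{\partial J}{\partial\vec{\theta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\vec{\theta} = 0 \quad\Longrightarrow\quad \mathbf{X}^T\mathbf{X}\vec{\theta} = \mathbf{X}^T\mathbf{y}$$

Multiplying both sides on the left by $(\mathbf{X}^T\mathbf{X})^{-1}$ yields the closed-form expression for $\vec{\theta}$.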

Let's apply the normal equation to identify the parameters of the OLS model.

In [4]:
# Compute parameters of OLS model
OLS_theta = np.linalg.inv(X.T @ X) @ (X.T @ y)

# Label parameters with feature names
pd.Series(OLS_theta, index=X.columns)
LinAlgError                               Traceback (most recent call last)
<ipython-input-4-d1b033489f2a> in <module>
      1 # Compute parameters of OLS model
----> 2 OLS_theta = np.linalg.inv(X.T @ X) @ (X.T @ y)
      4 # Label parameters with feature names
      5 pd.Series(OLS_theta, index=X.columns)

~/miniconda3/envs/phoenix/lib/python3.7/site-packages/numpy/linalg/ in inv(a)
    549     signature = 'D->D' if isComplexType(t) else 'd->d'
    550     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 551     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    552     return wrap(ainv.astype(result_t, copy=False))

~/miniconda3/envs/phoenix/lib/python3.7/site-packages/numpy/linalg/ in _raise_linalgerror_singular(err, flag)
     96 def _raise_linalgerror_singular(err, flag):
---> 97     raise LinAlgError("Singular matrix")
     99 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

NumPy got angry because we tried to invert a singular matrix. Specifically, $\mathbf{X}^T\mathbf{X}$ (the Gram matrix of $\mathbf{X}$) was found to be singular, meaning it doesn't have an inverse. In fact, the Gram matrix is invertible if and only if the columns of $\mathbf{X}$ are linearly independent.

Examining the columns of $\mathbf{X}$, we see that

var1_apple = 1 - (var1_orange + var1_pear + var1_banana)

var2_cat = 1 - (var2_dog + var2_fish)

For any categorical feature, each one-hot encoded column can be expressed as a linear combination of the feature's other columns and the bias column (the 1 in the expressions above); in other words, the columns are perfectly multicollinear. Therefore, the columns of $\mathbf{X}$ are linearly dependent, which explains the error.
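One way to check this linear dependence numerically is to compare the rank of the feature matrix with its number of columns. The sketch below rebuilds the toy features so that it runs on its own; the names features and X_full are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the toy categorical features from this post
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})

# One-hot encode without dropping anything, then add the bias column
X_full = pd.get_dummies(features, dtype=float)
X_full['bias'] = 1.0

# Within each feature, the one-hot columns sum to the bias column
var1_sum = X_full.filter(like='var1_').sum(axis=1)
var2_sum = X_full.filter(like='var2_').sum(axis=1)

# A rank below the column count means the columns are linearly dependent
n_columns = X_full.shape[1]
rank = np.linalg.matrix_rank(X_full.to_numpy())
print(f"columns: {n_columns}, rank: {rank}")
```

The rank falls two short of the column count, one for each categorical feature whose columns sum to the bias column.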

Making the normal equation usable again

By dropping one of the one-hot encoded columns from each categorical feature, we remove this redundancy: the dropped value becomes an implicit "reference" level absorbed by the bias term, and the remaining columns become linearly independent.

Let's verify this works by implementing it; get_dummies even has a dedicated parameter drop_first.

In [5]:
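# One-hot encode categorical features and drop first value column
# (dtype=int keeps the dummies numeric; newer pandas versions default to bool)
X_dropped = pd.get_dummies(training_set[['var1', 'var2']], drop_first=True, dtype=int)

# Add bias column
X_dropped['bias'] = np.ones(X_dropped.shape[0])

# Display first three rows
X_dropped.head(3)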
var1_banana var1_orange var1_pear var2_dog var2_fish bias
0 0 0 0 1 0 1.0
1 1 0 0 0 0 1.0
2 0 0 1 0 1 1.0

We see that var1_apple and var2_cat were dropped. Let's reattempt to use the normal equation to identify the parameters of the OLS model.

In [6]:
# Compute parameters of OLS model
OLS_theta = np.linalg.inv(X_dropped.T @ X_dropped) @ (X_dropped.T @ y)

# Label parameters with feature names
pd.Series(OLS_theta, index=X_dropped.columns)
var1_banana    14.0
var1_orange   -22.0
var1_pear      63.0
var2_dog       20.0
var2_fish     -14.0
bias          -10.0
dtype: float64

Smooth sailing this time. Therefore, when using the normal equation to create an OLS model, you must drop one of the one-hot encoded columns from each categorical feature.
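As an aside, forming an explicit inverse isn't the only way to evaluate the normal equation. The following sketch rebuilds the toy data so it runs on its own, solves the same linear system with np.linalg.solve, and cross-checks the result with np.linalg.lstsq. The names training, X_red, A, and y_vec are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the toy training set from this post
training = pd.DataFrame(
    [['apple', 'dog', 10], ['banana', 'cat', 4], ['pear', 'fish', 39],
     ['orange', 'dog', -12], ['apple', 'fish', 21], ['pear', 'cat', 53],
     ['apple', 'fish', -69]],
    columns=['var1', 'var2', 'var3'],
)

# One-hot encode, drop the first column of each feature, add the bias
X_red = pd.get_dummies(training[['var1', 'var2']], drop_first=True, dtype=float)
X_red['bias'] = 1.0
A = X_red.to_numpy()
y_vec = training['var3'].to_numpy(dtype=float)

# Solve (A^T A) theta = A^T y as a linear system instead of inverting A^T A
theta_solve = np.linalg.solve(A.T @ A, A.T @ y_vec)

# Cross-check with NumPy's least-squares solver, which works on A directly
theta_lstsq, *_ = np.linalg.lstsq(A, y_vec, rcond=None)

print(pd.Series(theta_solve, index=X_red.columns))
```

Both approaches should reproduce the parameters found with the explicit inverse.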

Regularizing improves predictions and then some

OLS models are handy when we'd like to summarize linear trends for data we already have. When the goal is prediction, however, these models are seldom useful because of their numerous pitfalls. In particular, OLS models tend to generalize poorly to new data (aka overfitting).

To prevent overfitting, applying some form of regularization is a no-brainer. $\ell_2$ regularization involves adding a penalty term—square of the $\ell_2$ norm of $\vec{\theta}$—to the objective function. Applying $\ell_2$ regularization to the OLS objective function yields

$$J(\vec{\theta}) = {\left\lVert \mathbf{y} - \mathbf{X}\vec{\theta} \right\rVert_2}^2 + \alpha{\left\lVert \vec{\theta} \right\rVert_2}^2$$

where $\alpha$ is a positive scalar hyperparameter that controls the degree of regularization (higher = more regularization).

We need to solve a new system of equations $\frac{\partial J}{\partial\vec{\theta}} = 0$, again obtained by setting the gradient to zero; fortunately, it too has a closed-form solution

$$\vec{\theta} = (\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

where $\mathbf{I}$ is an identity matrix with the same dimensions as the Gram matrix. Let's identify the parameters of the $\ell_2$ regularized model using $\alpha = 1$.

In [7]:
def create_L2_reg_model(X, y, alpha):
    """
    Generate an L2 regularized linear regression model.

    This function uses the closed-form solution to compute the parameters of
    an L2 regularized linear regression model.

    Args:
        X (DataFrame): table containing features
        y (Series): table containing target variable
        alpha (float): positive scalar controlling regularization strength
            (higher = more regularization)

    Returns:
        theta (Series): table containing identified parameters of model
    """

    # Compute identity matrix 
    I = np.identity((X.T @ X).shape[0])

    # Compute parameters
    theta = np.linalg.inv(X.T @ X + alpha * I) @ (X.T @ y)

    # Label parameters with feature names
    theta = pd.Series(theta, index=X.columns)
    return theta

# Create L2 regularized model after dropping columns 
create_L2_reg_model(X_dropped, y, alpha=1)
var1_banana     0.246537
var1_orange    -7.501385
var1_pear      32.678670
var2_dog       -0.504155
var2_fish     -13.049861
bias            3.506925
dtype: float64

A regularized model will generally perform better on new data than an OLS model. In practice, however, we'd tune the value of $\alpha$ using cross-validation to maximize model performance.

Don't bother dropping columns when regularizing

Having understood the benefits of regularization, let's try to generate an $\ell_2$ regularized model with the closed-form solution, but this time using the original one-hot encoded features prior to dropping any columns. We'll probably run into the singular matrix error again.

In [8]:
# Create L2 regularized model using original one-hot encoded features
create_L2_reg_model(X, y, alpha=1)
var1_apple     -9.617518
var1_banana    -4.306569
var1_orange    -9.543066
var1_pear      27.564964
var2_cat        8.515328
var2_dog        2.988321
var2_fish      -7.405839
bias            4.097810
dtype: float64

Wait, shouldn't NumPy have gotten angry? How were we still able to create a model? The answer is that in the closed-form solution of the $\ell_2$ regularized model above, the matrix $(\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})$ is almost surely nonsingular. I'll prove it:

  • $(\mathbf{X}^T\mathbf{X})^T = \mathbf{X}^T (\mathbf{X}^T)^T = \mathbf{X}^T \mathbf{X}$. Therefore, $\mathbf{X}^T\mathbf{X}$ is an $n \times n$ real symmetric matrix with $n$ real eigenvalues (counted with multiplicity) $\lambda_1, \lambda_2, \dots, \lambda_n$.
  • When $\alpha = -\lambda_i$, $\det(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I}) = 0$ and, therefore, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is singular
  • When $\alpha \neq -\lambda_i$, the eigenvalues of $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ are $(\lambda_1 + \alpha), \dots (\lambda_n + \alpha)$, all of which are nonzero and, therefore, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is nonsingular
  • $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is nonsingular $\forall\{\alpha \in \mathbb{R} \mid \alpha \neq -\lambda_i\}$. Therefore, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is almost surely nonsingular

"Almost surely" is an expression from probability theory describing events that occur with $P = 1$ within an infinitely large sample space. Only finitely many values of $\alpha$ (the negatives of the eigenvalues of $\mathbf{X}^T\mathbf{X}$) make $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ singular, so every other value of $\alpha$ makes it nonsingular. In fact, because the Gram matrix is positive semidefinite, its eigenvalues are nonnegative, so any $\alpha > 0$ is guaranteed to work. Practically any perturbation of a singular matrix makes it nonsingular!
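The eigenvalue argument can be checked numerically. The sketch below rebuilds the full one-hot encoded matrix (no columns dropped) so it runs on its own, then compares the eigenvalues of the Gram matrix before and after adding $\alpha\mathbf{I}$. The variable names are illustrative, and $\alpha = 1$ is simply the value used earlier in this post.

```python
import numpy as np
import pandas as pd

# Rebuild the full one-hot encoded feature matrix, bias included
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})
X_full = pd.get_dummies(features, dtype=float)
X_full['bias'] = 1.0
A = X_full.to_numpy()

# Gram matrix and its eigenvalues (eigvalsh returns them in ascending order)
gram = A.T @ A
eigenvalues = np.linalg.eigvalsh(gram)

# Adding alpha * I shifts every eigenvalue by alpha
alpha = 1.0
shifted = np.linalg.eigvalsh(gram + alpha * np.eye(gram.shape[0]))

print("smallest eigenvalue before:", eigenvalues[0])
print("smallest eigenvalue after: ", shifted[0])
```

The smallest eigenvalues of the Gram matrix are zero up to floating-point error, which is why inversion failed; after the shift, every eigenvalue is at least about $\alpha$.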

Consequently, if we apply the tiniest bit of regularization (whether it's $\ell_2$, $\ell_1$, or elastic net), we can handle features that are perfectly correlated without removing any columns. Regularization also innately addresses the effects of multicollinearity—it's pretty awesome.

So if you are regularizing, there's no need to drop one of the one-hot encoded columns from each categorical feature; math's got your back.

Skip dropping columns when using iterative numerical methods

As elegant as they are, the closed-form solutions are seldom utilized in practice. That's because matrix inversion is stupidly expensive. The time complexity of inverting an $n \times n$ matrix is $O(n^3)$ when using Gaussian elimination; more optimized algorithms can bring it down to about $O(n^{2.4})$. Unless your feature matrix has no more than a few hundred columns (rarely the case with real-world datasets), you shouldn't attempt to invert its Gram matrix.

Instead of relying on a closed-form solution, we machine learning practitioners estimate parameters via some efficient iterative numerical method such as gradient descent. Because iterative numerical methods—with or without regularization—don't involve matrix inversions, there's no reason to drop one of the one-hot encoded columns from each categorical feature when using them.
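To illustrate, here is a minimal sketch of batch gradient descent applied to the full one-hot encoded matrix, with no columns dropped and no regularization. The gradient_descent function, the learning rate, and the iteration count are assumptions chosen for this toy dataset rather than recommended settings, and the objective is the mean (not the sum) of the squared residuals, which has the same minimizers.

```python
import numpy as np
import pandas as pd

def gradient_descent(X, y, learning_rate=0.1, n_iterations=10_000):
    """Minimize the mean squared residual with plain batch gradient descent."""
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(n_iterations):
        # Gradient of (1/m) * ||y - X theta||^2 with respect to theta
        gradient = (2 / m) * X.T @ (X @ theta - y)
        theta -= learning_rate * gradient
    return theta

# Rebuild the toy data, keeping every one-hot encoded column
training = pd.DataFrame(
    [['apple', 'dog', 10], ['banana', 'cat', 4], ['pear', 'fish', 39],
     ['orange', 'dog', -12], ['apple', 'fish', 21], ['pear', 'cat', 53],
     ['apple', 'fish', -69]],
    columns=['var1', 'var2', 'var3'],
)
X_gd = pd.get_dummies(training[['var1', 'var2']], dtype=float)
X_gd['bias'] = 1.0
X_gd = X_gd.to_numpy()
y_gd = training['var3'].to_numpy(dtype=float)

# No matrix is inverted, so the linearly dependent columns cause no error
theta_gd = gradient_descent(X_gd, y_gd)
fitted = X_gd @ theta_gd
print(np.round(fitted, 3))
```

The linearly dependent columns mean that many parameter vectors give the same fit, but gradient descent still settles on one of them, and its fitted values match those of the OLS model found earlier with the normal equation.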

Maybe just stop dropping columns altogether

So far we've discussed a few situations where removing one of the one-hot encoded columns isn't mandatory. However, dropping these columns can also have unforeseen, deleterious consequences.

Did you notice that the parameters of the one-hot encoded features took different values depending on whether columns were removed or not? For example, when columns are dropped, $\theta_{var1\_banana} = 0.247$ and $\theta_{var2\_dog} = -0.504$; otherwise, $\theta_{var1\_banana} = -4.307$ and $\theta_{var2\_dog} = 2.988$. If we were planning to use these parameters to get a sense of feature importance, dropping columns would tell a whole other story!

Because we alter the model's parameters by dropping one-hot encoded columns, we also change its predictions. What's more alarming is that dropping a different column from each categorical feature yields an entirely new set of parameters.

For example, instead of var1_apple and var2_cat, let's drop var1_banana and var2_dog from the one-hot encoded features.

In [9]:
# Drop different one-hot encoded columns from each categorical feature
X_dropped = X.drop(['var1_banana', 'var2_dog'], axis=1)

# Create L2 regularized model after dropping different set of columns
create_L2_reg_model(X_dropped, y, alpha=1)
var1_apple     -8.651452
var1_orange    -8.199170
var1_pear      28.286307
var2_cat        6.639004
var2_fish      -8.294606
bias            4.398340
dtype: float64

If we arrive at a different model depending on the particular set of columns removed, how do we pick the right model? There's no good answer here—removing columns isn't trivial. You're better off staying objective and leaving one-hot encoded features alone.
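To see the effect on predictions rather than just parameters, the sketch below fits the $\ell_2$ regularized model ($\alpha = 1$) on three encodings of the same toy data and prints each model's predictions for the training rows. The ridge_fit helper and the dictionary keys are illustrative names; the helper solves the closed-form system with np.linalg.solve instead of an explicit inverse.

```python
import numpy as np
import pandas as pd

def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha * I) theta = X^T y for theta."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ y)

# Rebuild the toy data and the full one-hot encoding with bias
training = pd.DataFrame(
    [['apple', 'dog', 10], ['banana', 'cat', 4], ['pear', 'fish', 39],
     ['orange', 'dog', -12], ['apple', 'fish', 21], ['pear', 'cat', 53],
     ['apple', 'fish', -69]],
    columns=['var1', 'var2', 'var3'],
)
X_full = pd.get_dummies(training[['var1', 'var2']], dtype=float)
X_full['bias'] = 1.0
y_vec = training['var3'].to_numpy(dtype=float)

# Three encodings of the same data
encodings = {
    'no columns dropped': X_full,
    'apple and cat dropped': X_full.drop(columns=['var1_apple', 'var2_cat']),
    'banana and dog dropped': X_full.drop(columns=['var1_banana', 'var2_dog']),
}

# Fit each model and predict on the training rows
predictions = {}
for name, X_enc in encodings.items():
    A = X_enc.to_numpy()
    theta = ridge_fit(A, y_vec, alpha=1.0)
    predictions[name] = A @ theta
    print(f"{name}: {np.round(predictions[name], 3)}")
```

For the same training row, the three models give noticeably different predictions, even though they were trained on identical information.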


Conclusions

Feature engineering is the most important aspect of creating an effective model, so you want to get it right. When dealing with categorical features, a common convention is to drop one of the one-hot encoded columns from each feature. Here we discovered this convention is only required when creating an OLS model with the normal equation.

However, a cornerstone of machine learning is to produce a highly predictive model; therefore, we rarely turn to OLS models and almost always apply regularization. Even if we were to create an $\ell_2$ regularized model with a closed-form solution, the gorgeous math behind regularization would lift the obligation of removing one-hot encoded columns.

Nevertheless, the normal equation and other closed-form solutions are seldom practical due to their computational cost. Instead, we machine learning practitioners prefer creating linear regression models using iterative numerical methods that don't demand dropping one-hot encoded columns.

Finally, we found that dropping one-hot encoded columns tampers with a linear regression model's parameters and predictions. We also end up with a distinct model depending on which set of columns we happened to drop.

In summary, we've uncovered one unlikely use case where removing one of the one-hot encoded columns from each categorical feature is crucial for creating a linear regression model, two common situations when it's unnecessary, and two reasons why it's perilous. I'll leave the decision to you.

What about logistic regression? The same reasons actually apply to generalized linear models. There's even less of a reason to drop one-hot encoded columns when using logistic regression because there is no known closed-form solution for identifying its parameters. We always rely on an iterative numerical method. That is, unless your training set has two examples.

Side note: I recommend avoiding pandas' get_dummies and switching to a more robust one-hot encoder, such as OneHotEncoder from scikit-learn—it's designed to handle these frequent scenarios:

  • A categorical feature containing values that appear in the test set but not the training set
  • A categorical feature in the test set containing a subset of the total possible values

Notice how OneHotEncoder keeps every one-hot encoded column by default (scikit-learn releases from 0.21 onward do offer an optional drop parameter)...
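As a rough illustration of the first scenario, the sketch below fits scikit-learn's OneHotEncoder on a small training column and then transforms a test column containing an unseen value. The fruit arrays are made up for this example, and handle_unknown='ignore' is one possible policy (the encoder raises an error on unseen values by default).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test columns; 'kiwi' never appears in training
train_fruit = np.array([['apple'], ['banana'], ['pear']])
test_fruit = np.array([['banana'], ['kiwi']])

# Learn the categories from the training set only
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_fruit)

# Unseen categories are encoded as an all-zero row instead of raising
encoded_test = encoder.transform(test_fruit).toarray()

print(encoder.categories_[0])
print(encoded_test)
```

Because the categories are learned once from the training set, the test set is always encoded with the same columns in the same order, whichever values it happens to contain.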

