# Think twice before dropping that first one-hot encoded column

Many machine learning models demand that categorical features are converted to a format they can comprehend via a widely used feature engineering technique called **one-hot encoding**. Machines aren't *that* smart.

A common convention after one-hot encoding is to remove one of the one-hot encoded columns from each categorical feature. For example, the feature `sex` containing values of `male` and `female` is transformed into the columns `sex_male` and `sex_female`, each containing binary values. Because either of these columns provides sufficient information to determine a person's sex, we can drop one of them.
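To make the `sex` example concrete, here is a minimal sketch using pandas. The four-row `people` table is made up purely for illustration, and `dtype=int` is passed so the output shows 0/1 values rather than booleans.

```python
import pandas as pd

# Hypothetical toy data for the `sex` example described above
people = pd.DataFrame({'sex': ['male', 'female', 'female', 'male']})

# One-hot encode; dtype=int gives 0/1 values instead of booleans
encoded = pd.get_dummies(people, dtype=int)
print(encoded)

# Either column fully determines the other
redundant = (encoded['sex_female'] == 1 - encoded['sex_male']).all()
print(redundant)
```

Since `sex_female` is always `1 - sex_male`, keeping both columns adds no new information.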

In this post, we dive deep into the circumstances where this convention is relevant, necessary, or even prudent.

## Table of contents

- Preparing the data
- Creating a linear regression model with ordinary least-squares
- Making the normal equation usable again
- Regularizing improves predictions and then some
- Don't bother dropping columns when regularizing
- Skip dropping columns when using iterative numerical methods
- Maybe just stop dropping columns altogether
- Conclusions

## Preparing the data

Let's generate a toy dataset with three variables; the third column serves as the target variable while the remaining are categorical features. Because we're working with a continuous target variable, we'll create a linear regression model.

```
# Load packages
import numpy as np
import pandas as pd
# Create training set
training_set = pd.DataFrame(
    [
        ['apple', 'dog', 10],
        ['banana', 'cat', 4],
        ['pear', 'fish', 39],
        ['orange', 'dog', -12],
        ['apple', 'fish', 21],
        ['pear', 'cat', 53],
        ['apple', 'fish', -69]
    ],
    columns=['var1', 'var2', 'var3']
)
training_set
```

We can use the pandas function `get_dummies` to perform one-hot encoding and generate the feature matrix $\mathbf{X}$.

Let's also add a bias term to $\mathbf{X}$ as a new column so that any model we create isn't confined to passing through the origin.

```
# One-hot encode categorical features (dtype=float keeps the matrix numeric;
# newer pandas versions default to booleans)
X = pd.get_dummies(training_set[['var1', 'var2']], dtype=float)
# Add bias column
X['bias'] = np.ones(X.shape[0])
# Display first three rows
X.head(3)
```

Finally, let's identify the target variable $\mathbf{y}$.

```
# Extract target variable
y = training_set['var3']
```

## Creating a linear regression model with ordinary least-squares

In a linear regression model, we express the target variable $\mathbf{y}$ as a linear function of the features $\mathbf{X}$ and some unknown set of parameters $\vec{\theta}$:

$$\mathbf{y} = \mathbf{X}\vec{\theta}$$

The simplest algorithm for finding this "line of best fit" is **ordinary least-squares (OLS)**; it identifies $\vec\theta$ that minimizes the sum of the squared residuals. Therefore, the objective function for OLS is

$$J(\vec{\theta}) = {\left\lVert \mathbf{y} - \mathbf{X}\vec{\theta} \right\rVert_2}^2$$

Next, we set the gradient of the objective to zero, $\frac{\partial J}{\partial\vec{\theta}} = 0$. This yields a system of linear equations that, when $\mathbf{X}^T\mathbf{X}$ is invertible, conveniently has a closed-form solution called the **normal equation**:

$$\vec{\theta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
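For readers who want the intermediate step, here is a brief sketch of how the normal equation falls out of the OLS objective, assuming $\mathbf{X}^T\mathbf{X}$ is invertible:

$$\frac{\partial J}{\partial\vec{\theta}} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\vec{\theta}) = 0 \;\Longrightarrow\; \mathbf{X}^T\mathbf{X}\vec{\theta} = \mathbf{X}^T\mathbf{y}$$

Multiplying both sides on the left by $(\mathbf{X}^T\mathbf{X})^{-1}$ gives the expression for $\vec{\theta}$ above.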

Let's apply the normal equation to identify the parameters of the OLS model.

```
# Compute parameters of OLS model
OLS_theta = np.linalg.inv(X.T @ X) @ (X.T @ y)
# Label parameters with feature names
pd.Series(OLS_theta, index=X.columns)
```

NumPy got angry because we tried to invert a singular matrix. Specifically, $\mathbf{X}^T\mathbf{X}$ (the Gram matrix of $\mathbf{X}$) was found to be **singular**, meaning it doesn't have an inverse. In fact, the Gram matrix is invertible if and only if the columns of $\mathbf{X}$ are linearly independent.

Examining the columns of $\mathbf{X}$, we see that

`var1_apple` = 1 - (`var1_orange` + `var1_pear` + `var1_banana`)

`var2_cat` = 1 - (`var2_dog` + `var2_fish`)

For any categorical feature, each one-hot encoded column can be expressed as a linear combination of the others and the bias column; they're perfectly multicollinear. Therefore, the columns of $\mathbf{X}$ are linearly *dependent*, which explains the error.
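One way to confirm the linear dependence numerically is to compare the number of columns with the matrix rank. The sketch below rebuilds the categorical features of the toy dataset so it runs on its own; the names `features` and `X_full` are my own, and `dtype=float` is an assumption to keep the matrix numeric.

```python
import numpy as np
import pandas as pd

# Same categorical features as the toy training set above
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})

# One-hot encode as floats and append the bias column
X_full = pd.get_dummies(features, dtype=float)
X_full['bias'] = 1.0

# Fewer independent columns than total columns means the Gram matrix is singular
n_columns = X_full.shape[1]
rank = np.linalg.matrix_rank(X_full.to_numpy())
print(n_columns, rank)
```

The rank falls short of the column count by two: one redundancy for each categorical feature.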

## Making the normal equation usable again

By dropping one of the one-hot encoded columns from each categorical feature, we remove the redundancy: the dropped value becomes the baseline captured by the bias term, and the remaining columns become linearly independent.

Let's verify this works by implementing it; `get_dummies` even has a dedicated parameter, `drop_first`.

```
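# One-hot encode categorical features and drop first value column
# (dtype=float keeps the matrix numeric; newer pandas versions default to booleans)
X_dropped = pd.get_dummies(training_set[['var1', 'var2']], drop_first=True, dtype=float)
# Add bias column
X_dropped['bias'] = np.ones(X_dropped.shape[0])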
# Display first three rows
X_dropped.head(3)
```

We see that `var1_apple` and `var2_cat` were dropped. Let's reattempt to use the normal equation to identify the parameters of the OLS model.

```
# Compute parameters of OLS model
OLS_theta = np.linalg.inv(X_dropped.T @ X_dropped) @ (X_dropped.T @ y)
# Label parameters with feature names
pd.Series(OLS_theta, index=X_dropped.columns)
```

Smooth sailing this time. Therefore, when using the normal equation to create an OLS model, you *must* drop one of the one-hot encoded columns from each categorical feature.
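As a sanity check, the sketch below compares the normal-equation parameters with NumPy's least-squares solver, `np.linalg.lstsq`. It rebuilds the toy data so it runs on its own; the variable names are my own, and `dtype=float` is an assumption to keep the matrix numeric.

```python
import numpy as np
import pandas as pd

# Toy training data from the post
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})
y = np.array([10, 4, 39, -12, 21, 53, -69], dtype=float)

# Drop the first one-hot encoded column of each feature, then add the bias
X_drop = pd.get_dummies(features, drop_first=True, dtype=float)
X_drop['bias'] = 1.0
A = X_drop.to_numpy()

# Parameters from the normal equation
theta_normal = np.linalg.inv(A.T @ A) @ (A.T @ y)

# Parameters from NumPy's least-squares solver, used here as a reference
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

agree = np.allclose(theta_normal, theta_lstsq)
print(agree)
```

Because the reduced matrix has full column rank, the least-squares solution is unique and both routes arrive at it.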

## Regularizing improves predictions and then some

OLS models are handy when we'd like to summarize linear trends for data we *already have*. When the goal is prediction, however, these models are seldom useful because of their numerous pitfalls. In particular, OLS models tend to generalize poorly to new data (aka overfitting).

To prevent overfitting, applying some form of **regularization** is a no-brainer. $\ell_2$ regularization adds a penalty term, the square of the $\ell_2$ norm of $\vec{\theta}$, to the objective function. Applying $\ell_2$ regularization to the OLS objective function yields

$$J(\vec{\theta}) = {\left\lVert \mathbf{y} - \mathbf{X}\vec{\theta} \right\rVert_2}^2 + \alpha{\left\lVert \vec{\theta} \right\rVert_2}^2$$

where $\alpha$ is a positive scalar hyperparameter that controls the degree of regularization (higher = more regularization).

Setting the gradient of the new objective to zero, $\frac{\partial J}{\partial\vec{\theta}} = 0$, again gives a system of linear equations; fortunately, it too has a closed-form solution

$$\vec{\theta} = (\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$

where $\mathbf{I}$ is an identity matrix with the same dimensions as the Gram matrix. Let's identify the parameters of the $\ell_2$ regularized model using $\alpha = 1$.

```
def create_L2_reg_model(X, y, alpha):
    """
    Generate an L2 regularized linear regression model.

    This function uses the closed-form solution to compute the parameters of
    an L2 regularized linear regression model.

    Args:
        X (DataFrame): table containing features
        y (Series): table containing target variable
        alpha (float): positive scalar controlling regularization strength
            (higher = more regularization)

    Returns:
        theta (Series): table containing identified parameters of model
    """
    # Compute identity matrix
    I = np.identity((X.T @ X).shape[0])
    # Compute parameters
    theta = np.linalg.inv(X.T @ X + alpha * I) @ (X.T @ y)
    # Label parameters with feature names
    theta = pd.Series(theta, index=X.columns)
    return theta

# Create L2 regularized model after dropping columns
create_L2_reg_model(X_dropped, y, alpha=1)
```

A regularized model will generally perform better on new data than an OLS model. In practice, however, we'd tune the value of $\alpha$ using cross-validation to maximize model performance.
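As a rough illustration of tuning $\alpha$, here is a minimal leave-one-out cross-validation sketch on the toy data. The candidate values in `alphas` and the helpers `ridge_theta` and `loo_mse` are hypothetical, and `np.linalg.solve` stands in for the explicit inverse (it yields the same parameters). With only seven rows, the scores are far too noisy to take seriously; the point is the mechanics.

```python
import numpy as np
import pandas as pd

# Toy training data from the post
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})
y = np.array([10, 4, 39, -12, 21, 53, -69], dtype=float)

X_cv = pd.get_dummies(features, drop_first=True, dtype=float)
X_cv['bias'] = 1.0
A = X_cv.to_numpy()


def ridge_theta(A, y, alpha):
    """Closed-form L2 regularized parameters (solves the linear system)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.identity(n), A.T @ y)


def loo_mse(A, y, alpha):
    """Mean squared error over leave-one-out folds."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i          # hold out row i
        theta = ridge_theta(A[keep], y[keep], alpha)
        errors.append((y[i] - A[i] @ theta) ** 2)
    return float(np.mean(errors))


alphas = [0.01, 0.1, 1.0, 10.0]                # hypothetical candidates
scores = {alpha: loo_mse(A, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)
print(scores)
print(best_alpha)
```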

## Don't bother dropping columns when regularizing

Having understood the benefits of regularization, let's try to generate an $\ell_2$ regularized model with the closed-form solution, but this time using the original one-hot encoded features prior to dropping any columns. We'll probably run into the singular matrix error again.

```
# Create L2 regularized model using original one-hot encoded features
create_L2_reg_model(X, y, alpha=1)
```

Wait, shouldn't NumPy have gotten angry? How were we still able to create a model? The answer is that in the closed-form solution of the $\ell_2$ regularized model above, **the matrix $(\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})$ is nonsingular for every $\alpha > 0$**. I'll prove it:

- $(\mathbf{X}^T\mathbf{X})^T = \mathbf{X}^T (\mathbf{X}^T)^T = \mathbf{X}^T \mathbf{X}$. Therefore, $\mathbf{X}^T\mathbf{X}$ is an $n \times n$ symmetric matrix with $n$ real eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ (counted with multiplicity).
- The eigenvalues of $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ are $(\lambda_1 + \alpha), \dots, (\lambda_n + \alpha)$, and a matrix is singular exactly when one of its eigenvalues is zero. Therefore, $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is singular only when $\alpha = -\lambda_i$ for some $i$.
- For every other real $\alpha$, all of these eigenvalues are nonzero and $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is nonsingular. Because only finitely many values of $\alpha$ are excluded, an $\alpha$ drawn at random from a continuous range is almost surely fine.
- Better yet, $\vec{v}^T\mathbf{X}^T\mathbf{X}\vec{v} = {\left\lVert \mathbf{X}\vec{v} \right\rVert_2}^2 \geq 0$ for any vector $\vec{v}$, so $\mathbf{X}^T\mathbf{X}$ is positive semidefinite and every $\lambda_i \geq 0$. Since $\alpha > 0$, every $\lambda_i + \alpha > 0$ and $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})$ is always nonsingular.

"Almost surely" is an expression from probability theory describing events that occur with probability 1. Here we don't even need the hedge: regularization requires $\alpha > 0$, which can never equal the negative of a nonnegative eigenvalue. Adding any positive multiple of the identity to a singular Gram matrix makes it nonsingular!
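A quick numerical check of the eigenvalue argument on the toy data is sketched below. It rebuilds the full one-hot encoded matrix (no columns dropped) so it runs on its own; the variable names are my own, and `np.linalg.eigvalsh` is used because the Gram matrix is symmetric.

```python
import numpy as np
import pandas as pd

# Same categorical features as the toy training set
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})

# Full one-hot encoding plus bias, without dropping any columns
X_full = pd.get_dummies(features, dtype=float)
X_full['bias'] = 1.0
A = X_full.to_numpy()

gram = A.T @ A
alpha = 1.0

# eigvalsh handles symmetric matrices and returns real eigenvalues in ascending order
eig_gram = np.linalg.eigvalsh(gram)
eig_shifted = np.linalg.eigvalsh(gram + alpha * np.identity(gram.shape[0]))

print(np.round(eig_gram, 6))     # two eigenvalues are (numerically) zero
print(np.round(eig_shifted, 6))  # every eigenvalue is shifted up by alpha
```

The zero eigenvalues are what make the Gram matrix singular; after adding $\alpha\mathbf{I}$, the smallest eigenvalue becomes $\alpha$ and the inverse exists.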

Consequently, if we apply the *tiniest* bit of regularization (whether it's $\ell_2$, $\ell_1$, or elastic net), we can handle features that are perfectly correlated without removing any columns. Regularization also innately addresses the effects of multicollinearity—it's pretty awesome.

So if you *are* regularizing, there's no need to drop one of the one-hot encoded columns from each categorical feature; math's got your back.

## Skip dropping columns when using iterative numerical methods

As elegant as they are, the closed-form solutions are seldom utilized in practice. That's because matrix inversion is stupidly expensive. The time complexity of inverting an $n \times n$ matrix is $O(n^3)$ when using Gaussian elimination; more optimized algorithms can bring it down to about $O(n^{2.4})$. Unless the matrix has at most a few hundred columns (rarely the case with real-world datasets), you shouldn't attempt to invert it.

Instead of relying on a closed-form solution, we machine learning practitioners estimate parameters via some efficient iterative numerical method such as gradient descent. Because iterative numerical methods—with or without regularization—don't involve matrix inversions, there's no reason to drop one of the one-hot encoded columns from each categorical feature when using them.
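Here is a minimal sketch of plain gradient descent on the unregularized OLS objective, run on the full one-hot encoded matrix with no columns dropped. The step size is derived from the largest singular value purely as a convenience so the sketch converges reliably (a hand-tuned learning rate is more common), and the iteration count is an arbitrary choice. `np.linalg.lstsq` serves only as a reference for the fitted values.

```python
import numpy as np
import pandas as pd

# Toy training data from the post
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})
y = np.array([10, 4, 39, -12, 21, 53, -69], dtype=float)

# Full one-hot encoding plus bias, without dropping any columns
X_full = pd.get_dummies(features, dtype=float)
X_full['bias'] = 1.0
A = X_full.to_numpy()

# Safe step size: 1 / (2 * largest eigenvalue of the Gram matrix)
step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)

theta = np.zeros(A.shape[1])
for _ in range(20_000):
    gradient = -2.0 * A.T @ (y - A @ theta)  # gradient of the OLS objective
    theta -= step * gradient

# Reference least-squares fit for comparison
theta_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
same_fit = np.allclose(A @ theta, A @ theta_ref, atol=1e-6)
print(same_fit)
```

No matrix is inverted anywhere in the loop, so the linearly dependent columns never cause an error.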

## Maybe just stop dropping columns altogether

So far we've discussed a few situations where removing one of the one-hot encoded columns isn't mandatory. However, dropping these columns can also have unforeseen, deleterious consequences.

Did you notice that the parameters of the one-hot encoded features had different values depending on whether columns were removed or not? For example, when columns are dropped, $\theta_{var1\_banana} = -4.307$ and $\theta_{var2\_dog} = 2.988$; otherwise, $\theta_{var1\_banana} = 0.247$ and $\theta_{var2\_dog} = -0.504$. If we were planning to use these parameters to get a sense of feature importance, dropping columns would tell a whole other story!

Because we alter the model's parameters by dropping one-hot encoded columns, we also change its predictions. What's more alarming is that dropping a different column from each categorical feature yields an entirely new set of parameters.

For example, instead of `var1_apple` and `var2_cat`, let's drop `var1_banana` and `var2_dog` from the one-hot encoded features.

```
# Drop different one-hot encoded columns from each categorical feature
X_dropped = X.drop(['var1_banana', 'var2_dog'], axis=1)
# Create L2 regularized model after dropping different set of columns
create_L2_reg_model(X_dropped, y, alpha=1)
```

If we arrive at a different model depending on the particular set of columns removed, how do we pick the right model? There's no good answer here—removing columns isn't trivial. You're better off staying objective and leaving one-hot encoded features alone.
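To make the comparison concrete, here is a small sketch that fits the $\ell_2$ regularized model on both reduced feature sets and measures how far apart their predictions on the training data are. The helper `ridge_predictions` is hypothetical, and `np.linalg.solve` is used in place of the explicit inverse (it yields the same parameters).

```python
import numpy as np
import pandas as pd

# Toy training data from the post
features = pd.DataFrame({
    'var1': ['apple', 'banana', 'pear', 'orange', 'apple', 'pear', 'apple'],
    'var2': ['dog', 'cat', 'fish', 'dog', 'fish', 'cat', 'fish'],
})
y = np.array([10, 4, 39, -12, 21, 53, -69], dtype=float)

X_full = pd.get_dummies(features, dtype=float)
X_full['bias'] = 1.0


def ridge_predictions(X, y, alpha=1.0):
    """Fit the closed-form L2 regularized model and predict on the same rows."""
    A = X.to_numpy()
    theta = np.linalg.solve(A.T @ A + alpha * np.identity(A.shape[1]), A.T @ y)
    return A @ theta


# Two different choices of dropped columns
pred_a = ridge_predictions(X_full.drop(columns=['var1_apple', 'var2_cat']), y)
pred_b = ridge_predictions(X_full.drop(columns=['var1_banana', 'var2_dog']), y)

max_gap = float(np.max(np.abs(pred_a - pred_b)))
print(max_gap)
```

A nonzero gap means the two regularized models disagree on the very rows they were trained on.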

## Conclusions

Feature engineering is the most important aspect of creating an effective model—you want to get it right. When dealing with categorical features, a common convention is to drop one of the one-hot encoded columns from each feature. Here we discovered this convention is *only required* when creating an OLS model with the normal equation.

However, a cornerstone of machine learning is to produce a highly predictive model; therefore, we rarely turn to OLS models and *always* apply regularization. Even if we were to create an $\ell_2$ regularized model with a closed-form solution, the gorgeous math behind regularization would lift the obligation of removing one-hot encoded columns.

Nevertheless, the normal equation and other closed-form solutions are seldom practical due to their computational cost. Instead, we machine learning practitioners prefer creating linear regression models using iterative numerical methods that don't demand dropping one-hot encoded columns.

Finally, we found that dropping one-hot encoded columns tampers with a linear regression model's parameters and predictions. We also end up with a distinct model depending on which set of columns we happened to drop.

In summary, we've uncovered one unlikely use case where removing one of the one-hot encoded columns from each categorical feature is crucial for creating a linear regression model, two common situations when it's unnecessary, and two reasons why it's perilous. I'll leave it to you.

What about *logistic* regression? The same reasons actually apply to generalized linear models. There's even less of a reason to drop one-hot encoded columns when using logistic regression because there is no known closed-form solution for identifying its parameters. We always rely on an iterative numerical method. That is, unless your training set has two examples.

**Side note:** I recommend avoiding pandas' `get_dummies` and switching to a more robust one-hot encoder, such as `OneHotEncoder` from scikit-learn, which is designed to handle these frequent scenarios:

- A categorical feature containing values that appear in the test set but not the training set
- A categorical feature in the test set containing a subset of the total possible values
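The second scenario is easy to reproduce with pandas alone. In the sketch below, the tiny `train` and `test` tables are hypothetical, and the `reindex` call is just one manual workaround. `OneHotEncoder` avoids the problem by learning the categories in `fit` and reusing them in `transform`, and its `handle_unknown='ignore'` option covers the first scenario.

```python
import pandas as pd

# Hypothetical training and test sets for a single categorical feature
train = pd.DataFrame({'var2': ['dog', 'cat', 'fish']})
test = pd.DataFrame({'var2': ['dog', 'dog']})  # only a subset of the values

# Encoding each set independently produces mismatched columns
train_encoded = pd.get_dummies(train, dtype=int)
test_encoded = pd.get_dummies(test, dtype=int)
print(list(train_encoded.columns))
print(list(test_encoded.columns))

# Manual workaround: align to the training columns, filling the gaps with zeros
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```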

Notice how `OneHotEncoder` keeps every one-hot encoded column by default (its `drop` parameter, added in scikit-learn 0.21, defaults to `None`)...
