Often in our machine learning models, we encounter qualitative predictors.
One of the ways to deal with these factors is to create a dummy variable. If a factor has two levels or possible values, then we simply create a dummy variable that takes two possible numerical values. For example, if the gender variable takes two values, Male and Female, we can create a dummy variable x_i that equals 1 if the i-th observation is Female and 0 if it is Male, and include this new variable in our model. This results in the model:
\[y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\]
If a qualitative predictor has more than two levels, we create additional dummy variables. For example, if quality has three possible values, Bad, Medium, and Good, we create two dummy variables, one indicating Medium and one indicating Good, with Bad serving as the baseline level.
We deal with these factors in the following ways:
One-hot encoding is one of the most widespread methods used to deal with categorical values. Another method is dummy encoding. There is a slight difference between the two: if a categorical variable has n values, one-hot encoding converts it into n variables, while dummy encoding converts it into n-1 variables.
By default, pandas' get_dummies does one-hot encoding, not dummy encoding. To produce dummy encoding, pass drop_first=True.
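A short sketch of the difference, using a made-up 3-level quality column:

```python
import pandas as pd

df = pd.DataFrame({"quality": ["Bad", "Medium", "Good", "Medium"]})

# One-hot encoding (default): one indicator column per level, n = 3 columns here
print(pd.get_dummies(df["quality"]))

# Dummy encoding: drop the first level, keeping n - 1 = 2 columns
print(pd.get_dummies(df["quality"], drop_first=True))
```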
scikit-learn also provides methods to deal with categorical variables, e.g. sklearn.preprocessing.LabelEncoder(). LabelEncoder performs incremental encoding, assigning each category an integer label such as 0, 1, 2, 3, 4, … We can also use scikit-learn's sklearn.preprocessing.OneHotEncoder(). There is one more type of encoding, frequency encoding, which maps each category to its frequency in the data.
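A sketch of these three encodings on the same made-up column; the frequency encoding is done here with a plain value_counts mapping:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

quality = pd.Series(["Bad", "Medium", "Good", "Medium", "Bad"], name="quality")

# LabelEncoder: each category becomes an integer label (categories sorted alphabetically)
le = LabelEncoder()
print(le.fit_transform(quality))                 # e.g. Bad -> 0, Good -> 1, Medium -> 2

# OneHotEncoder: one column per category; expects a 2-D input
ohe = OneHotEncoder()
print(ohe.fit_transform(quality.to_frame()).toarray())

# Frequency encoding: map each category to its relative frequency
freq = quality.value_counts(normalize=True)
print(quality.map(freq))
```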
Label and frequency encodings are often used for tree-based methods, while one-hot encoding is used for non-tree-based models.