As an aspiring data scientist, I often find myself asking, what if:
- My dataset has many categorical columns
- Every column has many unique values
- One-hot encoding gives me 100 additional columns
- I run into the curse of dimensionality because of all those extra columns
Luckily, pandas and scikit-learn give us quite a few tools for dealing with high-cardinality categorical columns. As novices, we usually reach for one-hot encoding (OHE) to turn categorical values into dummy variables, but its biggest drawback is that it adds one new column for every unique value in the original column. Imagine you have 10 categorical columns with an average of 15 unique values each: one-hot encoding creates 150 indicator columns in place of the original 10, so your data frame ends up with 140 additional columns. This can hurt model performance, efficiency, and accuracy. A few encoding techniques are listed below:
- One-hot encoding: Use this technique when the column has only a few unique values, say 3–5.
- Label encoding: Use this technique to assign a number to every distinct value. The drawback is that a regression model will treat the result as an ordinal variable and compare the numbers with one another, so we can't use this technique if the values in the column have no inherent order.
- Target-based encoding: Use this technique to encode a column through its relationship with the target (dependent) variable. Group the rows by each value in the column, compute the target variable's average (if the target is continuous) or mode (if the target is categorical), and then replace each categorical value with that average or mode.
A snippet for target-based encoding is below:
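
One way target-based encoding could look in pandas is sketched here. This is a minimal illustration, not the only way to do it: the toy DataFrame and the column names `city`, `price`, and `segment` are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: `city` is the categorical feature to encode,
# `price` is a continuous target and `segment` is a categorical target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 50.0, 40.0, 70.0],
    "segment": ["low", "low", "high", "high", "low", "high"],
})

# Continuous target: replace each category with the mean target of its group.
city_mean = df.groupby("city")["price"].mean()
df["city_mean_price"] = df["city"].map(city_mean)

# Categorical target: replace each category with the most frequent target
# value (the mode) of its group. mode() can return several values on a tie,
# so we simply take the first one here.
city_mode = df.groupby("city")["segment"].agg(lambda s: s.mode().iloc[0])
df["city_mode_segment"] = df["city"].map(city_mode)

print(df)
```

Because the encoding is built from the target, it is usually safer to compute the group averages on the training split only and then map them onto the validation and test data. Otherwise information about the target leaks into the features.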
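
To make the column growth from one-hot encoding concrete, and to contrast it with label encoding, here is a small sketch in pandas. The data is synthetic: it assumes exactly 10 categorical columns with 15 distinct values each, and the names `cat_0` ... `cat_9` are hypothetical.

```python
import pandas as pd

# Hypothetical data: 10 categorical columns, each with 15 distinct values.
n_cols, n_values = 10, 15
df = pd.DataFrame({
    f"cat_{i}": [f"v{j}" for j in range(n_values)]
    for i in range(n_cols)
})

# One-hot encoding: one indicator column per distinct value in each column.
one_hot = pd.get_dummies(df)
print(one_hot.shape[1])                # 150 indicator columns
print(one_hot.shape[1] - df.shape[1])  # 140 more columns than we started with

# Label encoding: every distinct value becomes an integer code, so the
# column count stays at 10, but the codes imply an order that may not exist.
label = df.apply(lambda col: col.astype("category").cat.codes)
print(label.shape[1])                  # 10
```

scikit-learn offers `OneHotEncoder` and `OrdinalEncoder` for the same two jobs when the encoding needs to live inside a modeling pipeline. The pandas calls are used here only to keep the sketch short.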