How do I deal with high-cardinality categorical columns?

  1. Every column has many unique values
  2. One-hot encoding gives me 100 additional columns
  3. I am facing the curse of dimensionality due to #2
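
To make the column blow-up concrete, here is a minimal pure-Python sketch of one-hot encoding. The data and the helper name `one_hot` are made up for illustration and are not from any library; the point is only that a column with 100 distinct values turns into 100 mostly-zero indicator columns.

```python
# Hypothetical toy column: 300 rows drawn from 100 distinct values.
cities = [f"city_{i}" for i in range(100)] * 3


def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)  # one slot per distinct value
        row[index[v]] = 1            # exactly one slot is set per row
        rows.append(row)
    return categories, rows


categories, encoded = one_hot(cities)
n_columns = len(categories)  # one new column per unique value
print(n_columns)             # 100
```

Every row contains a single 1 and 99 zeros, which is why the encoded matrix gets wide and sparse so quickly.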
  1. Label Encoding: This technique assigns a number to each distinct value. The drawback is that a regression model will treat the encoded column as an ordinal variable and compare the values with each other, so we can't use this technique if the values in the column have no natural order. A snippet for target-based encoding is below:
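
What follows is a minimal sketch, assuming a toy list of categories with a numeric target. The function names `label_encode` and `target_encode`, the sample data, and the smoothing weight `m` are all illustrative assumptions, not part of any particular library. It contrasts label encoding (arbitrary integers) with target encoding, where each category is replaced by a smoothed mean of the target, so the column stays a single numeric feature.

```python
from collections import defaultdict


def label_encode(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return mapping


def target_encode(categories, targets, m=2.0):
    """Smoothed per-category mean of the target.

    encoding = (sum_of_targets + m * global_mean) / (count + m)
    A larger m (an assumed smoothing weight) pulls rare categories
    harder toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}


# Hypothetical data: a categorical column and a numeric target.
cities = ["paris", "paris", "rome", "oslo", "rome", "paris"]
prices = [10.0, 14.0, 6.0, 8.0, 4.0, 12.0]

labels = label_encode(cities)                  # arbitrary integers, implies an order
encoding = target_encode(cities, prices, m=2.0)
encoded_column = [encoding[c] for c in cities]  # one numeric column, no extra columns
```

If you try this on real data, compute the encoding from the training split only and then apply that mapping to the validation and test rows; otherwise the target leaks into the feature.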



