The problem of dealing with high-cardinality categorical columns

  1. Every column has many unique values
  2. One-hot encoding gives me 100 additional columns
  3. Facing the curse of dimensionality due to #2
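
To make the column explosion concrete, here is a minimal sketch using pandas. The DataFrame and the `city` column are hypothetical stand-ins for a real high-cardinality column; the point is only that one-hot encoding produces one new column per distinct value.

```python
import pandas as pd

# Hypothetical data: a single categorical column with 100 distinct values,
# each repeated 3 times (300 rows in total).
df = pd.DataFrame({"city": [f"city_{i}" for i in range(100)] * 3})

# Count the distinct categories in the column.
n_unique = df["city"].nunique()

# One-hot encode: every distinct value becomes its own indicator column.
one_hot = pd.get_dummies(df["city"], prefix="city")

print(n_unique)       # number of distinct categories
print(one_hot.shape)  # (number of rows, number of indicator columns)
```

With several such columns in one dataset, the indicator columns add up quickly, which is where the dimensionality problem comes from.
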
  1. Label Encoding: This technique assigns a number to every distinct value. The drawback is that a regression model will treat those numbers as an ordinal variable and compare them with each other. So we can't use this technique if the values in the column have no ordinal relationship with each other. A snippet for target-based encoding is below:
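
A minimal sketch of target-based encoding, assuming a pandas DataFrame with a hypothetical categorical column `city` and a hypothetical numeric target `price`. Each category is replaced by the mean of the target for that category. The helper `target_encode` and the fallback to the global mean for unseen categories are illustrative choices, not the only way to do it.

```python
import pandas as pd

# Hypothetical training data: `city` is the categorical feature, `price` is the target.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Mean of the target per category, learned from the training rows only.
category_means = train.groupby("city")["price"].mean()

# Fallback value for categories that never appeared in training.
global_mean = train["price"].mean()


def target_encode(column: pd.Series) -> pd.Series:
    """Replace each category with its training-set target mean.

    Categories not seen in training are mapped to the global target mean.
    """
    return column.map(category_means).fillna(global_mean)


# Encode the training column: one numeric column, no matter how many categories.
train["city_encoded"] = target_encode(train["city"])

# Encode new data with the statistics learned from training; "D" is unseen.
new_rows = pd.DataFrame({"city": ["A", "D"]})
new_rows["city_encoded"] = target_encode(new_rows["city"])

print(train)
print(new_rows)
```

Unlike one-hot encoding, this keeps a single column per feature. The category means should be computed on training data only and then reused on validation and test data; otherwise the target leaks into the features.
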

Strategic and Technical Consultant, A Cloud change enabler, helping business to meet technology for better ROI.