Encoding categorical data in python

Machine learning models are based on the numerical equations and calculation of numerical variables. But most of the time we have columns in our dataset that is non-numeric such as countries, names, cities and so on. In such condition we need to convert those columns into numeric values which can be used for further processing.
In python we have built in libraries that can help us accomplish these tasks. We will use sklearn module for this example. LabelEncoder is the class that is present in sklearn module for encoding text labels to numerical values. Let’s take an example of simple dataset:
S.N Country Hours Salary House
0 France 34.0 12000.0 No
1 Spain 37.0 49000.0 Yes
2 Germany 20.0 34000.0 No
3 Spain 58.0 41000.0 No
4 Germany 40.0 43333.3 Yes
5 France 45.0 28000.0 Yes
6 Spain 39.8 51000.0 No
7 France 28.0 89000.0 Yes
8 Germany 50.0 53000.0 No
9 France 47.0 33000.0 Yes
Here, we need to encode Country and House columns. Lets start from importing the frequently used modules for data science.
The result is now
array([[0, 0, 34.0, 12000.0],
[1, 2, 37.0, 49000.0],
[2, 1, 20.0, 34000.0],
[3, 2, 58.0, 41000.0],
[4, 1, 40.0, 43333.3],
[5, 0, 45.0, 28000.0],
[6, 2, 39.8, 51000.0],
[7, 0, 28.0, 89000.0],
[8, 1, 50.0, 53000.0],
[9, 0, 47.0, 33000.0]], dtype=object)
If you take a look into the second column now then the second column is encoded into numers 0, 1, 2. Here the value ranges from 0 to the number of classes present. We had 3 distinct classes so we have encoding 0, 1 and 2.
But wait! We have a little problem with this approach. We have the encoded values as 0, 1 and 2. Now machine learning alogrithm may think class 2 is greater than class 1 and class one is greater than class 0 i.e Germany is greater than Spain and Spain is greater than France. But that does not make sense isn’t it? There is no relative order here. It would make more sense if it was something like size of T-shirts like XL, L, M, SM because there is a relative order.
To prevent machine learning algorithm from thinking there is relative order between variables we have a simple approach in python. We introduce dummy variable to prevent this. Instead of having one column representing the column, we want to have (num of classes) columns that represent the value of that category. For example:
This is achieved by using OneHotEncoder class from sklearn:
array([[1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 3.40000e+01,
1.20000e+04],
[0.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 3.70000e+01,
4.90000e+04],
[0.00000e+00, 1.00000e+00, 0.00000e+00, 2.00000e+00, 2.00000e+01,
3.40000e+04],
[0.00000e+00, 0.00000e+00, 1.00000e+00, 3.00000e+00, 5.80000e+01,
4.10000e+04],
[0.00000e+00, 1.00000e+00, 0.00000e+00, 4.00000e+00, 4.00000e+01,
4.33333e+04],
[1.00000e+00, 0.00000e+00, 0.00000e+00, 5.00000e+00, 4.50000e+01,
2.80000e+04],
[0.00000e+00, 0.00000e+00, 1.00000e+00, 6.00000e+00, 3.98000e+01,
5.10000e+04],
[1.00000e+00, 0.00000e+00, 0.00000e+00, 7.00000e+00, 2.80000e+01,
8.90000e+04],
[0.00000e+00, 1.00000e+00, 0.00000e+00, 8.00000e+00, 5.00000e+01,
5.30000e+04],
[1.00000e+00, 0.00000e+00, 0.00000e+00, 9.00000e+00, 4.70000e+01,
3.30000e+04]])
Now we have the final column i.e dependent variable Has House we want to enocode. We don’t need to perform OneHotEncoder in this attribute because machine learning models will know there is no relative order between the attributes.
The output is
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
So this is how label encoding is done in python. If you like the article, don’t forget to clap and follow me on medium.