Encoding categorical data in Python

Rabin Poudyal
3 min read · Jun 13, 2018

Machine learning models are built on numerical equations and calculations over numerical variables. But most of the time our datasets contain non-numeric columns such as countries, names, cities and so on. In that case we need to convert those columns into numeric values that can be used for further processing.

Python has libraries that can help us accomplish these tasks. We will use the sklearn (scikit-learn) module for this example. LabelEncoder is the sklearn class for encoding text labels as numerical values. Let’s take an example of a simple dataset:

S.N  Country  Hours  Salary   House
0    France   34.0   12000.0  No
1    Spain    37.0   49000.0  Yes
2    Germany  20.0   34000.0  No
3    Spain    58.0   41000.0  No
4    Germany  40.0   43333.3  Yes
5    France   45.0   28000.0  Yes
6    Spain    39.8   51000.0  No
7    France   28.0   89000.0  Yes
8    Germany  50.0   53000.0  No
9    France   47.0   33000.0  Yes

Here, we need to encode the Country and House columns. Let’s start by importing the frequently used modules for data science.
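The import and encoding steps might look like the sketch below. It assumes the table above is rebuilt inline as a pandas DataFrame (the original data presumably came from a file that is not shown here), and the names `dataset`, `X`, `y` and `country_encoder` are illustrative choices.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: the table above, rebuilt inline instead of being read from a file.
dataset = pd.DataFrame({
    "S.N": list(range(10)),
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Hours": [34.0, 37.0, 20.0, 58.0, 40.0, 45.0, 39.8, 28.0, 50.0, 47.0],
    "Salary": [12000.0, 49000.0, 34000.0, 41000.0, 43333.3,
               28000.0, 51000.0, 89000.0, 53000.0, 33000.0],
    "House": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# Independent variables: every column except the last one (House).
X = dataset.iloc[:, :-1].values
# Dependent variable: the last column.
y = dataset.iloc[:, -1].values

# Replace the Country strings (column index 1) with integer labels.
country_encoder = LabelEncoder()
X[:, 1] = country_encoder.fit_transform(X[:, 1])

print(X)
```

Because the columns have mixed types, `X` is an object array, which is why the integer labels can be written back into the Country column in place.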

After label-encoding the Country column, the result is:

array([[0, 0, 34.0, 12000.0],
[1, 2, 37.0, 49000.0],
[2, 1, 20.0, 34000.0],
[3, 2, 58.0, 41000.0],
[4, 1, 40.0, 43333.3],
[5, 0, 45.0, 28000.0],
[6, 2, 39.8, 51000.0],
[7, 0, 28.0, 89000.0],
[8, 1, 50.0, 53000.0],
[9, 0, 47.0, 33000.0]], dtype=object)

If you look at the second column now, it has been encoded into the numbers 0, 1 and 2. The values range from 0 to the number of classes minus one. We had 3 distinct classes, so we get the encodings 0, 1 and 2.
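To see which number was assigned to which country, the fitted encoder can be inspected. The snippet below is a small sketch on the Country values alone; `countries`, `encoder` and `codes` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "Germany",
             "France", "Spain", "France", "Germany", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the distinct labels in sorted order;
# the position of a label in classes_ is its numeric code.
for code, label in enumerate(encoder.classes_):
    print(code, label)

# inverse_transform maps codes back to the original labels.
print(encoder.inverse_transform([0, 1, 2]))
```

Since the labels are sorted alphabetically, France gets 0, Germany gets 1 and Spain gets 2.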

But wait! There is a small problem with this approach. The encoded values are 0, 1 and 2, so a machine learning algorithm may think class 2 is greater than class 1 and class 1 is greater than class 0, i.e. that Spain is greater than Germany and Germany is greater than France. That does not make sense, does it? There is no relative order here. It would make more sense for something like T-shirt sizes (XL, L, M, SM), because those do have a relative order.

To prevent the machine learning algorithm from assuming a relative order between categories, there is a simple approach: dummy variables. Instead of a single column holding the category, we create one column per class, and each row has a 1 in the column of its own class and 0 in the others. For example, France becomes (1, 0, 0), Germany becomes (0, 1, 0) and Spain becomes (0, 0, 1).
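The idea can be illustrated by hand before reaching for a library. The following is only a plain-Python sketch on four sample rows, with illustrative names, not the method used for the results in this article.

```python
countries = ["France", "Spain", "Germany", "Spain"]

# One dummy column per distinct category, in sorted order.
categories = sorted(set(countries))

# Each row gets a 1 in the column of its own category and 0 elsewhere.
dummy_rows = [[1 if country == category else 0 for category in categories]
              for country in countries]

print(categories)
for country, row in zip(countries, dummy_rows):
    print(country, row)
```

No row is numerically "larger" than another here: every category is a single 1 in a different position.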

This is achieved by using the OneHotEncoder class from sklearn:
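A minimal sketch of this step is shown below. It assumes the Country values and the numeric columns are available as NumPy arrays (`countries`, `numeric` and `country_dummies` are illustrative names), and it one-hot encodes the Country column separately before stacking it back together with the other columns. It does not rely on the `categorical_features` argument seen in older tutorials, which recent scikit-learn releases no longer accept.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2-D input, so Country is shaped as a single column.
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"]).reshape(-1, 1)
hours = [34.0, 37.0, 20.0, 58.0, 40.0, 45.0, 39.8, 28.0, 50.0, 47.0]
salary = [12000.0, 49000.0, 34000.0, 41000.0, 43333.3,
          28000.0, 51000.0, 89000.0, 53000.0, 33000.0]
# Remaining columns: S.N, Hours, Salary.
numeric = np.column_stack([np.arange(10), hours, salary])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
country_dummies = encoder.fit_transform(countries).toarray()

# Dummy columns first, followed by the untouched numeric columns.
X = np.hstack([country_dummies, numeric])
print(X)
```

The columns come out in the order France, Germany, Spain, S.N, Hours, Salary.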

array([[1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 3.40000e+01, 1.20000e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 3.70000e+01, 4.90000e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 2.00000e+00, 2.00000e+01, 3.40000e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 3.00000e+00, 5.80000e+01, 4.10000e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 4.00000e+00, 4.00000e+01, 4.33333e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 5.00000e+00, 4.50000e+01, 2.80000e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 6.00000e+00, 3.98000e+01, 5.10000e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 7.00000e+00, 2.80000e+01, 8.90000e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 8.00000e+00, 5.00000e+01, 5.30000e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 9.00000e+00, 4.70000e+01, 3.30000e+04]])

Now we have the final column, i.e. the dependent variable House, that we want to encode. We don’t need OneHotEncoder for this attribute: it is the dependent variable, so the model treats its values as class labels rather than as ordered quantities, and with only two classes (Yes and No) a single 0/1 column is enough.
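Encoding the dependent variable could be sketched as follows, with the House column written out as a plain list; `house` and `house_encoder` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

house = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# Labels are sorted before being numbered, so "No" maps to 0 and "Yes" to 1.
house_encoder = LabelEncoder()
y = house_encoder.fit_transform(house)

print(y)
```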

The output is

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

So this is how label encoding is done in Python. If you like the article, don’t forget to clap and follow me on Medium.

Written by Rabin Poudyal

Software Engineer, Data Science Practitioner. Say "Hi!" via email: rabinpoudyal1995@gmail.com or visit my website https://rabinpoudyal.com.np

Responses (3)

There’s a small error in your code at Line 12 (first chunk), it should be “X = dataset.iloc[:, :-1].values” instead of “X = dataset.iloc[:, -1].values”

Thanks for the article.
We are trying to build a model to predict time off and sick days for a large number of people.
We want to use personal information such as location, age, etc. to train the model but at the same time we were thinking to add the…

“There is no relative order here. It would make more sense if it was something like size of T-shirts like XL, L, M, SM because there is a relative order.”

Is there? Is it because more fabric goes to a bigger size?