Encoding categorical data in Python

3 min read · Jun 13, 2018

Machine learning models are built on numerical equations and calculations over numerical variables. But most of the time our datasets contain non-numeric columns such as countries, names, and cities. In that case we need to convert those columns into numeric values that can be used for further processing.

In Python we have libraries that can help us accomplish this task. We will use the sklearn (scikit-learn) module for this example. LabelEncoder is the class in sklearn that encodes text labels as numerical values. Let’s take a simple dataset as an example:

S.N  Country  Hours   Salary  House
  0   France   34.0  12000.0     No
  1    Spain   37.0  49000.0    Yes
  2  Germany   20.0  34000.0     No
  3    Spain   58.0  41000.0     No
  4  Germany   40.0  43333.3    Yes
  5   France   45.0  28000.0    Yes
  6    Spain   39.8  51000.0     No
  7   France   28.0  89000.0    Yes
  8  Germany   50.0  53000.0     No
  9   France   47.0  33000.0    Yes

Here, we need to encode the Country and House columns. Let’s start by importing the modules frequently used for data science.
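The code for this step is not present in this copy of the article, so here is a minimal sketch of what it might look like. It assumes the table above is built inline as a pandas DataFrame (the original probably read it from a file), and the names dataset, X, y and country_encoder are illustrative choices, not necessarily the author's.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: the dataset is built inline instead of being loaded from a file.
dataset = pd.DataFrame({
    "S.N": list(range(10)),
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Hours": [34.0, 37.0, 20.0, 58.0, 40.0, 45.0, 39.8, 28.0, 50.0, 47.0],
    "Salary": [12000.0, 49000.0, 34000.0, 41000.0, 43333.3,
               28000.0, 51000.0, 89000.0, 53000.0, 33000.0],
    "House": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# Features: every column except the last one; target: the last column (House).
X = dataset.iloc[:, :-1].to_numpy(dtype=object, copy=True)
y = dataset.iloc[:, -1].to_numpy()

# Replace the Country strings (column index 1) with integer codes.
country_encoder = LabelEncoder()
X[:, 1] = country_encoder.fit_transform(X[:, 1])

print(X)
```

LabelEncoder assigns codes in sorted order of the labels, so France, Germany and Spain become 0, 1 and 2.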

After label-encoding the Country column, the feature array looks like this:

array([[0, 0, 34.0, 12000.0],
       [1, 2, 37.0, 49000.0],
       [2, 1, 20.0, 34000.0],
       [3, 2, 58.0, 41000.0],
       [4, 1, 40.0, 43333.3],
       [5, 0, 45.0, 28000.0],
       [6, 2, 39.8, 51000.0],
       [7, 0, 28.0, 89000.0],
       [8, 1, 50.0, 53000.0],
       [9, 0, 47.0, 33000.0]], dtype=object)

If you look at the second column now, it has been encoded into the numbers 0, 1 and 2. The values range from 0 to the number of classes minus one. We had 3 distinct classes, so we have the encodings 0, 1 and 2.
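To see which number stands for which country, you can inspect the fitted encoder. The sketch below is illustrative only: it uses a short made-up list of countries, and the variable names are assumptions.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["France", "Spain", "Germany", "Spain"])

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```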

But wait! We have a little problem with this approach. The encoded values are 0, 1 and 2, so a machine learning algorithm may think class 2 is greater than class 1 and class 1 is greater than class 0, i.e. Spain is greater than Germany and Germany is greater than France. But that does not make sense, does it? There is no relative order here. It would make more sense for something like T-shirt sizes (XL, L, M, SM), because those do have a relative order.
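For a column that really is ordered, such as the T-shirt sizes, integer codes are appropriate as long as they follow the real order. One way to sketch this, which is not part of the original article, is sklearn's OrdinalEncoder with an explicit category order. The sample sizes below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of T-shirt sizes, one feature column.
sizes = np.array([["M"], ["XL"], ["SM"], ["L"]], dtype=object)

# Spell out the order explicitly, smallest to largest, so the codes respect it.
size_encoder = OrdinalEncoder(categories=[["SM", "M", "L", "XL"]])
size_codes = size_encoder.fit_transform(sizes)

print(size_codes.ravel())  # SM -> 0, M -> 1, L -> 2, XL -> 3
```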

To prevent the machine learning algorithm from thinking there is a relative order between the categories, we have a simple approach in Python: we introduce dummy variables. Instead of one column holding the category code, we want one column per class, with a 1 in the column for that row's category and 0 elsewhere. For example, France becomes (1, 0, 0), Germany becomes (0, 1, 0) and Spain becomes (0, 0, 1).
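The dummy-variable layout can be previewed quickly with pandas. This is only an illustration of the idea, using pandas.get_dummies rather than the sklearn class the article goes on to use, and the three-row Series is made up.

```python
import pandas as pd

countries = pd.Series(["France", "Spain", "Germany"], name="Country")

# One column per distinct country; exactly one 1 in each row.
dummies = pd.get_dummies(countries, dtype=int)
print(dummies)
```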

This is achieved by using the OneHotEncoder class from sklearn:
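The original code is missing from this copy. It probably relied on an older scikit-learn interface, so treat the following as a sketch for recent versions and not as the author's exact code. It wraps OneHotEncoder in a ColumnTransformer so that only the Country column (index 1) is encoded and the other columns pass through unchanged. The names X, transformer and X_encoded are assumptions. Recent versions of OneHotEncoder accept the country strings directly, so the feature array is rebuilt here with the raw names.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Feature columns from the table: S.N, Country, Hours, Salary.
X = np.array([
    [0, "France", 34.0, 12000.0],
    [1, "Spain", 37.0, 49000.0],
    [2, "Germany", 20.0, 34000.0],
    [3, "Spain", 58.0, 41000.0],
    [4, "Germany", 40.0, 43333.3],
    [5, "France", 45.0, 28000.0],
    [6, "Spain", 39.8, 51000.0],
    [7, "France", 28.0, 89000.0],
    [8, "Germany", 50.0, 53000.0],
    [9, "France", 47.0, 33000.0],
], dtype=object)

# One-hot encode column 1 only; keep the remaining columns as they are.
# sparse_threshold=0 asks for a dense array instead of a sparse matrix.
transformer = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), [1])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = transformer.fit_transform(X).astype(float)

print(X_encoded)
```

The three dummy columns (France, Germany, Spain) come first, followed by S.N, Hours and Salary.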

array([[1.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 3.40000e+01,
        1.20000e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 1.00000e+00, 3.70000e+01,
        4.90000e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 2.00000e+00, 2.00000e+01,
        3.40000e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 3.00000e+00, 5.80000e+01,
        4.10000e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 4.00000e+00, 4.00000e+01,
        4.33333e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 5.00000e+00, 4.50000e+01,
        2.80000e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, 6.00000e+00, 3.98000e+01,
        5.10000e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 7.00000e+00, 2.80000e+01,
        8.90000e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, 8.00000e+00, 5.00000e+01,
        5.30000e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, 9.00000e+00, 4.70000e+01,
        3.30000e+04]])

Now we have the final column, i.e. the dependent variable House, left to encode. We don’t need OneHotEncoder for this attribute: it is the target and has only two classes, so encoding it as 0 and 1 does not introduce a misleading order.
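A minimal sketch of this last step, assuming the House values are held in a plain list named y (the original variable names are not shown in this copy):

```python
from sklearn.preprocessing import LabelEncoder

# House column from the table above.
y = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# "No" sorts before "Yes", so No -> 0 and Yes -> 1.
house_encoder = LabelEncoder()
y_encoded = house_encoder.fit_transform(y)

print(y_encoded)
```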

The output is:

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

So this is how label encoding is done in Python. If you like the article, don’t forget to clap and follow me on Medium.

Written by Rabin Poudyal


