Dealing with missing values during data pre-processing in Python

Rabin Poudyal
2 min read · Jun 12, 2018

Data pre-processing is a crucial step in any data processing task. By a commonly quoted rule of thumb, around eighty percent of the total work in data-related analysis and learning is data munging, i.e. bringing the data into a form that makes sense for further analysis.

Python has excellent libraries made for data cleaning and pre-processing. In this tutorial we are going to look at the well-known scikit-learn library (imported as sklearn). Let’s take an example of a relatively small dataset that contains salary data for people from a few countries:

S.N  Country  Hours  Salary   House
0    France   34.0   12000.0  No
1    Spain    37.0   49000.0  Yes
2    Germany  20.0   34000.0  No
3    Spain    58.0   41000.0  No
4    Germany  40.0   NaN      Yes
5    France   45.0   28000.0  Yes
6    Spain    NaN    51000.0  No
7    France   28.0   89000.0  Yes
8    Germany  50.0   53000.0  No
9    France   47.0   33000.0  Yes

We can see that we have missing values in the Hours (hours worked) and Salary columns. One way to fix this is to remove the rows that contain missing data. But this is not a good approach when there are many missing values, because we would lose a lot of data. Instead, we are going to replace each missing value with the mean of its column, which makes more sense here and keeps every row.
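To make the trade-off concrete, here is a minimal sketch that compares the two options on the sample table using pandas. The DataFrame is rebuilt inline from the table above, and the variable names (df, dropped, rows_lost and so on) are only illustrative.

```python
import numpy as np
import pandas as pd

# The sample table from above, rebuilt inline (np.nan marks a missing value).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Hours": [34, 37, 20, 58, 40, 45, np.nan, 28, 50, 47],
    "Salary": [12000, 49000, 34000, 41000, np.nan,
               28000, 51000, 89000, 53000, 33000],
    "House": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()
rows_lost = len(df) - len(dropped)

# Option 2: keep every row and use the column mean as the fill value.
# Series.mean() skips NaN entries by default.
hours_mean = df["Hours"].mean()
salary_mean = df["Salary"].mean()

print(f"rows lost by dropping: {rows_lost} of {len(df)}")
print(f"mean Hours: {hours_mean:.2f}, mean Salary: {salary_mean:.2f}")
```

Even on this tiny table, dropping rows throws away two of the ten records, while the mean-based approach only has to fill in two individual cells.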

Let’s start by importing some common libraries required for data cleaning:
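A minimal sketch of the imports and the loading step could look like the following. The article does not name a data file, so the table is inlined here as CSV text and read through io.StringIO; that choice, and the names csv_text and dataset, are assumptions made for illustration. SimpleImputer from sklearn.impute is the imputer class in current scikit-learn releases.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # used in the next step

# Assumption: the sample table stored as CSV text. Empty fields are
# parsed by pandas as NaN, which is how missing values show up.
csv_text = """Country,Hours,Salary,House
France,34,12000,No
Spain,37,49000,Yes
Germany,20,34000,No
Spain,58,41000,No
Germany,40,,Yes
France,45,28000,Yes
Spain,,51000,No
France,28,89000,Yes
Germany,50,53000,No
France,47,33000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Count the missing values per column before cleaning.
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object.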

Now let’s create the imputer object and define how it should replace missing values.
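As a hedged sketch of this step, the snippet below uses SimpleImputer from sklearn.impute with the mean strategy on the two numeric columns. The array X is rebuilt inline from the sample table so the snippet runs on its own, and the variable names are illustrative rather than taken from the original post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hours and Salary columns from the sample table; np.nan marks the gaps.
X = np.array([
    [34, 12000], [37, 49000], [20, 34000], [58, 41000], [40, np.nan],
    [45, 28000], [np.nan, 51000], [28, 89000], [50, 53000], [47, 33000],
], dtype=float)

# Tell the imputer what counts as missing and how to replace it.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit() learns one mean per column (ignoring NaN);
# transform() writes that mean into the missing cells.
X_imputed = imputer.fit_transform(X)

print(imputer.statistics_)  # the per-column means learned by fit()
print(X_imputed[[4, 6]])    # the two rows that originally had gaps
```

Calling fit and transform separately works the same way and is useful when the means learned on one dataset should be reused to fill another.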

The result of the above is:

S.N  Country  Hours  Salary   House
0    France   34.0   12000.0  No
1    Spain    37.0   49000.0  Yes
2    Germany  20.0   34000.0  No
3    Spain    58.0   41000.0  No
4    Germany  40.0   43333.3  Yes
5    France   45.0   28000.0  Yes
6    Spain    39.9   51000.0  No
7    France   28.0   89000.0  Yes
8    Germany  50.0   53000.0  No
9    France   47.0   33000.0  Yes

Some of the strategies you can use are: mean, median and most_frequent. The last one is especially useful when you are dealing with non-numerical columns.
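For a non-numerical column, a minimal sketch with the most_frequent strategy might look like this. The countries array is a hypothetical variant of the Country column with one entry removed, since the sample table itself has no missing country.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical Country column with one missing entry (np.nan).
# dtype=object lets strings and NaN live in the same array.
countries = np.array(
    [["France"], ["Spain"], ["Germany"], ["Spain"], [np.nan], ["Spain"]],
    dtype=object,
)

# most_frequent fills each gap with the most common value of the column.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(countries)

print(filled.ravel())
```

Here "Spain" appears three times, so it is the value used for the missing entry. The mean and median strategies only make sense for numeric columns.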

When you print the values now, you can see that the missing entries have been replaced with the mean of their column.

This is how missing values are taken care of in Python. If you liked the article, don’t forget to clap and follow me.

Written by Rabin Poudyal

Software Engineer, Data Science Practitioner. Say "Hi!" via email: rabinpoudyal1995@gmail.com or visit my website https://rabinpoudyal.com.np
