Feature Scaling in Python

When we pre-process a dataset for machine learning, we often need to scale the features so that they are all on the same scale. Let's take a small dataset as an example:
| S.N | Country | Hours | Salary  | House |
|-----|---------|-------|---------|-------|
| 0   | France  | 34.0  | 12000.0 | No    |
| 1   | Spain   | 37.0  | 49000.0 | Yes   |
| 2   | Germany | 20.0  | 34000.0 | No    |
| 3   | Spain   | 58.0  | 41000.0 | No    |
| 4   | Germany | 40.0  | 43333.3 | Yes   |
| 5   | France  | 45.0  | 28000.0 | Yes   |
| 6   | Spain   | 39.8  | 51000.0 | No    |
| 7   | France  | 28.0  | 89000.0 | Yes   |
| 8   | Germany | 50.0  | 53000.0 | No    |
| 9   | France  | 47.0  | 33000.0 | Yes   |
Here, the Hours (hours worked by an employee) and Salary columns are on very different scales: one is below 100 while the other is in the tens of thousands. This difference makes the calculations in machine learning harder, because the parameters tied to the small-scale feature may converge quickly while the parameters tied to the large-scale feature converge slowly.
There are two main reasons why we need feature scaling:
- Most machine learning algorithms are based on Euclidean distance, and if we don't perform feature scaling, one feature may dominate another. In the dataset above, the Hours column ranges from 20 to 58 and the Salary column ranges from 12000 to 89000. They are not on the same scale, so if we calculate the Euclidean distance between rows 0 and 3, (58 − 34)² is a very small number compared to (41000 − 12000)², and the Salary column ends up dominating the distance (see the short sketch after this list).
- Even for algorithms that are not based on Euclidean distance, feature scaling helps them converge faster, which is a huge performance benefit on large datasets.
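
To make the first point concrete, here is a tiny NumPy sketch using the values from rows 0 and 3 of the table above:

```python
import numpy as np

# Hours and Salary for rows 0 and 3 of the table above
row_0 = np.array([34.0, 12000.0])
row_3 = np.array([58.0, 41000.0])

# Squared differences on the raw scale: the Salary term dwarfs the Hours term
print((row_3 - row_0) ** 2)                    # ~[5.76e+02  8.41e+08]

# The Euclidean distance is therefore driven almost entirely by Salary
print(np.sqrt(((row_3 - row_0) ** 2).sum()))   # ~29000.01
```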
Feature Scaling Techniques
1. Standardization
In this technique, we replace each value with its z-score.
z=(x−μ)/σ
After standardization, every feature is rescaled to have a mean of μ = 0 and a standard deviation of σ = 1 (the shape of the distribution itself does not change).
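
As an illustration, standardization can be done by hand with NumPy or with scikit-learn's StandardScaler. This is just a sketch using the Hours and Salary columns from the table; the variable names are only for this example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hours and Salary columns from the table above
X = np.array([[34.0, 12000.0], [37.0, 49000.0], [20.0, 34000.0],
              [58.0, 41000.0], [40.0, 43333.3], [45.0, 28000.0],
              [39.8, 51000.0], [28.0, 89000.0], [50.0, 53000.0],
              [47.0, 33000.0]])

# z = (x - mean) / std, computed column-wise
X_std_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation with scikit-learn
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]
```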
2. Mean Normalization
x' = (x − mean(x)) / (max(x) − min(x))
This normalization rescales each feature to lie between −1 and 1 with a mean of μ = 0.
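
scikit-learn has no dedicated transformer for this exact formula, but it is easy to write with NumPy. A sketch, using the same Hours/Salary matrix as in the previous snippet:

```python
import numpy as np

# Same Hours and Salary matrix as in the previous snippet
X = np.array([[34.0, 12000.0], [37.0, 49000.0], [20.0, 34000.0],
              [58.0, 41000.0], [40.0, 43333.3], [45.0, 28000.0],
              [39.8, 51000.0], [28.0, 89000.0], [50.0, 53000.0],
              [47.0, 33000.0]])

# Mean normalization: (x - mean(x)) / (max(x) - min(x)), column-wise
X_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_norm.mean(axis=0))                     # ~[0, 0]
print(X_norm.min(axis=0), X_norm.max(axis=0))  # every value lies within [-1, 1]
```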
3. Min-Max Scaling
x' = (x − min(x)) / (max(x) − min(x))
This scaling brings every value between 0 and 1. It is used when features need to fit within a specific range, e.g. in image processing, where pixel intensities lie between 0 and 255 and are often rescaled to [0, 1].
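
With scikit-learn this corresponds to MinMaxScaler, whose feature_range defaults to (0, 1). A sketch with the same Hours/Salary matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same Hours and Salary matrix as above
X = np.array([[34.0, 12000.0], [37.0, 49000.0], [20.0, 34000.0],
              [58.0, 41000.0], [40.0, 43333.3], [45.0, 28000.0],
              [39.8, 51000.0], [28.0, 89000.0], [50.0, 53000.0],
              [47.0, 33000.0]])

# (x - min(x)) / (max(x) - min(x)), column-wise; every value ends up in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_minmax.min(axis=0))  # [0. 0.]
print(X_minmax.max(axis=0))  # [1. 1.]
```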
When to Scale
- When preparing a dataset for Euclidean distance-based algorithms like k-nearest neighbors.
- When performing PCA (Principal Component Analysis) for dimensionality reduction.
- To speed up gradient descent.
- Tree-based models can handle features with different ranges of values, so we don't need scaling for them.
Now let's see how we can do this in Python code.
We assume that X_train and X_test have already been extracted from the dataset above, for example via a train/test split of the Hours and Salary columns.
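
A minimal sketch using scikit-learn's StandardScaler (one reasonable choice; MinMaxScaler can be swapped in the same way). The scaler is fit on the training set only, and the same transformation is then applied to the test set so that no test-set statistics leak into training:

```python
from sklearn.preprocessing import StandardScaler

sc_X = StandardScaler()

# Fit on the training data, then apply the same transformation to the test data
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
```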
If you like the post, don't forget to follow me on Medium and clap for it.