Train / Test split and cross-validation in Python

Rabin Poudyal
4 min read · Jun 15, 2018

In machine learning, before running any algorithm on our dataset we need to divide it into two sets: a training set and a test set. Imagine that we have trained a machine learning model; we now need some way to measure how well it performs on new data that it has not seen before. So we need two kinds of data: one to build the model and another to test its performance. The performance on the test set should not be very different from the performance on the training set. That indicates our model is doing well and generalizing from the examples in our dataset rather than rote-learning them.

Imagine a student preparing for an exam. He practices all the questions from past papers and then goes to the exam hall. When the examiner hands him the question paper, he sees that none of the questions come from previous exams. Now the student cannot solve the questions, and even where he can, his grades will be low. Had all the questions been taken from previous exams, he would have scored the highest marks in his class.

Overfitting

The condition in which a learning algorithm tries to remember every example from the training set (rote learning) is known as overfitting. This happens when our model has too many features or is too complex.

Underfitting

The condition in which a learning algorithm cannot properly learn the correlations between the attributes of the dataset is known as underfitting. The model misses the trends or patterns in the data and cannot fit even the training set well.
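To make the two failure modes concrete, here is a minimal sketch in NumPy. The sine curve, the noise level, and the polynomial degrees are illustrative choices of mine; the point is only that a too-simple model misses the pattern, while a too-complex one memorizes the noisy samples:

import numpy as np

rng = np.random.default_rng(0)

# 15 noisy samples of a smooth underlying curve.
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Fresh points from the same curve, to check generalization.
x_new = np.linspace(0, 1, 200)
y_new = np.sin(2 * np.pi * x_new)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    gen_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: error on samples {fit_err:.3f}, "
          f"on fresh points {gen_err:.3f}")

# Typically: degree 1 underfits (high error everywhere), while
# degree 9 drives the error on its own samples near zero yet does
# worse on points it has never seen.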

Train/test splitting of the dataset helps us detect and guard against these two diseases of machine learning algorithms.

If we train our model on some data and later measure its performance on that same data, the result is known as training accuracy. If we try to maximize training accuracy, we can end up with a complex model that overfits our training data.

Train/Test Splitting

If we train our model on one dataset (the training set) and measure its performance on another dataset (the test set), the resulting measure is known as testing accuracy. This is a better estimate of real-world performance than training accuracy.

Training error is also known as in-sample error (ISE), and testing error is also known as out-of-sample error (OSE).
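To see the gap between the two in code, here is a minimal sketch. Scikit-learn's built-in breast cancer dataset and the unconstrained decision tree are illustrative choices of mine; the tree can memorize the training set, so its in-sample accuracy is perfect while its out-of-sample accuracy is usually noticeably lower:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree keeps splitting until it fits the training
# data perfectly, i.e. it rote-learns the examples.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Training accuracy (ISE):", model.score(X_train, y_train))  # typically 1.0
print("Testing accuracy (OSE):", model.score(X_test, y_test))     # typically ~0.9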

We generally divide our dataset into train and test sets. The training set contains well-labeled examples and is used to build the model. The model learns from those labeled examples and generalizes, learning the correlations between the records in the dataset.

Generally, we split our dataset according to the 80/20 rule, i.e. 80% of the dataset goes into the training set and 20% into the test set. We are going to use the train_test_split method of the sklearn library to perform this task in Python.

Consider a dataset of the following form:

S.N  Country  Hours  Salary   House
0    France   34.0   12000.0  No
1    Spain    37.0   49000.0  Yes
2    Germany  20.0   34000.0  No
3    Spain    58.0   41000.0  No
4    Germany  40.0   43333.3  Yes
5    France   45.0   28000.0  Yes
6    Spain    39.8   51000.0  No
7    France   28.0   89000.0  Yes
8    Germany  50.0   53000.0  No
9    France   47.0   33000.0  Yes

Now rows 0–7 (80%) must go into the training set and rows 8 and 9 (20%) into the test set. Let's write the code to achieve this.
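Here is a minimal sketch of that split. It rebuilds the table above as a pandas DataFrame; note that shuffle=False is my choice here, used only to reproduce the exact 0–7 / 8–9 split described above. In practice you normally keep the default shuffling (with a random_state for reproducibility):

import pandas as pd
from sklearn.model_selection import train_test_split

# The example dataset from the table above.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Hours": [34.0, 37.0, 20.0, 58.0, 40.0, 45.0, 39.8, 28.0, 50.0, 47.0],
    "Salary": [12000.0, 49000.0, 34000.0, 41000.0, 43333.3,
               28000.0, 51000.0, 89000.0, 53000.0, 33000.0],
    "House": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

X = data[["Country", "Hours", "Salary"]]  # features
y = data["House"]                         # label we want to predict

# shuffle=False keeps the original row order: rows 0-7 go to the
# training set, rows 8 and 9 to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)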

Drawback of Train/Test split

It provides a high-variance estimate, since changing which observations happen to fall in the test set can significantly change the testing accuracy.

Now you may ask: what if we made a bunch of different train/test splits, calculated the testing accuracy of each, and averaged the results together? That is where cross-validation comes into play. The most common type of cross-validation is k-fold cross-validation.

K-fold cross-validation

In this process, we split the dataset into K equal partitions, or folds. We use one fold as the test set and the union of the remaining folds as the training set, then calculate the testing accuracy of our model. We repeat this K times (once per fold), choosing a different fold as the test set in each iteration and calculating the testing error each time. Finally, we use the average testing accuracy as our estimate.

E.g. if we have 150 rows in our dataset and we use 5 folds, then we have 150/5 = 30 rows in each fold (say fold1, fold2, fold3, fold4 and fold5), and we iterate 5 times. In the first iteration, the test set is fold1 and the remaining folds form the training set; we calculate the testing error and call it error1. In the next iteration, the test set is fold2 and the others form the training set, giving error2. We repeat this 5 times. The cross-validated testing error is then:

testing error = ( error1 + … + error5 ) / 5

Now let’s dive into code
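The snippet below is a minimal sketch of K-fold cross-validation with scikit-learn. The iris dataset (150 rows, matching the example above) and the KNeighborsClassifier model are assumptions of mine; any estimator and dataset would work the same way:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 150 rows / 5 folds = 30 rows per fold, as in the example above.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)

# Each of the 5 scores is the testing accuracy of one iteration:
# one fold held out, the other four folds used for training.
scores = cross_val_score(model, X, y, cv=kf)

print("Per-fold testing accuracy:", scores)
print("Cross-validated estimate:", scores.mean())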

Comparison of test/train and K-fold

Advantages of test/train split:

  1. It runs K times faster than K-fold cross-validation,
  2. It is simpler than K-fold, so it is easier to examine the testing errors

Advantages of cross-validation:

  1. It is a better estimator of out-of-sample accuracy,
  2. It makes more efficient use of the data, since every observation is used for both training and testing

If you like the article, don’t forget to clap and follow me.

Written by Rabin Poudyal

Software Engineer, Data Science Practitioner. Say "Hi!" via email: rabinpoudyal1995@gmail.com or visit my website https://rabinpoudyal.com.np
