Train / Test split and cross-validation in Python

Rabin Poudyal
4 min read · Jun 15, 2018

In machine learning, before running any algorithm on our dataset we need to divide it into two sets: a training set and a test set. Imagine that we have trained a machine learning model; we now need some way to measure how well it performs on new data that it has not seen before. So we need two kinds of data: one to build the model and another to test its performance. The performance on the test set should not be very different from the performance on the training set. That indicates our model is doing well and generalizing from the examples in our dataset rather than rote-learning them.

Imagine a student preparing for an exam. He practices all the questions from past papers and then goes to the exam hall. When the examiner hands him the question paper, he sees that none of the questions come from previous exams. Now the student cannot solve the questions, and even where he can, his grades will be low. Had all the questions been taken from previous exams, he would have scored the highest marks in his class.

Overfitting

The condition in which a learning algorithm tries to remember every example from the training set (rote learning) is known as overfitting. This happens when our model has too many features or is too complex.

Underfitting

The condition in which a learning algorithm cannot properly learn the correlations between the attributes of the dataset is known as underfitting. The model misses the trends or patterns in the data and cannot fit even the training set well.
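To make the two failure modes concrete, here is a minimal sketch in NumPy. The sine curve, the noise level, and the polynomial degrees are illustrative choices of mine; the point is only that a too-simple model misses the pattern, while a too-complex one memorizes the noisy samples:

import numpy as np

rng = np.random.default_rng(0)

# 15 noisy samples of a smooth underlying curve.
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Fresh points from the same curve, to check generalization.
x_new = np.linspace(0, 1, 200)
y_new = np.sin(2 * np.pi * x_new)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    gen_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: error on samples {fit_err:.3f}, "
          f"on fresh points {gen_err:.3f}")

# Typically: degree 1 underfits (high error everywhere), while
# degree 9 drives the error on its own samples near zero yet does
# worse on points it has never seen.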

Train/test splitting of the dataset helps us detect and guard against these two diseases of machine learning algorithms.

If we train our model on some data and later measure its performance on that same data, the result is known as training accuracy. If we try to maximize training accuracy, we can end up with a complex model that overfits our training data.

Train/Test Splitting

If we train our model on one dataset (the training set) and measure its performance on another dataset (the test set), the resulting measure is known as testing accuracy. This is a better estimate of real-world performance than training accuracy.

Training error is also known as in-sample error (ISE), and testing error is also known as out-of-sample error (OSE).
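To see the gap between the two in code, here is a minimal sketch. Scikit-learn's built-in breast cancer dataset and the unconstrained decision tree are illustrative choices of mine; the tree can memorize the training set, so its in-sample accuracy is perfect while its out-of-sample accuracy is usually noticeably lower:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree keeps splitting until it fits the training
# data perfectly, i.e. it rote-learns the examples.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Training accuracy (ISE):", model.score(X_train, y_train))  # typically 1.0
print("Testing accuracy (OSE):", model.score(X_test, y_test))     # typically ~0.9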

We generally divide our dataset into train and test sets. The training set contains well-labeled examples and is used to build the model. The model learns from those labeled examples and generalizes, learning the correlations between the records in the dataset.

Generally, we split our dataset according to the 80/20 rule, i.e. 80% of the dataset goes into the training set and 20% into the test set. We are going to use the train_test_split method of the sklearn library to perform this task in Python.

Consider a dataset of the following form:

S.N  Country  Hours  Salary   House
0    France   34.0   12000.0  No
1    Spain    37.0   49000.0  Yes
2    Germany  20.0   34000.0  No
3    Spain    58.0   41000.0  No
4    Germany  40.0   43333.3  Yes
5    France   45.0   28000.0  Yes
6    Spain    39.8   51000.0  No
7    France   28.0   89000.0  Yes
8    Germany  50.0   53000.0  No
9    France   47.0   33000.0  Yes

Now rows 0–7 (80%) must go into the training set and rows 8 and 9 (20%) into the test set. Let's write the code to achieve this.
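Here is a minimal sketch of that split. It rebuilds the table above as a pandas DataFrame; note that shuffle=False is my choice here, used only to reproduce the exact 0–7 / 8–9 split described above. In practice you normally keep the default shuffling (with a random_state for reproducibility):

import pandas as pd
from sklearn.model_selection import train_test_split

# The example dataset from the table above.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Hours": [34.0, 37.0, 20.0, 58.0, 40.0, 45.0, 39.8, 28.0, 50.0, 47.0],
    "Salary": [12000.0, 49000.0, 34000.0, 41000.0, 43333.3,
               28000.0, 51000.0, 89000.0, 53000.0, 33000.0],
    "House": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

X = data[["Country", "Hours", "Salary"]]  # features
y = data["House"]                         # label we want to predict

# shuffle=False keeps the original row order: rows 0-7 go to the
# training set, rows 8 and 9 to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)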

Drawback of Train/Test split

It provides a high-variance estimate, since changing which observations happen to fall in the test set can significantly change the testing accuracy.

Now you may ask: what if we made a bunch of different train/test splits, calculated the testing accuracy of each, and averaged the results together? That is where cross-validation comes into play. The most common type of cross-validation is k-fold cross-validation.

K-fold cross-validation

In this process, we split the dataset into K equal partitions, or folds. We use one fold as the test set and the union of the remaining folds as the training set, then calculate the testing accuracy of our model. We repeat this K times (once per fold), choosing a different fold as the test set in each iteration and calculating the testing error each time. Finally, we use the average testing accuracy as our estimate.

E.g. if we have 150 rows in our dataset and we use 5 folds, then we have 150/5 = 30 rows in each fold (say fold1, fold2, fold3, fold4 and fold5), and we iterate 5 times. In the first iteration, the test set is fold1 and the remaining folds form the training set; we calculate the testing error and call it error1. In the next iteration, the test set is fold2 and the others form the training set, giving error2. We repeat this 5 times. The cross-validated testing error is then:

testing error = ( error1 + … + error5 ) / 5

Now let’s dive into code
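The snippet below is a minimal sketch of K-fold cross-validation with scikit-learn. The iris dataset (150 rows, matching the example above) and the KNeighborsClassifier model are assumptions of mine; any estimator and dataset would work the same way:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 150 rows / 5 folds = 30 rows per fold, as in the example above.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)

# Each of the 5 scores is the testing accuracy of one iteration:
# one fold held out, the other four folds used for training.
scores = cross_val_score(model, X, y, cv=kf)

print("Per-fold testing accuracy:", scores)
print("Cross-validated estimate:", scores.mean())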

Comparison of test/train and K-fold

Advantages of test/train split:

  1. It runs K times faster than K-fold cross-validation,
  2. It is simpler than K-fold, so it is easier to examine the testing errors

Advantages of cross-validation:

  1. It is a better estimator of out-of-sample accuracy,
  2. It makes more efficient use of the data, since every observation is used for both training and testing

If you like the article, don’t forget to clap and follow me.

Written by Rabin Poudyal

Software Engineer, Data Science Practitioner. Say "Hi!" via email: rabinpoudyal1995@gmail.com or visit my website https://rabinpoudyal.com.np
