Model Selection Process in Machine Learning

Jeswanth Mukesh
3 min read · Sep 12, 2022

Introduction

Model selection is the process of choosing the best-performing machine learning model for a specific problem. It relies on splitting the dataset into train, validation, and test sets.

Model Selection Process video from ML-ZoomCamp by Alexey Grigorev

We split the data into train and test datasets. The reason for splitting is to verify that the model also performs well on data that was not used for training.
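A basic train/test split can be sketched in plain Python. This is only an illustration: the function name `train_test_split_simple`, the 20% test fraction, and the fixed seed are assumptions made for the example, not something prescribed by the course. In practice a library routine such as scikit-learn's `train_test_split` is normally used.

```python
import random

def train_test_split_simple(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))  # stand-in for 100 records
train, test = train_test_split_simple(data)
print(len(train), len(test))
```

Shuffling before splitting keeps any ordering in the original data from leaking into one of the two sets.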

If you already know which ML algorithm is best for your problem, you can split the data into just training and test sets.

If you don't know which ML algorithm is best for your problem, split the data into three different sets:

  • Training Set
  • Validation Set
  • Test Set

The model selection process then follows these steps:

  1. Split the dataset into training, validation, and test sets.
  2. Train the models
  3. Evaluate the models
  4. Select the best model
  5. Apply the best model to the test dataset
  6. Compare the performance metrics on the validation and test sets
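The six steps can be sketched end to end in plain Python. Everything here is an assumption made for illustration: the synthetic one-feature dataset, the two toy candidates (`fit_majority` and `fit_threshold`), and the 60/20/20 split sizes. Real projects would typically compare library models instead.

```python
import random

def make_data(n=500, seed=0):
    """Synthetic binary task: label is 1 when a noisy copy of x exceeds 5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        label = 1 if x + rng.gauss(0, 1) > 5 else 0
        rows.append((x, label))
    return rows

def fit_majority(rows):
    """Baseline: always predict the most common training label."""
    ones = sum(label for _, label in rows)
    majority = 1 if ones * 2 >= len(rows) else 0
    return lambda x: majority

def fit_threshold(rows):
    """Predict 1 when x is above the midpoint between the two class means."""
    xs0 = [x for x, label in rows if label == 0]
    xs1 = [x for x, label in rows if label == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)

# 1. Split into training, validation, and test sets (60/20/20).
#    The rows are generated in random order, so slicing is enough here.
rows = make_data()
train, val, test = rows[:300], rows[300:400], rows[400:]

# 2. Train every candidate on the training set only.
candidates = {"majority_baseline": fit_majority, "threshold_rule": fit_threshold}
models = {name: fit(train) for name, fit in candidates.items()}

# 3. Evaluate every candidate on the validation set.
val_scores = {name: accuracy(model, val) for name, model in models.items()}

# 4. Select the candidate with the highest validation accuracy.
best_name = max(val_scores, key=val_scores.get)

# 5. Apply the selected model to the test set.
test_score = accuracy(models[best_name], test)

# 6. Compare validation and test accuracy; a large gap is a warning sign.
print(best_name, val_scores[best_name], test_score)
```

If the validation and test accuracies are close, the validation score was a reasonable estimate of how the model behaves on unseen data.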

Training Set:

The training dataset is used to train the machine learning model to make predictions.

Use at least 60% of the data for training. This is the data the algorithm uses to find its optimal weights and bias values.

In general, more training data gives a more accurate model.

Validation Set:

The validation set is used to compare the candidate models and identify the best-performing one. It measures each model's accuracy on data that was not used for training.

If you have already split the data 80:20 into training and test sets, take another 20% of the full dataset (a quarter of the training portion) out of the training set to form the validation set. This leaves a 60:20:20 split.
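The arithmetic behind the 60:20:20 split can be checked with a short sketch. The helper name `split_three_way` and its default fractions are assumptions for this example. Both fractions are expressed relative to the full dataset, so a 20% validation set equals 25% of the 80% training portion.

```python
import random

def split_three_way(rows, val_fraction=0.2, test_fraction=0.2, seed=1):
    """Return (train, val, test); fractions are relative to the full dataset."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_total = len(shuffled)
    n_test = round(n_total * test_fraction)
    n_val = round(n_total * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

data = list(range(100))  # stand-in for 100 records
train, val, test = split_three_way(data)
print(len(train), len(val), len(test))

# The validation set is 20% of all rows but 25% of the original 80% training portion.
share_of_training_portion = len(val) / (len(train) + len(val))
print(share_of_training_portion)
```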

Train the machine learning models on the 60% training data, and measure their accuracy on the validation set.

By comparing the accuracy of the machine learning models, we can choose the algorithm that performs best on the validation set.

After choosing the best model, combine the validation set with the training set and retrain the model on the larger dataset, which usually improves its accuracy.
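Retraining on the combined data might look like the sketch below. It assumes the winning model is a simple threshold rule on one feature and uses synthetic data; both are stand-ins chosen to keep the example self-contained.

```python
import random

def fit_threshold(rows):
    """Predict 1 when x is above the midpoint between the two class means."""
    xs0 = [x for x, label in rows if label == 0]
    xs1 = [x for x, label in rows if label == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)

# Synthetic data: label is 1 when a noisy copy of x exceeds 5.
rng = random.Random(7)
rows = []
for _ in range(500):
    x = rng.uniform(0, 10)
    rows.append((x, 1 if x + rng.gauss(0, 1) > 5 else 0))
train, val, test = rows[:300], rows[300:400], rows[400:]

# Merge the validation rows back into the training data and refit.
full_train = train + val
final_model = fit_threshold(full_train)

# The test set stays untouched until this final check.
test_accuracy = accuracy(final_model, test)
print(len(full_train), test_accuracy)
```

The test rows are never part of `full_train`, so the final accuracy is still measured on data the model has not seen.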

Test set:

The test set is the subset of the dataset that you use to test your model after it has gone through initial vetting on the validation set.

Finally, measure the model's accuracy on the test set and compare it with the accuracy on the validation set.

Conclusion:

We chose the best-performing machine learning model by evaluating the performance of multiple machine learning algorithms on our data. In most cases, we can't select the best model without evaluation, so we train multiple models and pick one.

Jeswanth Mukesh

Data Science & ML Enthusiast 📈 | Student 🎓| Python developer 🐍 | Python Programming Intern at @facetagr | @kaggle