How to Avoid Overfitting in Machine Learning Models

Overfitting is a common mistake among machine learning engineers, particularly beginners. Unfortunately, it can completely ruin your machine learning model, producing incorrect outputs and leading to poor decisions. 

What is Overfitting in Machine Learning?

Overfitting occurs when a statistical model fits too precisely against its training data. It is a modeling error in which a function fits a limited set of data points too closely. The model focuses on the examples it has already seen and assumes the patterns it finds there will hold on unseen data; as a result, it fails to predict future observations accurately, defeating its purpose. 

When machine learning algorithms are built, they use a sample dataset to train the model. If the model is too complex or trains on the sample dataset for too long, it starts to learn and memorize irrelevant information. This leads to overfitting as the model fits too closely to the training set and cannot generalize well to new data. If a model fails to generalize well to new data, it cannot perform the task. 

This produces a low error on the training data but a high error on the test data. A low training error rate combined with high variance on unseen data is a good indication of overfitting. 

What is Underfitting in Machine Learning?

Underfitting is when a model cannot accurately identify the relationship between the input and output variables. This can occur when a model is too simple; therefore, adding more input features or using high-variance models such as Decision Trees can reduce underfitting. 

It is a modeling error in which the model neither fits the training data nor generalizes to new data. This generates a high error rate on both the training set and unseen data, meaning the model fails to identify the dominant trend and performs poorly. 
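
As a minimal sketch of reducing underfitting by adding input features (assuming scikit-learn; the dataset below is synthetic and purely illustrative), a plain linear model struggles on a quadratic trend until the inputs are expanded:

    # A sketch of reducing underfitting by adding input features.
    # The data here is synthetic and purely illustrative.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=(200, 1))
    y = x[:, 0] ** 2 + rng.normal(scale=0.3, size=200)  # quadratic signal plus noise

    underfit = LinearRegression().fit(x, y)               # too simple for this data
    x_poly = PolynomialFeatures(degree=2).fit_transform(x)
    richer = LinearRegression().fit(x_poly, y)            # same model, more features

    print("Linear R^2:         ", round(underfit.score(x, y), 2))
    print("With added features:", round(richer.score(x_poly, y), 2))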

Signal vs. Noise

A Signal is the true underlying pattern that helps the model learn from the data, whereas Noise is random, irrelevant variation in the dataset. A good machine learning algorithm can differentiate between Signal and Noise. If the model is too complex or poorly constrained, it can learn the noise as well, leading to overfitting. 

For example, suppose we wanted to model age vs. height in young teenagers. Using a sample drawn from a large portion of the population, there will be a clear relationship between the two variables. This is the Signal. 

However, if you could only sample from one particular school, the relationship between the variables would be far less reliable.

For example, sampling a school well known for its sports facilities will cause outliers as there will be a higher percentage of pupils selected to attend the school due to their physical attributes, such as height in basketball. You will also encounter randomness, such as young teens who hit puberty at different ages. This is how noise interferes with Signal.

How to detect overfitting?

It is practically impossible to identify overfitting before testing the model on unseen data. Therefore, the data should be separated into different subsets: training and testing. 

The training set typically represents about 80% of the available data and is used to train the model, while the test set represents the remaining 20% and is used to measure accuracy on data the model has not interacted with before. For example, suppose we get an accuracy of over 90% on the training set but only 50% on the test set. In that case, we can reasonably assume there is an issue and the model is likely overfitting. 
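
As a minimal sketch of this check (assuming scikit-learn; the dataset here is a synthetic stand-in), comparing training and test accuracy only takes a few lines:

    # A sketch of an 80/20 split and a train-vs-test accuracy comparison.
    # Assumes scikit-learn; the dataset is a synthetic stand-in.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # An unconstrained decision tree will often memorize the training set.
    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    print("Train accuracy:", model.score(X_train, y_train))
    print("Test accuracy: ", model.score(X_test, y_test))
    # A large gap between the two scores is a strong hint of overfitting.

The later sketches in this article reuse the X, y, X_train, X_test, y_train, and y_test variables created here.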

Another way is using Occam’s Razor principle, starting with a straightforward model and using that as a benchmark. Then, you can use the simple model as a reference point when you build more complex algorithms to see if the additional complexity of the model is worth it. 
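As a small sketch of that idea (assuming scikit-learn and the train/test split created above), a simple baseline gives you a reference score that any more complex model has to beat:

    # A sketch of the Occam's Razor benchmark: score a simple model first,
    # then only accept a more complex one if it clearly beats this reference.
    # Assumes X_train, X_test, y_train, y_test from the earlier split.
    from sklearn.linear_model import LogisticRegression

    baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Baseline test accuracy:", baseline.score(X_test, y_test))
    # If a deep tree or large network barely improves on this score,
    # the extra complexity is probably not worth it.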

Techniques to Prevent Overfitting

Simple linear models are less prone to overfitting; however, the majority of real-world problems are non-linear. Below are techniques you can use to prevent your model from overfitting. 

Early stopping 

While the model is training, you can measure its performance after each iteration. Pausing the training before the model starts to learn the noise within the data is an excellent way to prevent overfitting. However, there is also the risk of pausing the training process too early, leading to underfitting. Your goal is to find the 'sweet spot' between underfitting and overfitting. 
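As a sketch of early stopping (assuming scikit-learn and the earlier split; the iteration count and patience are illustrative choices), gradient boosting can stop adding trees once a held-out validation score stops improving:

    # A sketch of early stopping with scikit-learn's gradient boosting.
    # Training stops once the internal validation score has not improved
    # for n_iter_no_change consecutive iterations.
    # Assumes X_train, X_test, y_train, y_test from the earlier split.
    from sklearn.ensemble import GradientBoostingClassifier

    gb = GradientBoostingClassifier(
        n_estimators=500,         # upper bound on boosting iterations
        validation_fraction=0.1,  # share of training data held out internally
        n_iter_no_change=10,      # patience before stopping
        random_state=42,
    )
    gb.fit(X_train, y_train)
    print("Iterations actually used:", gb.n_estimators_)
    print("Test accuracy:", gb.score(X_test, y_test))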

Training with more data

Increasing the amount of data used in the training set can improve the model's accuracy. It makes it easier for the algorithm to identify the signal, essentially minimizing errors. In addition, as the model is fed more training data, it becomes less likely to overfit individual samples and is better able to generalize to new data. 

It also provides more opportunities for the model to understand the relationship between the input and output variables. This process is more effective when clean, relevant data is added; if not, you could simply be adding more noise and complexity to your model. 
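One way to judge whether more data would actually help (a sketch, assuming scikit-learn and the synthetic X and y created earlier) is to look at a learning curve:

    # A sketch of a learning curve: cross-validated accuracy as a function
    # of how much training data the model sees.
    # Assumes the X and y arrays from the train/test split sketch.
    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.tree import DecisionTreeClassifier

    sizes, train_scores, val_scores = learning_curve(
        DecisionTreeClassifier(random_state=42), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    )
    for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"{n:4d} samples -> train {tr:.2f}, validation {va:.2f}")
    # If the validation score is still climbing at the largest size,
    # collecting more data is likely to help.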

Data Augmentation

The definition of Data Augmentation on Wikipedia is “techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data. Data Augmentation helps in regularization to reduce overfitting when training a machine learning model.”

While it is better to collect more clean, relevant data, doing so can be expensive. The less expensive alternative is Data Augmentation, which allows you to make the available data appear more diverse.

For example, Data Augmentation artificially increases the size of our dataset by making the sample data look slightly different each time the model processes it. This prevents the model from memorizing the exact characteristics of individual examples and makes each pass over the data appear unique to the model. 
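
As a sketch for an image task (assuming the torchvision library; the specific transforms and parameters are illustrative), a random augmentation pipeline ensures the model never sees exactly the same input twice:

    # A sketch of image data augmentation with torchvision.
    # Each time an image is loaded it is randomly flipped, rotated,
    # and colour-jittered before being converted to a tensor.
    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.RandomRotation(degrees=10),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
        transforms.ToTensor(),
    ])
    # Pass `augment` as the transform argument of a torchvision dataset, e.g.
    # datasets.CIFAR10(root="data", train=True, transform=augment)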

There is also the method of adding noise to the input or output data, which can make the model more robust. Adding noise makes the data more diverse without affecting its quality or privacy. However, noise should be added in moderation, so that it does not drown out the underlying signal.
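
For tabular data, a minimal sketch of noise injection (assuming numpy and the X_train/y_train arrays from the earlier split; the noise scale is an illustrative choice) looks like this:

    # A sketch of noise injection on tabular inputs using numpy.
    # Small Gaussian perturbations create extra, slightly different copies
    # of the training rows. Assumes X_train and y_train from the earlier split.
    import numpy as np

    rng = np.random.default_rng(42)
    X_noisy = X_train + rng.normal(loc=0.0, scale=0.05, size=X_train.shape)

    # Train on the original rows plus their noisy copies.
    X_aug = np.vstack([X_train, X_noisy])
    y_aug = np.concatenate([y_train, y_train])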

Feature Selection

Feature Selection is the process of reducing the number of input variables by selecting only the relevant features to use in the model. When building a model, you will have several features available to predict a specific output. However, some of these features may be redundant or carry no real relationship to the output, essentially being of no use. 

We can test the different features by training them on individual models. This can help us evaluate the model’s generalization capabilities. 

Reducing the number of input variables improves the overall performance of the model and reduces the computational cost of modeling. Feature Selection is commonly mistaken for Dimensionality Reduction; although both methods simplify models, they are different: Feature Selection keeps a subset of the original features, whereas Dimensionality Reduction transforms them into a new, smaller set of features. 
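
As a sketch of one common approach (assuming scikit-learn and the earlier split; k=10 is an illustrative choice), univariate feature selection keeps only the features with the strongest statistical relationship to the target:

    # A sketch of univariate feature selection with scikit-learn.
    # Keeps the k features most strongly related to the target.
    # Assumes X_train, X_test, y_train from the earlier split.
    from sklearn.feature_selection import SelectKBest, f_classif

    selector = SelectKBest(score_func=f_classif, k=10)
    X_train_sel = selector.fit_transform(X_train, y_train)
    X_test_sel = selector.transform(X_test)
    print("Kept feature indices:", selector.get_support(indices=True))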

Ensembling

Ensemble methods are machine learning techniques that create multiple models and combine their predictions to produce improved results. Because they collate several models, ensembles usually deliver better predictive performance than any single model. The most popular ensembling methods are bagging and boosting.

Bagging

Bagging is short for 'Bootstrap Aggregation' and is a way to decrease the variance of the prediction model. Algorithms that have high variance include decision trees, such as classification and regression trees (CART).

Bagging aims to reduce the chance of overfitting complex models by generating additional training data from the original dataset through random sampling with replacement. It trains many "strong" learners (unconstrained models) in parallel and then combines their predictions, typically by voting or averaging, to smooth them out. 
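
A minimal sketch of bagging (assuming scikit-learn and the earlier split; the number of estimators is illustrative):

    # A sketch of bagging with scikit-learn. BaggingClassifier trains each
    # copy of its base estimator (a decision tree by default) on a bootstrap
    # sample and combines their votes, which lowers variance.
    # Assumes X_train, X_test, y_train, y_test from the earlier split.
    from sklearn.ensemble import BaggingClassifier

    bagging = BaggingClassifier(n_estimators=100, random_state=42)
    bagging.fit(X_train, y_train)
    print("Bagging test accuracy:", bagging.score(X_test, y_test))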

Boosting

Boosting refers to converting a weak learner into a stronger one. The aim of Boosting is to decrease bias error and build strong predictive models.

This is done by training many "weak" learners (constrained models) in sequence, with each one focusing on learning from the mistakes of the one before it. All the weak learners are then combined into a single strong learner. 
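
A minimal sketch of boosting (assuming scikit-learn and the earlier split; the number of estimators is illustrative):

    # A sketch of boosting with scikit-learn. AdaBoostClassifier fits a
    # sequence of shallow trees, each re-weighted to focus on the examples
    # the previous trees got wrong, then combines them into one model.
    # Assumes X_train, X_test, y_train, y_test from the earlier split.
    from sklearn.ensemble import AdaBoostClassifier

    boosting = AdaBoostClassifier(n_estimators=100, random_state=42)
    boosting.fit(X_train, y_train)
    print("Boosting test accuracy:", boosting.score(X_test, y_test))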

Nisha Arya is a Data Scientist and Technical writer from London. 

Having worked in the world of Data Science, she is particularly interested in providing Data Science career advice or tutorials and theory-based knowledge around Data Science. She is a keen learner seeking to broaden her tech knowledge and writing skills while helping guide others.
