Regularization Techniques

Hamdi Ghorbel
Jan 21, 2021 · 7 min read


Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.

First of all, I want to clarify how this problem of overfitting arises.

When someone wants to model a problem, let's say trying to predict someone's wage based on their age, they will first try a linear regression model with age as the independent variable and wage as the dependent one. This model will mostly fail, since it is too simple.

Then, you might think: well, I also have the sex and the education of each individual in my data set. I could add these as explanatory variables.

Your model becomes more interesting and more complex. You measure its accuracy with a loss metric L(X,Y), where X is your design matrix and Y is the vector of observations, also called targets (here, the wages).

You find out that your results are quite good, but not as perfect as you wish.

So you add more variables: location, profession of parents, social background, number of children, weight, number of books, preferred color, best meal, last holidays destination and so on and so forth.

Your model will do well, but it is probably overfitting, i.e. it will probably have poor prediction and generalization power: it sticks too closely to the data, and the model has probably learned the background noise while being fit. This is, of course, not acceptable.

So how do you solve this?

It is here where the regularization technique comes in handy.

You penalize your loss function by adding a multiple of an L1 (LASSO) or an L2 (Ridge) norm of your weights vector w (it is the vector of the learned parameters in your linear regression). You get the following equation:

L(X,Y) + λN(w)

(N is either the L1 norm, the L2 norm, or any other norm)

This will help you avoid overfitting and will, at the same time, perform feature selection for certain regularization norms (the L1 norm used in LASSO does the job).
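As a minimal sketch of what this penalized loss looks like in code (assuming you already have NumPy arrays X, y and a weight vector w; the function below is purely illustrative):

import numpy as np

def penalized_loss(X, y, w, lam, norm="l2"):
    """Least-squares loss plus an L1 or L2 penalty on the weights w."""
    residuals = y - X @ w              # prediction errors
    loss = np.sum(residuals ** 2)      # L(X, Y): sum of squared errors
    if norm == "l1":
        penalty = np.sum(np.abs(w))    # LASSO penalty: ||w||_1
    else:
        penalty = np.sum(w ** 2)       # Ridge penalty: squared ||w||_2
    return loss + lam * penalty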

Finally, you might ask: OK, I have everything now. How can I tune the regularization parameter λ?

One possible answer is to use cross-validation: you divide your training data into subsets, train your model on some of them for a fixed value of λ, test it on the remaining subset, and repeat this procedure while varying λ. You then select the λ that minimizes your validation loss.
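As a hedged sketch of this procedure with scikit-learn (Ridge and 5-fold cross-validation are illustrative choices; X and y are assumed to be your design matrix and target vector):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# candidate values for the regularization strength (lambda is called alpha in scikit-learn)
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

best_lambda, best_score = None, -np.inf
for lam in lambdas:
    # 5-fold cross-validation: train on four folds, validate on the remaining one
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    if scores.mean() > best_score:
        best_lambda, best_score = lam, scores.mean()

print("best lambda:", best_lambda)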

L1 and L2 Regularization :

L1 and L2 regularization owe their names to the L1 and L2 norms of a vector w, respectively. Here's a primer on norms:

L1 norm: ||w||₁ = |w₁| + |w₂| + … + |wₙ|
L2 norm: ||w||₂ = √(w₁² + w₂² + … + wₙ²)
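Both norms are available directly in NumPy if you want to check the formulas on an arbitrary example vector:

import numpy as np

w = np.array([1.0, -2.0, 3.0])
print(np.sum(np.abs(w)), np.linalg.norm(w, ord=1))   # L1 norm: 6.0 both ways
print(np.sqrt(np.sum(w ** 2)), np.linalg.norm(w))    # L2 norm: about 3.742 both ways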

A linear regression model that implements the L1 norm for regularization is called lasso regression, and one that implements the (squared) L2 norm for regularization is called ridge regression. To implement these two, note that the linear regression model stays the same:

ŷ = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ

but it is the calculation of the loss function that includes these regularization terms:

Loss function with no regularization: L = Σ (yᵢ − ŷᵢ)²
Loss function with L1 regularization (lasso): L = Σ (yᵢ − ŷᵢ)² + λ Σ |wⱼ|
Loss function with L2 regularization (ridge): L = Σ (yᵢ − ŷᵢ)² + λ Σ wⱼ²

The regularization terms act as 'constraints' that the optimization algorithm must adhere to when minimizing the loss function, on top of minimizing the error between the true y and the predicted ŷ.
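To see this effect in practice, here is a small sketch with scikit-learn on synthetic data (the data and hyperparameters are illustrative only); lasso typically drives some coefficients exactly to zero, which is the feature selection behaviour mentioned earlier:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 1.5] + [0.0] * 7)   # only the first three features matter
y = X @ true_w + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sets some coefficients to exactly zero

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))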

Dropout :

Figure: original network vs. network with some nodes dropped out

The term “dropout” refers to dropping out units (both hidden and visible) in a neural network.

Simply put, dropout refers to ignoring a randomly chosen set of units (i.e. neurons) during the training phase. By "ignoring", I mean these units are not considered during a particular forward or backward pass.

More technically, at each training stage, individual nodes are either dropped out of the network with probability 1 − p or kept with probability p, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

Why do we need Dropout?

Given that we know a bit about dropout, a question arises: why do we need dropout at all? Why do we need to literally shut down parts of a neural network?

The answer to these questions is “to avoid over-fitting”.

A fully connected layer holds most of the parameters, and hence neurons develop co-dependency amongst each other during training, which curbs the individual power of each neuron and leads to over-fitting of the training data.

Now that we know a little bit about dropout and the motivation, let’s go into some detail. If you just wanted an overview of dropout in neural networks, the above two sections would be sufficient. In this section, I will touch upon some more technicality.

In machine learning, regularization is a way to prevent over-fitting. Regularization reduces over-fitting by adding a penalty to the loss function. With this penalty, the model is trained such that it does not learn an interdependent set of feature weights. Those of you who know logistic regression might be familiar with the L1 (Laplacian) and L2 (Gaussian) penalties.

Dropout is an approach to regularization in neural networks which helps reduce interdependent learning amongst the neurons.

Training Phase:

For each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction, 1 − p, of nodes (and corresponding activations).

Testing Phase:

Use all activations, but reduce them by a factor p (to account for the missing activations during training).
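A minimal NumPy sketch of exactly this scheme (p is an illustrative keep probability; activations stands for the output of any hidden layer):

import numpy as np

p = 0.8  # probability of keeping a unit, so each unit is dropped with probability 1 - p

def dropout_train(activations, p):
    # training phase: zero out each activation independently with probability 1 - p
    mask = (np.random.rand(*activations.shape) < p).astype(activations.dtype)
    return activations * mask

def dropout_test(activations, p):
    # testing phase: keep all activations but scale them by a factor p
    return activations * p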

Some Observations:

  1. Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
  2. Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.
  3. With H hidden units, each of which can be dropped, we have 2^H possible models. In the testing phase, the entire network is considered and each activation is reduced by a factor p.
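In practice you rarely implement this by hand. Keras, for example, exposes a Dropout layer; note that its rate argument is the fraction of units to drop (i.e. 1 − p), and that Keras uses the "inverted" variant, which rescales activations during training instead of at test time. A minimal, purely illustrative model:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.5),   # drop 50% of the previous layer's units during training
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])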

Data Augmentation :

Having more data (samples) is the best way to get better, more consistent estimators (ML models). In the real world, getting a large volume of useful data for training a model is cumbersome, and labelling is an extremely tedious task.

Labelling usually requires manual annotation. For example, to build a better image classifier we can use Mechanical Turk and involve more people to generate a labelled dataset, or run surveys on social media and ask people to participate and contribute data. These processes can yield good datasets, but they are difficult to carry out and expensive. Having a small dataset, in turn, leads to the well-known overfitting problem.

Data augmentation is an interesting regularization technique that addresses this problem. The concept is very simple: the technique generates new training data from the given original dataset. Data augmentation provides a cheap and easy way to increase the amount of your training data.

It is worth knowing that Keras provides ImageDataGenerator for performing data augmentation on images.
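A hedged sketch of such a generator (the specific transformations and their ranges are illustrative choices, and train_images / train_labels are assumed to be image arrays you already have):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # random rotations of up to 15 degrees
    width_shift_range=0.1,    # random horizontal shifts of up to 10% of the width
    height_shift_range=0.1,   # random vertical shifts of up to 10% of the height
    zoom_range=0.1,           # random zooms
    horizontal_flip=True,     # random horizontal flips
)

# train_images: (num_samples, height, width, channels), train_labels: (num_samples,)
# model.fit(datagen.flow(train_images, train_labels, batch_size=32), epochs=10)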

Sample code for random deletion, a simple augmentation technique for text data:

import random

def random_deletion(words, p):
    """Randomly delete words from the sentence with probability p."""
    # obviously, if there's only one word, don't delete it
    if len(words) == 1:
        return words

    # randomly delete each word with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    # if you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words) - 1)
        return [words[rand_int]]

    return new_words
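Called on a tokenized sentence (the example input and the printed output are only illustrative, since the deletions are random):

sentence = "the quick brown fox jumps over the lazy dog".split()
print(random_deletion(sentence, p=0.2))
# e.g. ['the', 'quick', 'fox', 'jumps', 'over', 'lazy', 'dog']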

Furthermore, when comparing two machine learning algorithms, train both on either the augmented or the non-augmented dataset. Otherwise, no objective decision can be made on which algorithm performed better.

Early Stopping :

A problem with training neural networks is the choice of the number of training epochs to use.

Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model's performance stops improving on a hold-out validation dataset.

Figure: Stop training (training and validation accuracy per epoch)

From the figure Stop training, it can be observed that:

  • The training set accuracy continues to increase through all the epochs.
  • The validation set accuracy, however, saturates before all epochs are completed. This is the point at which training of the model can be stopped.

Early stopping, hence, not only protects against overfitting but also needs a considerably smaller number of epochs to train.

Code excerpt: the code below holds out 20% of the training data as a validation set.

from tensorflow import keras
from sklearn.model_selection import train_test_split

fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
trn_images, valid_images, trn_labels, valid_labels = train_test_split(
    train_images, train_labels, test_size=0.2)
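To actually stop training once the validation metric stops improving, Keras provides the EarlyStopping callback. A minimal sketch continuing the split above (the model architecture and hyperparameters are illustrative assumptions, not part of the original excerpt):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# stop when the validation loss has not improved for 3 consecutive epochs,
# and restore the weights from the best epoch seen so far
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=3,
                                           restore_best_weights=True)

model.fit(trn_images / 255.0, trn_labels,
          epochs=100,
          validation_data=(valid_images / 255.0, valid_labels),
          callbacks=[early_stop])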

Loss function regularization (L1/L2), dropout, data augmentation, and early stopping are the regularization techniques covered here; several others exist and are worth exploring further.


