“Deep Learning”: Optimization Techniques

Error, Loss Function, Optimization

1. Feature Scaling:

Where to use Feature Scaling

Advantages :

Disadvantages :

2. Batch normalization :

Why do we use batch normalization?

How does batch normalization work?

Advantages :

  • Reduces the vanishing gradient problem
  • Makes training less sensitive to weight initialization
  • Allows much larger learning rates, which speeds up the learning process
  • Acts like a regularizer

Disadvantages :

  • Slower predictions due to the extra computations at each layer
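
The extra per-layer computation mentioned above can be made concrete with a minimal sketch of the training-time forward pass. This is a plain-Python illustration, not a framework implementation: the function name `batch_norm_forward`, the parameters `gamma`, `beta`, and `eps`, and the sample activations are all assumptions chosen for the example. At prediction time, frameworks typically replace the batch statistics with running averages collected during training.

```python
import math

def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of rows, each row a list of feature values (hypothetical data).
    gamma, beta: per-feature scale and shift (learned in a real network).
    """
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature mean and (biased) variance over the batch dimension.
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    normalized = []
    for row in batch:
        normalized.append([
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(n_features)
        ])
    return normalized

# Two features on very different scales end up with comparable ranges.
activations = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
normalized = batch_norm_forward(activations, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` set to 1 and `beta` set to 0, each output column has a mean of roughly zero and a variance of roughly one.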

3. Mini Batch Gradient descent (MGD) :

What is batch size?

Advantages :

  • The model is updated more often than in batch gradient descent (BGD): MGD does not wait for the entire dataset. It passes a small batch of records (for example 50, 100, 200, or 256) through the network and then runs an optimization step.
  • Batching gives both memory efficiency, because the full training set never has to be held in memory at once, and efficient algorithm implementations. It also limits memory consumption, since losses are stored per batch rather than for every record in the dataset.
  • Batched updates are computationally more efficient than stochastic gradient descent (SGD).

Disadvantages :

  • Convergence of the error is not guaranteed.
  • A batch of, say, 50 records does not represent the properties (or variance) of the entire dataset. Each update therefore follows a noisy estimate of the gradient, so the error may never settle exactly on a global or local minimum.
  • Because records are processed in batches, the error differs from one batch to the next, so the learning rate has to be controlled by hand when using MGD. If the learning rate is too low, convergence is slow. If it is too high, the updates overshoot and never reach a global or local minimum.
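
The trade-offs above are easier to see in a small loop. The following is a minimal sketch of mini-batch gradient descent on a one-variable linear regression with a mean squared error loss. The function name `minibatch_gd`, the default `batch_size`, `lr`, and `epochs` values, and the noise-free toy data are assumptions for illustration only.

```python
import random

def minibatch_gd(xs, ys, batch_size=4, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by updating on one small batch at a time."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a different batch composition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2 * err * xs[i]
                grad_b += 2 * err
            # One parameter update per batch, using the batch-average gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b

# Hypothetical noise-free data generated from y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_gd(xs, ys)
```

Because this data has no noise, every batch agrees on the same optimum and the loop recovers `w` close to 3 and `b` close to 1. On real data the batches disagree, which is where the learning rate becomes the sensitive setting described above.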

4. Gradient descent with momentum :

Implementation :
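
A minimal sketch of one common momentum formulation, where the velocity is an exponentially weighted average of past gradients. The function name `momentum_step`, the defaults `lr=0.1` and `beta=0.9`, and the toy objective f(w) = w² are assumptions for illustration. Other texts use the variant `velocity = beta * velocity + grad`, which differs only in how the learning rate is scaled.

```python
def momentum_step(w, velocity, grad, lr=0.1, beta=0.9):
    """One gradient descent step with momentum (moving-average form)."""
    # Blend the previous velocity with the current gradient.
    velocity = beta * velocity + (1 - beta) * grad
    # Move along the smoothed direction instead of the raw gradient.
    w = w - lr * velocity
    return w, velocity

# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w.
w, v = 5.0, 0.0
for _ in range(200):
    grad = 2 * w
    w, v = momentum_step(w, v, grad)
```

After the loop, `w` is close to the minimum at 0.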

Advantages :

Disadvantages :

5. Adam optimization :

Advantages :

Disadvantages :

6. RMSProp optimization :

Gradient descent with momentum
RMSprop optimizer

Disadvantages :

7. Learning rate decay :

lr *= (1. / (1. + self.decay * self.iterations))
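
The expression above is a time-based (inverse time) decay. A minimal standalone sketch follows, assuming the factor is applied to the base learning rate at every step rather than compounded from one step to the next. The function name `time_based_decay` and the sample values are hypothetical.

```python
def time_based_decay(initial_lr, decay, iteration):
    """Learning rate after `iteration` updates under inverse time decay."""
    return initial_lr * (1.0 / (1.0 + decay * iteration))

# Learning rate at iterations 0, 100, and 1000 with a base rate of 0.1.
schedule = [time_based_decay(0.1, 0.01, t) for t in (0, 100, 1000)]
```

With `decay=0.01`, the rate is unchanged at iteration 0, halves by iteration 100, and shrinks to one eleventh of the base rate by iteration 1000.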

student at holbertonschool

Hamdi Ghorbel
