Problems of vanishing gradients and exploding gradients
This knowledge base entry follows the discussion of artificial neural networks and backpropagation. The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight. Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged and training fails to converge to a good solution; this is called the vanishing gradients problem [G22, Ch. 11]. The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges; this is the exploding gradients problem [G22, Ch. 11]. Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2]. More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.

Watch Prof Ng’s explanation of the problems:

These problems were observed decades ago and were among the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].
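As a concrete illustration, the NumPy sketch below (an illustrative toy, not taken from the references; the layer count, width, weight scales and activations are arbitrary choices) backpropagates a gradient through a stack of randomly initialised dense layers and prints the gradient norm at every fifth layer. With saturating sigmoid units the norms shrink geometrically towards the lower layers (vanishing gradients), while ReLU units with larger random weights make them grow geometrically instead (exploding gradients).

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_gradient_norms(n_layers=20, width=50, weight_std=1.0, activation="sigmoid"):
    """Backpropagate a unit upstream gradient through a stack of randomly
    initialised dense layers and return the gradient norm at each layer,
    ordered from the top (output) layer down to the bottom (input) layer."""
    weights = [rng.normal(0.0, weight_std, (width, width)) for _ in range(n_layers)]

    # Forward pass, keeping each layer's activation for the backward pass.
    a = rng.normal(size=(width, 1))
    activations = []
    for W in weights:
        z = W @ a
        a = 1.0 / (1.0 + np.exp(-z)) if activation == "sigmoid" else np.maximum(z, 0.0)
        activations.append(a)

    # Backward pass (chain rule): the gradient w.r.t. a layer's input is
    #   W^T @ (gradient w.r.t. its output * activation'(z)).
    grad = np.ones((width, 1))
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        local_deriv = a * (1.0 - a) if activation == "sigmoid" else (a > 0).astype(float)
        grad = W.T @ (grad * local_deriv)
        norms.append(float(np.linalg.norm(grad)))
    return norms

# Saturating sigmoid units with smallish random weights: the gradient norm
# shrinks by roughly a constant factor per layer and is vanishingly small
# by the time it reaches the lowest layers.
print("sigmoid:", " ".join(f"{g:.1e}" for g in
                           layer_gradient_norms(weight_std=0.1, activation="sigmoid")[::5]))

# ReLU units with larger random weights: the same norms grow geometrically
# instead, i.e. exploding gradients.
print("relu:   ", " ".join(f"{g:.1e}" for g in
                           layer_gradient_norms(weight_std=1.0, activation="relu")[::5]))
```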
Watch Prof Ng’s explanation of weight initialisation:

References