
Problems of vanishing gradients and exploding gradients

by Yee Wei Law - Monday, 27 January 2025, 4:46 PM

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight.

Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged, and training fails to converge to a good solution — this is called the vanishing gradients problem [G22, Ch. 11].

The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges — this is the exploding gradients problem [G22, Ch. 11].

Both problems plague deep neural networks (DNNs), as well as recurrent neural networks (RNNs) that process very long sequences [Mur22, Sec. 13.4.2].

More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.
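The shrinking and growing of gradients can be illustrated with a toy calculation. The sketch below is not taken from the cited references: it assumes a chain of one-unit layers, and the function names `sigmoid_chain_gradient` and `linear_chain_gradient` are hypothetical. By the chain rule, the gradient of the output with respect to the input is a product of one factor per layer, so factors below 1 shrink it geometrically and factors above 1 grow it geometrically.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_chain_gradient(weights, x=0.5):
    """|d(output)/d(input)| for a chain of one-unit sigmoid layers.

    Layer l computes a_l = sigmoid(w_l * a_{l-1}), so by the chain rule the
    gradient is the product of w_l * sigmoid'(z_l) over all layers.
    """
    a, grad = x, 1.0
    for w in weights:
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)  # sigmoid'(z) = a * (1 - a) <= 0.25
    return abs(grad)


def linear_chain_gradient(weights):
    """|d(output)/d(input)| for a chain of one-unit linear layers."""
    grad = 1.0
    for w in weights:
        grad *= w
    return abs(grad)


# Toy settings (assumed for illustration): unit weights for the sigmoid
# chain, weights of 2 for the linear chain.
shallow = sigmoid_chain_gradient([1.0] * 2)
deep = sigmoid_chain_gradient([1.0] * 20)
exploding = linear_chain_gradient([2.0] * 20)

print(f"sigmoid, 2 layers:  {shallow:.3e}")
print(f"sigmoid, 20 layers: {deep:.3e}")
print(f"linear, 20 layers with weight 2: {exploding:.3e}")
```

With unit weights, each sigmoid layer contributes a factor of at most 0.25, so the 20-layer gradient is bounded by 0.25 to the power 20. Real networks have many units per layer, but the same multiplicative effect applies.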


The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].

The causes were traced to 1️⃣ the use of sigmoid activation functions, and 2️⃣ the initialisation of weights from a zero-mean Gaussian distribution with standard deviation 1.
A sigmoid function saturates at 0 or 1, and when it is saturated, its derivative is nearly 0, so almost no gradient is propagated back through the layer.
As a remedy, current best practices include using 1️⃣ a rectifier (ReLU) activation function, and 2️⃣ the weight initialisation algorithm called He initialisation.
He initialisation [HZRS15, Sec. 2.2]: at layer $\ell$, weights follow the zero-mean Gaussian distribution with variance $2/n_\ell$, where $n_\ell$ is the fan-in, i.e., the number of inputs (and hence weights) feeding into each unit of layer $\ell$.
He initialisation is implemented by the PyTorch function torch.nn.init.kaiming_normal_ and the TensorFlow (Keras) initialiser HeNormal.
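The variance rule above can be checked numerically. The following is a minimal pure-Python sketch, not the PyTorch or TensorFlow implementation; the helper names `he_normal` and `sample_variance`, the layer sizes, and the seed are all assumptions made for illustration. It draws weights from a zero-mean Gaussian with standard deviation $\sqrt{2/n_\ell}$ and compares the sample variance with the target $2/n_\ell$.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sample_variance(matrix):
    """Unbiased sample variance over all entries of a list-of-lists matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / (len(values) - 1)


rng = random.Random(0)        # fixed seed so the run is repeatable
fan_in, fan_out = 512, 256    # hypothetical layer sizes
weights = he_normal(fan_in, fan_out, rng)

print(f"target variance: {2.0 / fan_in:.6f}")
print(f"sample variance: {sample_variance(weights):.6f}")
```

In practice the library initialisers named above would be used instead; this sketch only makes the scaling with fan-in explicit.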


References

[G22]	A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15]	K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[Mur22]	K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.

PyTorch

by Yee Wei Law - Saturday, 31 May 2025, 2:38 PM

Installation instructions:

Assuming the conda environment called pt (for “PyTorch”) does not yet exist, create and activate it using the commands:
```
conda create -n pt python=3.12
conda activate pt
```
Install the necessary conda packages:
```
conda install -c conda-forge jupyterlab jupyterlab-git lightning matplotlib nbdime pandas scikit-learn seaborn
```
Install the latest version of PyTorch (version 2.7 as of writing) through pip.

Option 1: If you have a CUDA-capable GPU (so far, only NVIDIA GPUs are), use the command below, where the string cu128 indicates CUDA version 12.8. If you have a different version of CUDA, change 128 to match the version you have:
```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```
Install the matching version of CUDA Toolkit from NVIDIA.

Option 2: If you do not have a CUDA-capable GPU, use the command below:
```
pip3 install torch torchvision torchaudio
```
Due to the total size of files to be installed, more than one installation attempt may be necessary.
Check if CUDA is available through PyTorch by running the command below in the command line:
```
python -c "import torch; print(torch.cuda.get_device_name() if torch.cuda.is_available() else 'No CUDA')"
```
The command above prints the name of your GPU if the preceding installation succeeded and you have a CUDA-capable GPU; otherwise, it prints No CUDA.
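The same check can be kept as a small script for reuse in notebooks. This is a sketch under assumptions: the function name `describe_accelerator` is hypothetical, and the guard for a missing PyTorch installation is an addition that the one-line command does not have.

```python
import importlib.util


def describe_accelerator():
    """Return the CUDA device name, 'No CUDA', or a note if PyTorch is missing."""
    # Look for the package first so the script does not crash on ImportError.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    if torch.cuda.is_available():
        return torch.cuda.get_device_name()
    return "No CUDA"


print(describe_accelerator())
```

Run it inside the activated pt environment so that the check reflects the packages installed there.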
