
Problems of vanishing gradients and exploding gradients

by Yee Wei Law - Wednesday, 31 May 2023, 10:39 PM
 

This knowledge base entry follows the discussion of artificial neural networks and backpropagation.

The backpropagation (“backprop” for short) algorithm computes the gradient of the loss with respect to each weight, and these gradients are then used to update the weights.

Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged, and training fails to converge to a good solution — this is called the vanishing gradients problem [G22, Ch. 11].

The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges — this is the exploding gradients problem [G22, Ch. 11].

Both problems plague deep neural networks (DNNs) and, when processing very long sequences, recurrent neural networks (RNNs) [Mur22, Sec. 13.4.2].

More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.
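
The vanishing effect is easy to observe directly. Below is a minimal PyTorch sketch (the depth, layer widths and dummy loss are arbitrary choices for illustration): after a single backward pass through a deep stack of sigmoid layers, the gradient norm of the first (lowest) layer typically comes out orders of magnitude smaller than that of the last layer.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # A deliberately deep stack of small sigmoid layers.
    layers = []
    for _ in range(20):
        layers += [nn.Linear(32, 32), nn.Sigmoid()]
    model = nn.Sequential(*layers, nn.Linear(32, 1))

    x = torch.randn(64, 32)
    loss = model(x).pow(2).mean()  # dummy loss, only to obtain gradients
    loss.backward()

    # Compare gradient magnitudes at the bottom and top of the network.
    print("first layer grad norm:", model[0].weight.grad.norm().item())
    print("last layer grad norm: ", model[-1].weight.grad.norm().item())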

Watch Prof Ng’s explanation of the problems:

The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].

  • The causes were traced to 1️⃣ the use of the sigmoid activation function, and 2️⃣ the initialisation of weights with a zero-mean Gaussian distribution with standard deviation 1.
  • A sigmoid function saturates at 0 or 1, and when it saturates, its derivative is nearly 0, so almost no gradient flows back through that layer.
  • As a remedy, current best practices include using 1️⃣ a rectifier activation function, and 2️⃣ the weight initialisation algorithm called He initialisation.
  • He initialisation [HZRS15, Sec. 2.2]: at layer $l$, the weights follow the zero-mean Gaussian distribution with variance $2/n_l$, where $n_l$ is the fan-in, i.e., the number of inputs/weights feeding into layer $l$.
  • He initialisation is implemented by the PyTorch function kaiming_normal_ and the TensorFlow initializer HeNormal; a short PyTorch example follows this list.
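
For illustration, the following is a minimal PyTorch sketch of He initialisation applied to a fully connected layer (the layer sizes are arbitrary):

    import torch.nn as nn

    layer = nn.Linear(256, 128)  # fan-in = 256
    # He/Kaiming normal initialisation: zero-mean Gaussian with variance 2 / fan-in,
    # intended for layers followed by a rectifier (ReLU) activation.
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    nn.init.zeros_(layer.bias)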

Watch Prof Ng’s explanation of weight initialisation:

References

[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.


PyTorch

by Yee Wei Law - Tuesday, 21 May 2024, 1:27 PM
 

Installation instructions:

  • Assuming the conda environment called pt (for “PyTorch”) does not yet exist, create and activate it using the commands:

    conda create -n pt
    conda activate pt
  • Install the latest version of PyTorch (version 2.3 as of this writing), assuming a CUDA-compatible GPU is present:

    conda install pytorch torchvision torchaudio pytorch-cuda=12.1 matplotlib jupyterlab nbdime -c pytorch -c nvidia

    Notice the conda channels being used 👆 are pytorch and nvidia.

    Due to the large total download size, more than one installation attempt may be necessary.

    For those without a CUDA-compatible GPU, use the command instead: conda install pytorch torchvision torchaudio cpuonly matplotlib jupyterlab nbdime -c pytorch.

  • Check if CUDA is available through PyTorch. In a Python shell, run the code:

    import torch; torch.cuda.is_available()

    If the output is True, CUDA is available, and the command below will show the name of your GPU:

    torch.cuda.get_device_name()
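
    Putting the two checks together, here is a minimal sketch (assuming the installation above succeeded; the tensor values are arbitrary) that also moves a small tensor onto the GPU as a sanity check:

        import torch

        if torch.cuda.is_available():
            print("CUDA device:", torch.cuda.get_device_name())
            # Sanity check: create a tensor on the GPU and bring the result back.
            x = torch.ones(3, device="cuda")
            print((x * 2).cpu())
        else:
            print("No CUDA-compatible GPU detected; PyTorch will run on the CPU.")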