A

Activation function: contemporary options

by Yee Wei Law - Wednesday, 31 May 2023, 10:43 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

Contemporary options for the activation function are the non-saturating activation functions [Mur22, Sec. 13.4.3], although the term is not accurate.

Below, $z$ should be understood as the output of the summing junction.

  • The rectified linear unit (ReLU) [NH10] is the unipolar function:

    $$\operatorname{ReLU}(z) = \max(0, z).$$

    ReLU is differentiable except at $z = 0$, but by definition, $\operatorname{ReLU}'(z) = 0$ for $z \le 0$.

    ReLU has the advantage of having well-behaved derivatives, which are either 0 or 1.

    This simplifies optimisation [ZLLS23, Sec. 5.1.2.1] and mitigates the infamous vanishing gradients problem associated with traditional activation functions.

    ReLU has gained dominance since its introduction.

    ReLU is implemented by the PyTorch function ReLU.

    However, ReLU suffers from the 💀 “dying ReLU” problem during training, when some neurons stop outputting anything other than 0 [G22, Ch. 11]:

    • During training, if a neuron’s weights get updated such that the weighted sum of the neuron’s inputs is negative, the neuron will start outputting 0.
    • When this happens, the neuron is unlikely to resurrect since the gradient of the ReLU function is 0 when its input is negative.
    • In some cases, half of the neurons die, especially when a large learning rate is used.
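  • The leaky ReLU (LReLU) [MHN+13] is one of the earliest extensions of ReLU:

    $$\operatorname{LReLU}(z) = \max(\alpha z, z),$$

    where $\alpha \in (0, 1)$ is fixed and typically set to $0.01$.

    LReLU is differentiable except at $z = 0$, but by definition, $\operatorname{LReLU}'(z) = \alpha$ for $z \le 0$, thus avoiding the dying ReLU problem.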

  • The parametric ReLU (PReLU) [HZRS15] extends LReLU:

    $$\operatorname{PReLU}(z) = \max(0, z) + \alpha \min(0, z),$$

    where $\alpha$ is a tunable parameter controlling the slope of the negative part of PReLU, and is to be learnt jointly with the model in end-to-end training.

    PReLU is implemented by the PyTorch function PReLU.

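  • The exponential linear unit (ELU) [CUH16] is a smooth extension of LReLU:

    $$\operatorname{ELU}(z) = \begin{cases} z & \text{if } z > 0, \\ \alpha (e^{z} - 1) & \text{if } z \le 0, \end{cases}$$

    where $\alpha > 0$ is fixed; see Fig. 1.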

    ELU is implemented by the PyTorch function ELU.

    Fig. 1: A plot of the response of an ELU.
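  • The scaled exponential linear unit or self-normalising ELU (SELU) [KUMH17] extends ELU:

    $$\operatorname{SELU}(z) = \lambda \begin{cases} z & \text{if } z > 0, \\ \alpha (e^{z} - 1) & \text{if } z \le 0, \end{cases}$$

    where $\lambda > 1$ ensures a slope larger than 1 for positive inputs; see Fig. 2.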

    SELU was invented for self-normalising neural networks (SNNs), which are meant to 1️⃣ be robust to perturbations, and 2️⃣ not have high variance in their training errors.

    SNNs push neuron activations to zero mean and unit variance, leading to the same effect as batch normalisation, which enables robust deep learning.

    SELU is implemented by the PyTorch function SELU.

    Fig. 2: A plot of the response of a SELU.
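  • The Gaussian error linear unit (GELU) [HG20] extends ReLU and ELU:

    $$\operatorname{GELU}(z) = z \, \Phi(z) = \frac{z}{2} \left[ 1 + \operatorname{erf}\left( \frac{z}{\sqrt{2}} \right) \right],$$

    where $\Phi$ is the cumulative distribution function for the standard Gaussian distribution, and $\operatorname{erf}$ is the error function $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, \mathrm{d}t$.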

    Unlike most other activation functions, GELU is not convex or monotonic; the increased curvature and non-monotonicity may allow GELUs to more easily approximate complicated functions than ReLUs or ELUs can.

    ReLU gates the input depending upon its sign, whereas GELU weights its input depending upon how much greater it is than other inputs.

    GELU is a popular choice for implementing transformers; see for example Hugging Face’s implementation of activation functions.

    GELU is implemented by the PyTorch function GELU.

    Fig. 3: A plot of the response of a GELU.
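The activation functions listed above can be sketched in plain Python for scalar inputs, which may help when comparing their behaviour for negative inputs. This is a minimal sketch rather than the PyTorch implementations: the function names are illustrative, the default values of alpha (0.01 for LReLU, 1.0 for ELU) are assumptions, and the SELU constants (approximately 1.6733 and 1.0507) are the values reported in [KUMH17].

```python
import math

# SELU constants as reported in [KUMH17]; all other defaults are assumptions.
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805


def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU, taken to be 0 for z <= 0."""
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.01):
    """LReLU(z) = max(alpha * z, z) for a fixed alpha in (0, 1)."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of LReLU, taken to be alpha for z <= 0 (never exactly 0)."""
    return 1.0 if z > 0 else alpha


def prelu(z, alpha):
    """PReLU has the same form as LReLU, but alpha is learnt during training."""
    return max(0.0, z) + alpha * min(0.0, z)


def elu(z, alpha=1.0):
    """ELU(z) = z for z > 0, and alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)


def selu(z):
    """SELU is ELU with a specific alpha, scaled by lambda > 1."""
    return SELU_LAMBDA * elu(z, SELU_ALPHA)


def gelu(z):
    """GELU(z) = z * Phi(z), with Phi the standard Gaussian CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


# Print a small comparison table over a few sample inputs.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, relu(z), leaky_relu(z), elu(z), selu(z), gelu(z))
```

Note that `relu_grad` returns 0 for every negative input, which is the reason a dead ReLU neuron is unlikely to recover, whereas `leaky_relu_grad` always returns a nonzero value. The non-monotonicity of GELU shows up in the fact that `gelu(-3.0)` is larger than `gelu(-1.0)`.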

References

[CUH16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in ICLR, 2016. Available at https://arxiv.org/abs/1511.07289.
[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[HG20] D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415, 2020, first appeared in 2016.
[KUMH17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, Self-normalizing neural networks, in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), 30, Curran Associates, Inc., 2017. Available at https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
[MHN+13] A. L. Maas, A. Y. Hannun, A. Y. Ng, and others, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 2013. Available at http://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[NH10] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on Machine Learning, ICML’10, Omnipress, Madison, WI, USA, 2010, pp. 807–814.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 17 Feb 2023. Available at https://d2l.ai/.

Active learning

by Yee Wei Law - Wednesday, 25 October 2023, 9:39 AM
 


Adversarial machine learning

by Yee Wei Law - Saturday, 18 November 2023, 4:31 PM
 

Adversarial machine learning (AML) as a field can be traced back to [HJN+11].

The impact of adversarial examples on deep learning is well known within the computer vision community, and documented in a body of literature that has been growing exponentially since Szegedy et al.’s discovery [SZS+14].

The field is moving so fast that the taxonomy, terminology and threat models are still being standardised.

See MITRE ATLAS.

References

[HJN+11] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar, Adversarial machine learning, in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 43–58. https://doi.org/10.1145/2046684.2046692.
[SZS+14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, in International Conference on Learning Representations, 2014. Available at https://research.google/pubs/pub42503/.

Artificial neural networks and backpropagation

by Yee Wei Law - Wednesday, 7 June 2023, 1:00 PM
 
See 👇 attachment or the latest source on Overleaf.

B

Batch normalisation (BatchNorm)

by Yee Wei Law - Saturday, 24 June 2023, 3:32 PM
 

Watch a high-level explanation of BatchNorm:

Watch a more detailed explanation of BatchNorm by Prof Ng:

Watch coverage of BatchNorm in Stanford 2016 course CS231n Lecture 5 Part 2:

References

[IS15] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proceedings of the 32nd International Conference on Machine Learning (F. Bach and D. Blei, eds.), Proceedings of Machine Learning Research 37, PMLR, Lille, France, 07–09 Jul 2015, pp. 448–456.
[LWS+17] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, Revisiting batch normalization for practical domain adaptation, in ICLR workshop, 2017. Available at https://openreview.net/pdf?id=Hk6dkJQFx.
[STIM18] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, How does batch normalization help optimization?, in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), 31, Curran Associates, Inc., 2018. Available at https://proceedings.neurips.cc/paper_files/paper/2018/file/905056c1ac1dad141560467e0a99e1cf-Paper.pdf.
[Zha20] X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, Springer, 2020. https://doi.org/10.1007/978-981-15-2770-8.

C

Cross-entropy loss

by Yee Wei Law - Friday, 31 March 2023, 1:40 PM
 

[Cha19, pp. 11-14]

References

[Cha19] E. Charniak, Introduction to Deep Learning, MIT Press, 2019. Available at https://ebookcentral.proquest.com/lib/unisa/reader.action?docID=6331506.

D

Domain adaptation

by Yee Wei Law - Wednesday, 14 June 2023, 10:55 AM
 

Domain adaptation is learning a discriminative classifier or other predictor in the presence of a shift of data distribution between the source/training domain and the target/test domain [GUA+16].

References

[GUA+16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, Domain-adversarial training of neural networks, Journal of Machine Learning Research 17 no. 59 (2016), 1–35.

Dropout

by Yee Wei Law - Tuesday, 20 June 2023, 2:35 PM
 

Deep neural networks (DNNs) employ a large number of parameters to learn complex dependencies of outputs on inputs, but overfitting often occurs as a result.

Large DNNs are also slow to converge.

The dropout method implements the intuitive idea of randomly dropping units (along with their connections) from a network during training [SHK+14].

Fig. 1: Sample effect of applying dropout to a neural network in (a). The thinned network in (b) has units marked with a cross removed [SHK+14, Figure 1].
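The idea of randomly dropping units can be sketched as follows for a single layer's activations. This is a minimal sketch that uses the "inverted dropout" variant, in which surviving activations are scaled by 1/(1 - p) during training so that nothing needs to change at test time; [SHK+14] instead describes scaling the weights at test time. The function name `dropout` and its parameters are illustrative assumptions, not taken from any library.

```python
import random


def dropout(activations, p, training=True, rng=None):
    """Inverted dropout on a list of activations.

    Each unit is zeroed with probability p during training, and the
    surviving units are scaled by 1 / (1 - p) to preserve the expected value.
    At test time (training=False) the activations pass through unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


# Example: drop roughly half of the units in a layer of eight activations.
layer_output = [0.2, 1.5, 0.7, 0.0, 2.1, 0.9, 0.4, 1.1]
print(dropout(layer_output, p=0.5, rng=random.Random(0)))
print(dropout(layer_output, p=0.5, training=False))
```

A fresh random mask is drawn on every call, so each training step effectively trains a different thinned network.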

References

[SHK+14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 no. 56 (2014), 1929–1958. Available at http://jmlr.org/papers/v15/srivastava14a.html.

F

Few-shot learning

by Yee Wei Law - Thursday, 16 February 2023, 3:29 PM
 
Definition 1: Few-shot learning [WYKN20, Definition 2.2]

A type of machine learning problem (specified by experience $E$, task $T$ and performance measure $P$), where $E$ contains only a limited number of examples with supervised information for $T$.

References

[WYKN20] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, Generalizing from a few examples: A survey on few-shot learning, ACM Comput. Surv. 53 no. 3 (2020). https://doi.org/10.1145/3386252.

P

Problems of vanishing gradients and exploding gradients

by Yee Wei Law - Wednesday, 31 May 2023, 10:39 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight.

Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged, and training fails to converge to a good solution — this is called the vanishing gradients problem [G22, Ch. 11].

The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges — this is the exploding gradients problem [G22, Ch. 11].

Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2].

More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.
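Why gradients vanish or explode with depth can be illustrated with a deliberately simplified model. The sketch below assumes that backpropagating through each layer multiplies the gradient magnitude by the same constant factor; in a real network this factor varies per layer and depends on the weights and activation derivatives. The function name `gradient_magnitude` and the sample factors are hypothetical.

```python
def gradient_magnitude(per_layer_factor, depth):
    """Gradient magnitude after backpropagating through `depth` layers,
    assuming every layer scales the gradient by `per_layer_factor`."""
    magnitude = 1.0
    for _ in range(depth):
        magnitude *= per_layer_factor
    return magnitude


# A factor below 1 shrinks the gradient geometrically (vanishing gradients);
# a factor above 1 grows it geometrically (exploding gradients).
for depth in (5, 20, 50):
    print(depth,
          gradient_magnitude(0.5, depth),
          gradient_magnitude(1.5, depth))
```

Even modest per-layer factors compound quickly: under this assumption, the lower layers of a 50-layer network would see gradients many orders of magnitude smaller or larger than those at the output layer.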

Watch Prof Ng’s explanation of the problems:

The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].

  • The causes had been traced to the 1️⃣ usage of sigmoid activation functions, and 2️⃣ initialisation of weights to follow the zero-mean Gaussian distribution with standard deviation 1.
  • A sigmoid function saturates at 0 or 1, and when saturated, the derivative is nearly 0.
  • As a remedy, current best practices include using 1️⃣ a rectifier activation function, and 2️⃣ the weight initialisation algorithm called He initialisation.
  • He initialisation [HZRS15, Sec. 2.2]: at layer $l$, weights follow the zero-mean Gaussian distribution with variance $2/n_l$, where $n_l$ is the fan-in, or equivalently the number of inputs/weights feeding into layer $l$.
  • He initialisation is implemented by the PyTorch function kaiming_normal_ and the TensorFlow initialiser HeNormal.
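The two points above, sigmoid saturation and He initialisation, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the names `sigmoid_grad` and `he_normal` are illustrative, the weight matrix is a nested list with one row per output unit, and the code is not a substitute for kaiming_normal_ or HeNormal.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid, which saturates at 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: at most 0.25, and nearly 0 when saturated."""
    s = sigmoid(z)
    return s * (1.0 - s)


def he_normal(fan_in, fan_out, rng=None):
    """Sample a fan_out x fan_in weight matrix from a zero-mean Gaussian
    with variance 2 / fan_in, following He initialisation."""
    rng = rng or random.Random()
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


# The sigmoid derivative peaks at 0.25 and collapses for large |z|.
for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid_grad(z))

# The empirical variance of He-initialised weights should be close to 2 / fan_in.
weights = he_normal(fan_in=200, fan_out=100, rng=random.Random(0))
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(variance, 2.0 / 200)
```

Scaling the variance by the fan-in keeps the magnitude of each layer's weighted sum roughly independent of the layer width, which is the motivation given in [HZRS15] for rectifier networks.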

Watch Prof Ng’s explanation of weight initialisation:

References

[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.

