
A


Activation function: contemporary options

by Yee Wei Law - Wednesday, 31 May 2023, 10:43 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

Contemporary options for the activation function are the non-saturating activation functions [Mur22, Sec. 13.4.3], although the term is not strictly accurate.

Below, $z$ should be understood as the output of the summing junction.

  • The rectified linear unit (ReLU) [NH10] is the unipolar function:

    $$\operatorname{ReLU}(z) = \max(0, z) = \begin{cases} z, & z > 0, \\ 0, & z \le 0. \end{cases}$$

    ReLU is differentiable except at $z = 0$, but by definition, $\operatorname{ReLU}'(z) = 0$ for $z \le 0$.

    ReLU has the advantage of having well-behaved derivatives, which are either 0 or 1.

    This simplifies optimisation [ZLLS23, Sec. 5.1.2.1] and mitigates the infamous vanishing gradients problem associated with traditional activation functions.

    ReLU has gained dominance since its introduction.

    ReLU is implemented by the PyTorch function ReLU (see the PyTorch sketch after this list for usage of all the activation functions listed here).

    However, ReLU suffers from the 💀 “dying ReLU” problem during training, when some neurons stop outputting anything other than 0 [G22, Ch. 11]:

    • During training, if a neuron’s weights get updated such that the weighted sum of the neuron’s inputs is negative, the neuron will start outputting 0.
    • When this happens, the neuron is unlikely to resurrect since the gradient of the ReLU function is 0 when its input is negative.
    • In some cases, half of the neurons die, especially when a large learning rate is used.
  • The leaky ReLU (LReLU) [MHN+13] is one of the earliest extensions of ReLU:

    $$\operatorname{LReLU}(z) = \begin{cases} z, & z > 0, \\ \alpha z, & z \le 0, \end{cases}$$

    where $\alpha > 0$ is fixed and typically set to a small value such as $0.01$.

    LReLU is differentiable except at $z = 0$, but by definition, $\operatorname{LReLU}'(z) = \alpha > 0$ for $z < 0$, thus avoiding the dying ReLU problem.

  • The parametric ReLU (PReLU) [HZRS15] extends LReLU:

    $$\operatorname{PReLU}(z) = \begin{cases} z, & z > 0, \\ \alpha z, & z \le 0, \end{cases}$$

    where $\alpha$ is a tunable parameter controlling the slope of the negative part of PReLU, and is to be learnt jointly with the model in end-to-end training.

    PReLU is implemented by the PyTorch function PReLU.

  • The exponential linear unit (ELU) [CUH16] is a smooth extension of LReLU:

    $$\operatorname{ELU}(z) = \begin{cases} z, & z > 0, \\ \alpha (e^{z} - 1), & z \le 0, \end{cases}$$

    where $\alpha > 0$ is fixed; see Fig. 1.

    ELU is implemented by the PyTorch function ELU.

    Fig. 1: A plot of the response of an ELU with a fixed $\alpha$.
  • The scaled exponential linear unit or self-normalising ELU (SELU) [KUMH17] extends ELU:

    $$\operatorname{SELU}(z) = \lambda \begin{cases} z, & z > 0, \\ \alpha (e^{z} - 1), & z \le 0, \end{cases}$$

    where the constants $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$ are derived in [KUMH17], and $\lambda$ ensures a slope larger than 1 for positive inputs; see Fig. 2.

    SELU was invented for self-normalising neural networks (SNNs), which are meant to 1️⃣ be robust to perturbations, 2️⃣ not have high variance in their training errors.

    SNNs push neuron activations to zero mean and unit variance, leading to the same effect as batch normalisation, which enables robust deep learning.

    SELU is implemented by the PyTorch function SELU.

    Fig. 2: A plot of the response of a SELU with the fixed $\alpha$ and $\lambda$ above.
  • The Gaussian error linear unit (GELU) [HG20] extends ReLU and ELU:

    $$\operatorname{GELU}(z) = z \, \Phi(z),$$

    where $\Phi(z) = \tfrac{1}{2}\bigl[1 + \operatorname{erf}(z / \sqrt{2})\bigr]$ is the cumulative distribution function of the standard Gaussian distribution, and $\operatorname{erf}$ is the error function $\operatorname{erf}(z) = \tfrac{2}{\sqrt{\pi}} \int_0^{z} e^{-t^2} \, \mathrm{d}t$.

    Unlike most other activation functions, GELU is neither convex nor monotonic; the increased curvature and non-monotonicity may allow GELUs to approximate complicated functions more easily than ReLUs or ELUs can.

    ReLU gates the input depending upon its sign, whereas GELU weights its input depending upon how much greater it is than other inputs.

    GELU is a popular choice for implementing transformers; see for example Hugging Face’s implementation of activation functions.

    GELU is implemented by the PyTorch function GELU.

    Fig. 3: A plot of the response of a GELU.
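
A minimal PyTorch sketch (an illustration, not from the cited references) evaluating the activation functions above at a few sample pre-activations; the constructor arguments shown are PyTorch's defaults, written out for clarity:

    import torch
    import torch.nn as nn

    z = torch.linspace(-3, 3, steps=7)  # sample outputs of the summing junction

    activations = {
        "ReLU": nn.ReLU(),
        "LReLU": nn.LeakyReLU(negative_slope=0.01),  # fixed alpha = 0.01
        "PReLU": nn.PReLU(init=0.25),                # alpha is a learnable parameter
        "ELU": nn.ELU(alpha=1.0),
        "SELU": nn.SELU(),
        "GELU": nn.GELU(),
    }

    for name, f in activations.items():
        # detach() because PReLU's learnable alpha makes its output require gradients
        print(f"{name:>5}: {f(z).detach().numpy().round(3)}")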

References

[CUH16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in ICLR, 2016. Available at https://arxiv.org/abs/1511.07289.
[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[HG20] D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415, 2020, first appeared in 2016.
[KUMH17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, Self-normalizing neural networks, in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), 30, Curran Associates, Inc., 2017. Available at https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
[MHN+13] A. L. Maas, A. Y. Hannun, A. Y. Ng, and others, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 2013. Available at http://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[NH10] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, Omnipress, Madison, WI, USA, 2010, pp. 807–814.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 17 Feb 2023. Available at https://d2l.ai/.


Active learning

by Yee Wei Law - Wednesday, 25 October 2023, 9:39 AM
 



Adversarial machine learning

by Yee Wei Law - Saturday, 18 November 2023, 4:31 PM
 

Adversarial machine learning (AML) as a field can be traced back to [HJN+11].

The impact of adversarial examples on deep learning is well known within the computer vision community, and documented in a body of literature that has been growing exponentially since Szegedy et al.’s discovery [SZS+14].

The field is moving so fast that the taxonomy, terminology and threat models are still being standardised.

See MITRE ATLAS.

References

[HJN+11] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar, Adversarial machine learning, in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 43–58. https://doi.org/10.1145/2046684.2046692.
[SZS+14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, in International Conference on Learning Representations, 2014. Available at https://research.google/pubs/pub42503/.


Artificial neural networks and backpropagation

by Yee Wei Law - Wednesday, 7 June 2023, 1:00 PM
 
See 👇 attachment or the latest source on Overleaf.

B


Batch normalisation (BatchNorm)

by Yee Wei Law - Saturday, 24 June 2023, 3:32 PM
 

Watch a high-level explanation of BatchNorm:

Watch a more detailed explanation of BatchNorm by Prof Ng:

Watch coverage of BatchNorm in Stanford 2016 course CS231n Lecture 5 Part 2:

References

[IS15] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proceedings of the 32nd International Conference on Machine Learning (F. Bach and D. Blei, eds.), Proceedings of Machine Learning Research 37, PMLR, Lille, France, 07–09 Jul 2015, pp. 448–456.
[LWS+17] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, Revisiting batch normalization for practical domain adaptation, in ICLR workshop, 2017. Available at https://openreview.net/pdf?id=Hk6dkJQFx.
[STIM18] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, How does batch normalization help optimization?, in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), 31, Curran Associates, Inc., 2018. Available at https://proceedings.neurips.cc/paper_files/paper/2018/file/905056c1ac1dad141560467e0a99e1cf-Paper.pdf.
[Zha20] X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, Springer, 2020. https://doi.org/10.1007/978-981-15-2770-8.

C


Cross-entropy loss

by Yee Wei Law - Friday, 31 March 2023, 1:40 PM
 

[Cha19, pp. 11-14]
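
For orientation, a standard formulation (stated from general knowledge rather than quoted from [Cha19]): given a ground-truth distribution $p$ and a predicted distribution $q$ over $C$ classes, the cross-entropy loss of a single example is

$$\ell(p, q) = -\sum_{c=1}^{C} p_c \log q_c,$$

which reduces to $-\log q_y$ when $p$ is a one-hot encoding of the true class $y$.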

References

[Cha19] E. Charniak, Introduction to Deep Learning, MIT Press, 2019. Available at https://ebookcentral.proquest.com/lib/unisa/reader.action?docID=6331506.

D


Domain adaptation

by Yee Wei Law - Wednesday, 14 June 2023, 10:55 AM
 

Domain adaptation is learning a discriminative classifier or other predictor in the presence of a shift of data distribution between the source/training domain and the target/test domain [GUA+16].

References

[GUA+16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky, Domain-adversarial training of neural networks, Journal of Machine Learning Research 17 no. 59 (2016), 1–35.


Dropout

by Yee Wei Law - Tuesday, 20 June 2023, 2:35 PM
 

Deep neural networks (DNNs) employ a large number of parameters to learn complex dependencies of outputs on inputs, but overfitting often occurs as a result.

Large DNNs are also slow to converge.

The dropout method implements the intuitive idea of randomly dropping units (along with their connections) from a network during training [SHK+14].

Fig. 1: Sample effect of applying dropout to a neural network in (a). The thinned network in (b) has units marked with a cross removed [SHK+14, Figure 1].
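
A minimal PyTorch sketch (an illustration, not taken from [SHK+14]) of dropout's training-time versus evaluation-time behaviour; note that PyTorch's nn.Dropout implements inverted dropout, scaling the surviving activations by 1/(1-p) during training:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    drop = nn.Dropout(p=0.5)  # each unit is dropped with probability 0.5
    x = torch.ones(1, 8)

    drop.train()              # training mode: random units are zeroed, survivors scaled by 1/(1-p)
    print(drop(x))

    drop.eval()               # evaluation mode: dropout is a no-op
    print(drop(x))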

References

[SHK+14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 no. 56 (2014), 1929–1958. Available at http://jmlr.org/papers/v15/srivastava14a.html.

F


Few-shot learning

by Yee Wei Law - Thursday, 16 February 2023, 3:29 PM
 
Definition 1: Few-shot learning [WYKN20, Definition 2.2]

A type of machine learning problem (specified by experience $E$, task $T$ and performance measure $P$), where $E$ contains only a limited number of examples with supervised information for $T$.

References

[WYKN20] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, Generalizing from a few examples: A survey on few-shot learning, ACM Comput. Surv. 53 no. 3 (2020). https://doi.org/10.1145/3386252.

P


Problems of vanishing gradients and exploding gradients

by Yee Wei Law - Wednesday, 31 May 2023, 10:39 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight.

Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged, and training fails to converge to a good solution — this is called the vanishing gradients problem [G22, Ch. 11].

The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges — this is the exploding gradients problem [G22, Ch. 11].

Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2].

More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.

Watch Prof Ng’s explanation of the problems:

The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].

  • The causes have been traced to 1️⃣ the usage of sigmoid activation functions, and 2️⃣ the initialisation of weights to follow the zero-mean Gaussian distribution with standard deviation 1.
  • A sigmoid function saturates at 0 or 1, and when saturated, the derivative is nearly 0.
  • As a remedy, current best practices include using 1️⃣ a rectifier activation function, and 2️⃣ the weight initialisation algorithm called He initialisation.
  • He initialisation [HZRS15, Sec. 2.2]: at layer $l$, weights follow the zero-mean Gaussian distribution with variance $2 / n_l$, where $n_l$ is the fan-in, or equivalently the number of inputs/weights feeding into layer $l$.
  • He initialisation is implemented by the PyTorch function kaiming_normal_ and the TensorFlow function HeNormal.
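
A minimal PyTorch sketch (the layer sizes are arbitrary assumptions) applying He initialisation, per the remedy above, to the linear layers of a ReLU network:

    import torch.nn as nn

    def init_he(module):
        """Apply He (Kaiming) initialisation to every linear layer."""
        if isinstance(module, nn.Linear):
            # fan_in mode draws weights from a zero-mean Gaussian with variance 2/fan_in,
            # matching [HZRS15, Sec. 2.2]
            nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
            nn.init.zeros_(module.bias)

    model = nn.Sequential(
        nn.Linear(784, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )
    model.apply(init_he)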

Watch Prof Ng’s explanation of weight initialisation:

References

[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.


PyTorch

by Yee Wei Law - Wednesday, 24 May 2023, 10:14 PM
 

Installation instructions:

  • Assuming the conda environment called pt (for “PyTorch”) does not yet exist, create and activate it using the commands:

    conda create -n pt
    conda activate pt
  • Install the latest version of PyTorch (version 2.0.1 as of writing) assuming the existence of a CUDA-compatible GPU:

    conda install pytorch torchvision torchaudio pytorch-cuda=11.8 matplotlib jupyterlab nbdime -c pytorch -c nvidia

    Notice the conda channels being used 👆 are pytorch and nvidia.

    Due to the total size of files to be installed, more than one installation attempt may be necessary.

    For those without a CUDA-compatible GPU, use this command instead:

    conda install pytorch torchvision torchaudio cpuonly matplotlib jupyterlab nbdime -c pytorch

  • Check if CUDA is available through PyTorch. In a Python shell, run the code:

    import torch; torch.cuda.is_available()

    If the output is True, then CUDA is available. If CUDA is available, running the command below will show you the name of your GPU:

    torch.cuda.get_device_name()
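
    Optionally, the two checks above can be combined into one short script (a convenience suggestion, not part of the official installation instructions):

    import torch

    if torch.cuda.is_available():
        print("CUDA is available:", torch.cuda.get_device_name())
    else:
        print("CUDA is not available; PyTorch will fall back to the CPU.")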

S


Self-supervised learning

by Yee Wei Law - Tuesday, 25 April 2023, 10:20 AM
 

In his 2018 talk at EPFL and his AAAI 2020 keynote speech, Turing award winner Yann LeCun referred to self-supervised learning (SSL, not to be confused with Secure Socket Layer) as an algorithm that predicts any parts of its input for any observed part.

A standard definition of SSL remains as of writing elusive, but it is characterised by [LZH+23, p. 857]:

  • derivation of labels from data through a semi-automatic process;
  • prediction of parts of the data from other parts, where “other parts” could be incomplete, transformed, distorted or corrupted (see Fig. 1).
Fig. 1: Unlike supervised and unsupervised learning, in self-supervised learning, the related, co-occurring information in “Input 2” is used to derive training labels [LZH+23, Fig. 1]. This related information can be a different modality of “Input 1”, or parts of “Input 1”, or another form of “Input 1”.

SSL can be understood as learning to recover parts or some features of the original input, hence it is also called self-supervised representation learning.

SSL has two distinct phases (see Fig. 2):

  1. unsupervised pre-training (which some authors [Mur22, Sec. 19.2.4] refer to as SSL itself), where a series of handcrafted auxiliary optimisation problems — called proxy tasks or pretext tasks — are solved to generate pseudo labels or supervisory signals from unlabelled data [LJP+22, Sec. 1]; and
  2. knowledge transfer, where the pre-trained model is fine-tuned on labelled data — not only for performance improvement but also over-fitting reduction — for downstream tasks.
Fig. 2: The two-stage pipeline of SSL [JT21, Fig. 1]. Here, the convolutional neural network, ConvNet, is only an example of a machine learning algorithm used for the pretext and downstream tasks.

Different authors classify SSL algorithms slightly differently, but based on the pretext tasks, two distinct approaches are identifiable, namely generative and contrastive (see Fig. 2); the other approaches are either a hybrid of these two approaches, namely generative-contrastive / adversarial, or something else entirely.

Fig. 3: Generative, contrastive, as well as a hybrid of generative and contrastive pre-training [LZH+23, Fig. 4]. Generative pre-training does not involve a discriminator. A contrastive discriminator is usually lightweight (e.g., a two/three-layer multilayer perceptron), hence the label “(Light)”.

In Fig. 3, the generative (or generation-based) pre-training pipeline consists of a generator that 1️⃣ uses an encoder to encode the input $x$ into an explicit vector $z$, and a decoder to reconstruct $x$ from $z$ as $\hat{x}$; and 2️⃣ is trained to minimise the reconstruction loss, which is a function of the difference between $x$ and $\hat{x}$.

In Fig. 3, the contrastive (or contrast-based) pre-training pipeline consists of two components:

  1. the generator uses an encoder to encode two versions of the input, namely $x_1$ and $x_2$, which can be related to each other through data augmentation, into two representations;
  2. the discriminator computes the contrastive loss based on the difference between the two representations, so that the generator can be trained to minimise the contrastive loss.
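
The following minimal PyTorch sketch (an illustration only; the encoder/decoder architectures, the noise-based "augmentation" and the InfoNCE-style contrastive loss are assumptions, not taken from the cited surveys) contrasts the two pre-training losses just described:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))
    decoder = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 32))

    x = torch.randn(64, 32)                 # a batch of unlabelled inputs

    # Generative pre-training: encode x to z, reconstruct x from z, minimise the reconstruction loss.
    z = encoder(x)
    x_hat = decoder(z)
    reconstruction_loss = F.mse_loss(x_hat, x)

    # Contrastive pre-training: encode two "augmented" views of each input and use an
    # InfoNCE-style loss in which matching views are positives and other samples are negatives.
    x1 = x + 0.1 * torch.randn_like(x)      # view 1 (additive-noise augmentation)
    x2 = x + 0.1 * torch.randn_like(x)      # view 2
    z1 = F.normalize(encoder(x1), dim=1)
    z2 = F.normalize(encoder(x2), dim=1)
    logits = z1 @ z2.T / 0.1                # pairwise cosine similarities, temperature 0.1
    targets = torch.arange(x.size(0))       # the i-th view-1 sample matches the i-th view-2 sample
    contrastive_loss = F.cross_entropy(logits, targets)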

Figs. 4-5 illustrate generative and contrastive pre-training in greater details using graph learning as the context.

Fig. 4: Applying generative SSL to graph learning [LJP+22, Fig. 3(a)]. “Representations” here is equivalent to $z$ in Fig. 3.
Fig. 5: Applying contrastive SSL to graph learning [LJP+22, Fig. 3(c)]. The two “Augmented Graphs” here correspond to $x_1$ and $x_2$ in Fig. 3. “Representations” here is equivalent to $z$ in Fig. 3. Where Fig. 3 and Fig. 5 disagree, take Fig. 5 to be the correct one.

Fig. 6: Classification of self-supervised learning algorithms [LZH+23, Fig. 3].

An extensive list of references on SSL can be found on GitHub.

References

[JT21] L. Jing and Y. Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 no. 11 (2021), 4037–4058. https://doi.org/10.1109/TPAMI.2020.2992393.
[KNH+22] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, Transformers in vision: A survey, ACM Comput. Surv. 54 no. 10s (2022). https://doi.org/10.1145/3505244.
[LJP+22] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, Graph self-supervised learning: A survey, IEEE Transactions on Knowledge and Data Engineering (2022), early access. https://doi.org/10.1109/TKDE.2022.3172903.
[LZH+23] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, Self-supervised learning: Generative or contrastive, IEEE Transactions on Knowledge and Data Engineering 35 no. 1 (2023), 857–876. https://doi.org/10.1109/TKDE.2021.3090866.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.


Standardising/standardisation and whitening

by Yee Wei Law - Tuesday, 20 June 2023, 7:34 AM
 

Given a dataset $\mathbf{X} \in \mathbb{R}^{N \times D}$, where $N$ denotes the number of samples and $D$ denotes the number of features, it is a common practice to preprocess $\mathbf{X}$ so that each column has zero mean and unit variance; this is called standardising the data [Mur22, Sec. 7.4.5].

Standardising forces the variance (per column) to be 1 but does not remove correlation between columns.

Decorrelation necessitates whitening.

Whitening is a linear transformation $\mathbf{z} = \mathbf{W} \mathbf{x}$ of a measurement $\mathbf{x}$ that produces a decorrelated $\mathbf{z}$ such that the covariance matrix $\operatorname{cov}(\mathbf{z}) = \mathbf{I}$, where $\mathbf{W}$ is called a whitening matrix [CPSK07, Sec. 2.5.3].

Both $\mathbf{z}$ and $\mathbf{W}$ have the same number of rows, denoted $d$, which satisfies $d \le D$; if $d < D$, then dimensionality reduction is also achieved besides whitening.

A whitening matrix can be obtained using the eigenvalue decomposition

$$\operatorname{cov}(\mathbf{x}) = \mathbf{E} \boldsymbol{\Lambda} \mathbf{E}^\top,$$

where $\mathbf{E}$ is an orthogonal matrix containing the covariance matrix’s eigenvectors as its columns, and $\boldsymbol{\Lambda}$ is the diagonal matrix of the covariance matrix’s eigenvalues [ZX09, p. 74]. Based on the decomposition, the whitening matrix can be defined as

$$\mathbf{W}_{\mathrm{pca}} = \boldsymbol{\Lambda}^{-1/2} \mathbf{E}^\top;$$

$\mathbf{W}_{\mathrm{pca}}$ above is called the PCA whitening matrix [Mur22, Sec. 7.4.5].
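
A minimal NumPy sketch (an illustration, not from the cited references) contrasting standardisation with PCA whitening on synthetic correlated data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.8], [0.8, 1.0]], size=1000)  # N x D

    # Standardising: each column gets zero mean and unit variance, but correlation remains.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # PCA whitening: W = Lambda^{-1/2} E^T from the eigendecomposition of the covariance matrix.
    cov = np.cov(X, rowvar=False)
    eigvals, E = np.linalg.eigh(cov)        # eigh because covariance matrices are symmetric
    W_pca = np.diag(eigvals ** -0.5) @ E.T
    Z = (X - X.mean(axis=0)) @ W_pca.T

    print(np.cov(X_std, rowvar=False).round(2))  # unit variances, non-zero off-diagonals
    print(np.cov(Z, rowvar=False).round(2))      # approximately the identity matrix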

References

[CPSK07] K. J. Cios, W. Pedrycz, R. W. Swiniarski, and L. A. Kurgan, Data Mining: A Knowledge Discovery Approach, Springer New York, NY, 2007. https://doi.org/10.1007/978-0-387-36795-8.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[ZX09] N. Zheng and J. Xue, Statistical Learning and Pattern Analysis for Image and Video Processing, Springer London, 2009. https://doi.org/10.1007/978-1-84882-312-9.

T


Transfer learning

by Yee Wei Law - Friday, 16 June 2023, 2:25 PM
 

References

[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[ZQD+21] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 no. 1 (2021), 43–76. https://doi.org/10.1109/JPROC.2020.3004555.


Transformer

by Yee Wei Law - Friday, 16 June 2023, 2:19 PM
 


