
A


Activation function: contemporary options

by Yee Wei Law - Saturday, 18 January 2025, 2:46 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

Contemporary options for the activation function are the non-saturating activation functions [Mur22, Sec. 13.4.3], although the term “non-saturating” is not accurate.

Below, $z$ should be understood as the output of the summing junction; a short PyTorch sketch at the end of the following list evaluates each of these functions on sample inputs.

  • The rectified linear unit (ReLU) [NH10] is the unipolar function:

    $\operatorname{ReLU}(z) = \max(0, z).$

    ReLU is differentiable except at $z = 0$, but by definition, $\operatorname{ReLU}'(z) = 0$ for $z \le 0$.

    ReLU has the advantage of having well-behaved derivatives, which are either 0 or 1.

    This simplifies optimisation [ZLLS23, Sec. 5.1.2.1] and mitigates the infamous vanishing gradients problem associated with traditional activation functions.

    ReLU has gained dominance since its introduction.

    ReLU is implemented by the PyTorch function ReLU.

    However, ReLU suffers from the 💀 “dying ReLU” problem during training, when some neurons stop outputting anything other than 0 [G22, Ch. 11]:

    • During training, if a neuron’s weights get updated such that the weighted sum of the neuron’s inputs is negative for all instances in the training set, the neuron will start outputting 0.
    • When this happens, the neuron is unlikely to resurrect since the gradient of the ReLU function is 0 when its input is negative.
    • In some cases, half of the neurons die, especially when a large learning rate is used.
  • The leaky ReLU (LReLU) [MHN+13] is one of the earliest extensions of ReLU:

    $\operatorname{LReLU}(z) = \max(\alpha z, z),$

    where $\alpha$ is fixed and typically set to $0.01$.

    LReLU is differentiable except at $z = 0$, but by definition, $\operatorname{LReLU}'(z) = \alpha > 0$ for $z \le 0$, thus avoiding the dying ReLU problem.

  • The parametric ReLU (PReLU) [HZRS15] extends LReLU:

    $\operatorname{PReLU}(z) = \max(\alpha z, z),$

    where $\alpha$ is a tunable parameter controlling the slope of the negative part of PReLU, and is to be learnt jointly with the model in end-to-end training.

    PReLU is implemented by the PyTorch function PReLU.

  • The exponential linear unit (ELU) [CUH16] is a smooth extension of LReLU:

    $\operatorname{ELU}_\alpha(z) = \begin{cases} z & \text{if } z > 0, \\ \alpha (e^{z} - 1) & \text{if } z \le 0, \end{cases}$

    where $\alpha > 0$ is fixed; see Fig. 1.

    ELU is implemented by the PyTorch function ELU.

    Fig. 1: A plot of the response of an ELU.
  • The scaled exponential linear unit or self-normalising ELU (SELU) [KUMH17] extends ELU:

    $\operatorname{SELU}(z) = \lambda \operatorname{ELU}_\alpha(z), \quad \text{with } \lambda \approx 1.0507, \; \alpha \approx 1.6733,$

    where $\lambda > 1$ ensures a slope larger than 1 for positive inputs; see Fig. 2.

    SELU was invented for self-normalising neural networks (SNNs), which are meant to 1️⃣ be robust to perturbations, 2️⃣ not have high variance in their training errors.

    SNNs push neuron activations to zero mean and unit variance, leading to the same effect as batch normalisation, which enables robust deep learning.

    SELU is implemented by the PyTorch function SELU.

    Fig. 2: A plot of the response of a SELU.
  • The Gaussian error linear unit (GELU) [HG20] extends ReLU and ELU:

    $\operatorname{GELU}(z) = z\, \Phi(z),$

    where $\Phi(z) = \tfrac{1}{2}\left[1 + \operatorname{erf}(z/\sqrt{2})\right]$ is the cumulative distribution function of the standard Gaussian distribution, and $\operatorname{erf}$ is the error function $\operatorname{erf}(z) = \tfrac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, \mathrm{d}t$.

    Unlike most other activation functions, GELU is not convex or monotonic; the increased curvature and non-monotonicity may allow GELUs to more easily approximate complicated functions than ReLUs or ELUs can.

    ReLU gates the input depending upon its sign, whereas GELU weights its input depending upon how much greater it is than other inputs.

    GELU is a popular choice for implementing transformers; see for example Hugging Face’s implementation of activation functions.

    GELU is implemented by the PyTorch function GELU.

    Fig. 3: A plot of the response of a GELU.
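
The following minimal PyTorch sketch (an illustration assumed here, not code from any of the cited references; the sample inputs are arbitrary) evaluates each of the activation functions above using its standard torch.nn module.

import torch
import torch.nn as nn

z = torch.linspace(-3.0, 3.0, steps=7)  # sample outputs of the summing junction

activations = {
    "ReLU": nn.ReLU(),
    "LReLU": nn.LeakyReLU(negative_slope=0.01),  # fixed alpha = 0.01
    "PReLU": nn.PReLU(init=0.25),                # alpha is learnt during training
    "ELU": nn.ELU(alpha=1.0),
    "SELU": nn.SELU(),                           # lambda ~ 1.0507, alpha ~ 1.6733 built in
    "GELU": nn.GELU(),
}

for name, act in activations.items():
    # PReLU carries a learnable parameter, hence the detach() before printing
    print(f"{name:>5}: {act(z).detach().numpy().round(3)}")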

References

[CUH16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in ICLR, 2016. Available at https://arxiv.org/abs/1511.07289.
[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[HG20] D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415, 2020, first appeared in 2016.
[KUMH17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, Self-normalizing neural networks, in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), 30, Curran Associates, Inc., 2017. Available at https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
[MHN+13] A. L. Maas, A. Y. Hannun, A. Y. Ng, and others, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 2013. Available at http://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[NH10] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, Omnipress, Madison, WI, USA, 2010, p. 807–814.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 17 Feb 2023. Available at https://d2l.ai/.


Active learning

by Yee Wei Law - Wednesday, 25 October 2023, 9:39 AM
 



Adversarial machine learning

by Yee Wei Law - Tuesday, 21 January 2025, 11:31 PM
 

Adversarial machine learning (AML) as a field can be traced back to [HJN+11].

AML is the study of 1️⃣ the capabilities of attackers and their goals, as well as the design of attack methods that exploit the vulnerabilities of ML during the ML life cycle; 2️⃣ the design of ML algorithms that can withstand these security and privacy challenges [OV24].

The impact of adversarial examples on deep learning is well known within the computer vision community, and documented in a body of literature that has been growing rapidly since Szegedy et al.’s discovery [SZS+14].

The field is moving so fast that the taxonomy, terminology and threat models are still being standardised.

See MITRE ATLAS.

References

[HJN+11] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar, Adversarial machine learning, in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 43 – 58. https://doi.org/10.1145/2046684.2046692.
[OV24] A. Oprea, A. Vassilev, A. Fordyce, and H. Anderson, Adversarial machine learning: A taxonomy and terminology of attacks and mitigations, NIST AI 100-2e2023 ipd, January 2024. https://doi.org/10.6028/NIST.AI.100-2e2023.
[SZS+14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, in International Conference on Learning Representations, 2014. Available at https://research.google/pubs/pub42503/.


Apache MXNet

by Yee Wei Law - Monday, 1 July 2024, 9:04 AM
 

Deep learning library Apache MXNet reached version 1.9.1 when it was retired in 2023.

Despite its obsolescence, there are MXNet-based projects that have not yet been ported to other libraries.

In the process of porting these projects, it is useful to be able to evaluate their performance in MXNet, and hence it is useful to be able to set up MXNet.

The problem is that MXNet’s dependencies have not been updated for a while, so installation is not as straightforward as the official installation guide makes it out to be. The installation guide here is applicable to Ubuntu 24.04 LTS on WSL2 and requires

  • NumPy version 1.23.5 (last version before 1.24, which is incompatible with MXNet),
  • Python 3.10 (as required by NumPy 1.23.5):
    conda install python=3.10 numpy=1.23.5 pip
  • CUDA Toolkit 11.8 (last version before 12),
  • cuDNN v8.9.7 (latest version applicable to CUDA 11.x, and the Ubuntu22.04 x86_64 variant works for Ubuntu 24.04),
  • NCCL 2.16.5 (latest version supporting CUDA 11.8).

After setting up all the above, do

pip install mxnet-cu112

Some warnings like this will appear but are inconsequential: cuDNN lib mismatch: linked-against version 8907 != compiled-against version 8101. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
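
To confirm the installation works end to end, a minimal sanity check along the following lines can be run (a sketch assuming a single visible GPU; not part of the official MXNet documentation).

import mxnet as mx

print("MXNet version:", mx.__version__)        # expect 1.9.1
print("GPUs visible:", mx.context.num_gpus())  # expect at least 1 under WSL2 with CUDA 11.8

# Allocate a small array on the first GPU and copy the result back to the CPU.
x = mx.nd.ones((2, 3), ctx=mx.gpu(0))
print((x * 2).asnumpy())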



Artificial neural networks and backpropagation

by Yee Wei Law - Monday, 27 January 2025, 4:45 PM
 
See 👇 attachment.


Autoencoders

by Yee Wei Law - Sunday, 19 January 2025, 10:54 AM
 

An autoencoder

References

[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 17 Feb 2023. Available at https://d2l.ai/.

B


Batch normalisation (BatchNorm)

by Yee Wei Law - Saturday, 24 June 2023, 3:32 PM
 

Watch a high-level explanation of BatchNorm:

Watch a more detailed explanation of BatchNorm by Prof Ng:

Watch coverage of BatchNorm in Stanford 2016 course CS231n Lecture 5 Part 2:

References

[IS15] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proceedings of the 32nd International Conference on Machine Learning (F. Bach and D. Blei, eds.), Proceedings of Machine Learning Research 37, PMLR, Lille, France, 07–09 Jul 2015, pp. 448–456.
[LWS+17] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, Revisiting batch normalization for practical domain adaptation, in ICLR workshop, 2017. Available at https://openreview.net/pdf?id=Hk6dkJQFx.
[STIM18] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, How does batch normalization help optimization?, in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), 31, Curran Associates, Inc., 2018. Available at https://proceedings.neurips.cc/paper_files/paper/2018/file/905056c1ac1dad141560467e0a99e1cf-Paper.pdf.
[Zha20] X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, Springer, 2020. https://doi.org/10.1007/978-981-15-2770-8.

C


Convolutional neural networks

by Yee Wei Law - Thursday, 9 January 2025, 3:31 PM
 

The convolutional neural network (ConvNet or CNN) is an evolution of the multilayer perceptron that replaces matrix multiplication with convolution in at least one layer [GBC16, Ch. 9].

A deep CNN is a CNN that has more than three hidden layers.

CNNs play an important role in the history of deep learning because [GBC16, §9.11]:

  • They exemplify successful applications of neuroscientific insights [Lin21] to machine learning.
  • CNNs are among the first neural networks to achieve commercial success; for example, AT&T used the LeNet-5 CNN to read bank checks in the 1990s [LBBH98].
  • Deep CNNs are among the first to be successfully trained with backpropagation.
  • Deep CNNs are among the first deep neural networks to achieve state-of-the-art results for challenging problems; for example, the ground-breaking performance of the 8-layer CNN called AlexNet in the 2012 ImageNet challenge [KSH17] is widely considered to be the watershed event that propelled deep learning research.
  • Deep CNNs remain pertinent for contemporary applications; for example, a combination of CNN and LSTM has been applied to predicting aircraft trajectory for assisting air crews in their decision-making during the approach phase of a flight [LPM21].

The following discusses the basic CNN structure and zooms in on the structural elements.

Structure

The core CNN structure is inspired by the visual cortex of cats.

The receptive field of a cell in the visual system may be defined as the region of retina (or visual field) over which one can influence the firing of that cell [HW62]. In a cat’s visual cortex, the majority of receptive fields can be classified as either simple or complex [HW62, LB02, Lin21]:

  • The simple cells respond to bars of light or dark when placed at specific spatial locations.

    For each cell, there is an orientation of the bar at which the cell fires the most, with its response declining as the angle of the bar changes from the optimal/preferred orientation.

    In a nutshell, the simple cells are locally sensitive and orientation-selective.

  • The complex cells have less strict response profiles.

    These cells are also sensitive to the bar’s orientation, but can respond just as strongly to a bar in several different nearby locations.

    These complex cells receive input from several simple cells, all with the same preferred orientation but with slightly different preferred locations. In other words, the response of the complex cells is shift/translation-invariant; see Fig. 1.

Fig. 1: In an experiment conducted by Hubel and Wiesel on complex cells [HW62, p. 119], a dark bar was placed against a bright background. Vigorous firing was observed regardless of the position of the bar, provided the bar was horizontal and within the receptive field (A-C). If the bar was tipped more than 10° in either direction, no firing was observed (D-E). Diagram from [HW62, Text-fig. 7].
Fig. 2: The structure on the left depicts a “neocognitron”, which is a hierarchy of afferent S-cells (simple cells) feeding into C-cells (complex cells). The S-cells have preferred locations (dashed ovals) in the image, where they respond strongly to bars of preferred orientations. The C-cells collect inputs from the S-cells and exhibit more spatially invariant responses. The structure on the right shows the core CNN structure mirroring the neocognitron, which consists of an input layer connected to a convolutional layer connected to a pooling layer. Diagram from [Lin21, Figure 1].

As a predecessor to the CNN, Fukushima’s neural network model “neocognitron” [Fuk80], as shown in Fig. 2, is a hierarchy of alternating layers of “S-cells” (modelling simple cells) and “C-cells” (modelling complex cells).

The neocognitron performs layer-wise unsupervised learning (clustering to be specific) [GBC16, §9.10], such that none of the C-cells in the last layer responds to more than one stimulus pattern [Fuk80]. Furthermore, the response is invariant to the pattern’s position and to small changes in shape or size [Fuk80].

Inspired by the neocognitron, the core CNN structure, as shown in Fig. 2, has convolution layers that mimic the behavior of S-cells, and pooling layers that mimic the behavior of C-cells.

The output of a convolution layer is called a feature map [ZLLS23, §7.2.6].

Not shown in Fig. 2 is a nonlinear activation (e.g., ReLU) layer between the convolution layer and the pooling layer; these three layers implement the three-stage processing that characterizes the CNN [GBC16, §9.3].

The nonlinear activation stage is sometimes called the detector stage [GBC16, §9.3].

Invariance to local translation is useful if detecting the presence of a feature is more important than localizing the feature.

When it was introduced, the CNN brought three architectural innovations to achieve shift/translation-invariance [LB02, GBC16]:

  1. Local receptive fields (or sparse interactions): In traditional neural networks, every output unit is connected to every input unit through matrix multiplication.

    For any element $x$ of some layer, its receptive field refers to all the elements (from all the previous layers) that may affect the calculation of $x$ during forward propagation [ZLLS23, §7.2.6].

    CNNs force the extraction of local features by restricting the receptive fields of hidden units to be local and only as large as the size of a kernel/filter (see next section). In other words, CNNs enforce sparse interactions; see Fig. 3.

    Thus, compared to traditional neural networks, CNNs 1️⃣ need less memory because there are fewer parameters/weights to store, 2️⃣ have better statistical efficiency, 3️⃣ are more computationally efficient; the parameter-count sketch after Fig. 3 illustrates the first point.

  2. Shared weights (or tied weights or weight replication or parameter sharing): This refers to using the same parameter/weight for more than one function in a model.

    More concretely, the value of a weight applied to one input is tied to the value of a weight applied elsewhere. This happens because each element of a kernel/filter is applied to every element of the input (every pixel if the input is an image), barring some boundary elements.

    In contrast, for a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer.

  3. Pooling (or subsampling): This is discussed in the last section. That is also where a complete example of a CNN is shown.
Fig. 3: The sparse connectivity of a CNN (top) vs the full connectivity of a traditional neural network (bottom). Consider $s_3$ at the top; its receptive field consists of only $x_2$, $x_3$ and $x_4$. The size of the 1D convolution kernel/filter is 3 in this case. Diagram from [GBC16, Figure 9.3].
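
The following minimal PyTorch sketch (an illustration assumed here, not taken from the cited references) contrasts the parameter count of a convolutional layer with that of a fully connected layer mapping the same input size to the same output size, using the dimensions of the first layer of the CNN in Fig. 7.

import torch.nn as nn

# First layer of the Fig. 7 example: a 1x28x28 input mapped to 4 feature maps of size 24x24.
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=5)    # shared 5x5 filters
dense = nn.Linear(in_features=28 * 28, out_features=4 * 24 * 24)  # fully connected equivalent

count = lambda m: sum(p.numel() for p in m.parameters())
print("Conv2d parameters:", count(conv))   # 4*(5*5*1) weights + 4 biases = 104
print("Linear parameters:", count(dense))  # 784*2304 weights + 2304 biases = 1,808,640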

Structural element: convolution

It has to be emphasised that convolution in this context is inspired by, but not equivalent to, convolution in linear system theory, and furthermore it exists in several variants [GBC16, Ch. 9]; it is nevertheless still a linear operation.

In signal processing, if $x(t)$ denotes a time-dependent input and $w(t)$ denotes the impulse response function of a linear system, then the output response of the system is given by the convolution (denoted by the symbol $*$) of $x$ and $w$:

$s(t) = (x * w)(t) = \int_{-\infty}^{\infty} x(a)\, w(t - a)\, \mathrm{d}a.$

In discrete time, the equation above can be written as

$s[n] = (x * w)[n] = \sum_{m = -\infty}^{\infty} x[m]\, w[n - m].$

Above, square brackets are used to distinguish discrete time from continuous time.

In machine learning (ML), $w$ is called a kernel or filter, and the output of the convolution is an example of a feature map.

For two-dimensional (2D) inputs (e.g., images), we use 2D kernels in convolution:

$S[i, j] = (X * W)[i, j] = \sum_{m} \sum_{n} X[m, n]\, W[i - m, j - n].$

Convolution works similarly to windowed (Gabor) Fourier transforms [BK19, §6.5] and wavelet transforms [Mal16].

Convolution is not to be confused with cross-correlation, but for efficiency, convolution is often implemented as cross-correlation in ML libraries. For example, PyTorch implements 2D convolution (Conv2d) as cross-correlation (denoted by the symbol $\star$):

$S[i, j] = (W \star X)[i, j] = \sum_{m} \sum_{n} X[i + m, j + n]\, W[m, n].$

Note above, the indices of $X$ are $i + m$ and $j + n$ instead of $i - m$ and $j - n$. The absence of flipping of the indices makes cross-correlation more efficient to implement.

Both convolution and cross-correlation are equivariant to shift/translation: shifting the input shifts the resulting feature map by the same amount.

Neither convolution nor cross-correlation is equivariant to changes in scale or rotation [GBC16].

Instead of convolution, CNNs typically use cross-correlation because the kernel values are learnt during training anyway, so whether or not the kernel is flipped makes no difference to what the network can learn, and omitting the flip simplifies the implementation [GBC16, §9.1].

Fig. 4 animates an example of a cross-correlation operation using an edge detector filter.

To an $n \times n$ image, applying an $f \times f$ filter at a stride of $s$ produces a feature map of size

$\left( \dfrac{n - f}{s} + 1 \right) \times \left( \dfrac{n - f}{s} + 1 \right),$

assuming $n - f$ is a multiple of $s$ [Mad21].

Fig. 4: An example of a cross-correlation operation [Mad21], where a 3x3 edge detector filter is applied to a 5x5 image. Starting from the upper left corner, the filter is slid rightwards by 1 pixel or downwards by 1 pixel at a time; in other words, the operation has a stride of 1.
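
The following minimal PyTorch sketch (the 3x3 vertical-edge-detector values are assumed for illustration, not taken from [Mad21]) reproduces a cross-correlation like the one in Fig. 4 and confirms the output size given by the formula above: a 3x3 filter slid over a 5x5 image at a stride of 1 yields a (5−3)/1 + 1 = 3 by 3 feature map.

import torch
import torch.nn.functional as F

# 5x5 grayscale image with a vertical edge, shaped as N x C x H x W.
image = torch.tensor([[0., 0., 0., 1., 1.],
                      [0., 0., 0., 1., 1.],
                      [0., 0., 0., 1., 1.],
                      [0., 0., 0., 1., 1.],
                      [0., 0., 0., 1., 1.]]).reshape(1, 1, 5, 5)

# 3x3 vertical edge detector, shaped as out_channels x in_channels x fH x fW.
edge_filter = torch.tensor([[1., 0., -1.],
                            [1., 0., -1.],
                            [1., 0., -1.]]).reshape(1, 1, 3, 3)

feature_map = F.conv2d(image, edge_filter, stride=1)  # cross-correlation: no kernel flipping
print(feature_map.shape)      # torch.Size([1, 1, 3, 3])
print(feature_map.squeeze())  # nonzero responses where the window straddles the edge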

Structural element: pooling

A pooling function replaces the output of a neural network at a certain location with a summary statistic (e.g., maximum, average) of the nearby outputs [GBC16, §9.3]. The number of these nearby outputs is the pool width. Like a convolution/cross-correlation filter, a pooling window is slid over the input one stride at a time.

Fig. 5 illustrates the maximum-pooling (max-pooling for short) operation [RP99].

Pooling helps make a feature map approximately invariant to small translations of the input. Furthermore, for many tasks, pooling is essential for handling inputs of varying size.

Fig. 6 illustrates the role of max-pooling in downsampling.
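
The following minimal PyTorch sketch (the feature-map values are assumed for illustration) mirrors Fig. 6: max-pooling with a pool width of 3 and a stride of 2 roughly halves the length of a 1D feature map.

import torch
import torch.nn as nn

detector_output = torch.tensor([[[0.1, 1.0, 0.2, 0.1, 0.0, 0.3, 0.9, 0.4]]])  # N x C x L

pool = nn.MaxPool1d(kernel_size=3, stride=2)
pooled = pool(detector_output)

print(detector_output.shape, "->", pooled.shape)  # torch.Size([1, 1, 8]) -> torch.Size([1, 1, 3])
print(pooled)  # each output is the maximum over a window of 3 adjacent inputs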

As a summary, Fig. 7 shows the structure of an example of an early CNN [LB02]. Note though that the last layers of a modern CNN are typically fully connected layers (also known as dense layers).

Fig. 5: (Top) Max-pooling is applied to the detector stage 3 units at a time and at a stride of 1. (Bottom) Even when all the units in the detector stage change in value, only one unit in the pool stage changes in value, exhibiting a high degree of invariance. Diagram from [GBC16, Figure 9.8].
Fig. 6: Like Fig. 5, a pool width of 3 is used (except for the rightmost pool), but here, the stride is 2, halving the size of the feature map. Diagram from [GBC16, Figure 9.10].
Fig. 7: An example of an early CNN using 5x5 convolution filters for recognizing handwriting. The dimension of each first-layer feature map is (28−5+1)×(28−5+1) = 24×24. For the first convolution layer, 4 different filters are used, resulting in 4 feature maps. The subsampling layers are obtained by applying 2x2 average-pooling. The 26 output units correspond to the 26 letters of the alphabet. Diagram from [LB02, Figure 1].

References

[BK19] S.L. Brunton and J.N. Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press, 2019. https://doi.org/10.1017/9781108380690.
[Fuk80] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 no. 4 (1980), 193–202. https://doi.org/10.1007/BF00344251.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at https://www.deeplearningbook.org.
[GK22] P. Grohs and G. Kutyniok, Mathematical Aspects of Deep Learning, Cambridge University Press, 2022. https://doi.org/10.1017/9781009025096.
[HW62] D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, J. Physiol. 160 (1962), 106–154. https://doi.org/10.1113/jphysiol.1962.sp006837.
[KSH17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 no. 6 (2017), 84–90, journal version of the paper with the same name that appeared in NIPS 2012. https://doi.org/10.1145/3065386.
[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 no. 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791.
[LB02] Y. LeCun and Y. Bengio, Convolutional networks for images, speech, and time series, in The Handbook of Brain Theory and Neural Networks (M. A. Arbib, ed.), MIT Press, 2nd ed., 2002, 1st edition in 1995, p. 276–279. https://doi.org/10.7551/mitpress/3413.001.0001.
[LPM21] H. Lee, T. G. Puranik, and D. N. Mavris, Deep spatio-temporal neural networks for risk prediction and decision support in aviation operations, Journal of Computing and Information Science in Engineering 21 no. 4 (2021), 041013. https://doi.org/10.1115/1.4049992.
[Lin21] G. W. Lindsay, Convolutional neural networks as a model of the visual system: Past, present, and future, Journal of Cognitive Neuroscience 33 no. 10 (2021), 2017–2031. https://doi.org/10.1162/jocn_a_01544.
[Mad21] S. Madhavan, Introduction to convolutional neural networks: Explore the different steps that go into creating a convolutional neural network, IBM Developer article, 2021. Available at https://developer.ibm.com/articles/introduction-to-convolutional-neural-networks/.
[Mal16] S. Mallat, Understanding deep convolutional networks, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 no. 2065 (2016), 20150203. https://doi.org/10.1098/rsta.2015.0203.
[RP99] M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in cortex, Nature Neuroscience 2 no. 11 (1999), 1019–1025. https://doi.org/10.1038/14819.
[SB18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.


Cross-entropy loss

by Yee Wei Law - Friday, 31 March 2023, 1:40 PM
 

[Cha19, pp. 11-14]

References

[Cha19] E. Charniak, Introduction to Deep Learning, MIT Press, 2019. Available at https://ebookcentral.proquest.com/lib/unisa/reader.action?docID=6331506.

D


Domain adaptation

by Yee Wei Law - Wednesday, 14 June 2023, 10:55 AM
 

Domain adaptation is learning a discriminative classifier or other predictor in the presence of a shift of data distribution between the source/training domain and the target/test domain [GUA+16].

References

[GUA+16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, Domain-adversarial training of neural networks, Journal of Machine Learning Research 17 no. 59 (2016), 1–35.

