
A


Activation function: contemporary options

by Yee Wei Law - Saturday, 18 January 2025, 2:46 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

Contemporary options for the activation function are the non-saturating activation functions [Mur22, Sec. 13.4.3], although the term is not accurate.

Below, \(z\) should be understood as the output of the summing junction.

  • The rectified linear unit (ReLU) [NH10] is the unipolar function:

    \[ \operatorname{ReLU}(z) = \max(0, z). \]

    ReLU is differentiable except at \(z = 0\), but by definition, \(\operatorname{ReLU}'(z) = 0\) for \(z = 0\).

    ReLU has the advantage of having well-behaved derivatives, which are either 0 or 1.

    This simplifies optimisation [ZLLS23, Sec. 5.1.2.1] and mitigates the infamous vanishing gradients problem associated with traditional activation functions.

    ReLU has gained dominance since its introduction.

    ReLU is implemented by the PyTorch function ReLU.

    However, ReLU suffers from the 💀 “dying ReLU” problem during training, when some neurons stop outputting anything other than 0 [G22, Ch. 11]:

    • During training, if a neuron’s weights get updated such that the weighted sum of the neuron’s inputs is negative for all instances in the training set, the neuron will start outputting only 0.
    • When this happens, the neuron is unlikely to resurrect since the gradient of the ReLU function is 0 when its input is negative.
    • In some cases, half of the neurons die, especially when a large learning rate is used.
  • The leaky ReLU (LReLU) [MHN+13] is one of the earliest extensions of ReLU:

    \[ \operatorname{LReLU}(z) = \max(\alpha z, z) = \begin{cases} z & \text{if } z \ge 0, \\ \alpha z & \text{if } z < 0, \end{cases} \]

    where \(\alpha\) is fixed and typically set to \(0.01\).

    LReLU is differentiable except at \(z = 0\), but by definition, its derivative is \(\alpha > 0\) for \(z < 0\), thus avoiding the dying ReLU problem.

  • The parametric ReLU (PReLU) [HZRS15] extends LReLU:

    \[ \operatorname{PReLU}(z) = \max(0, z) + a \min(0, z), \]

    where \(a\) is a tunable parameter controlling the slope of the negative part of PReLU, and is to be learnt jointly with the model in end-to-end training.

    PReLU is implemented by the PyTorch function PReLU.

  • The exponential linear unit (ELU) [CUH16] is a smooth extension of LReLU:

    \[ \operatorname{ELU}(z) = \begin{cases} z & \text{if } z \ge 0, \\ \alpha (e^{z} - 1) & \text{if } z < 0, \end{cases} \]

    where \(\alpha > 0\) is fixed; see Fig. 1.

    ELU is implemented by the PyTorch function ELU.

    Fig. 1: A plot of the response of an ELU.
  • The scaled exponential linear unit or self-normalising ELU (SELU) [KUMH17] extends ELU:

    \[ \operatorname{SELU}(z) = \lambda \operatorname{ELU}(z) = \begin{cases} \lambda z & \text{if } z \ge 0, \\ \lambda \alpha (e^{z} - 1) & \text{if } z < 0, \end{cases} \]

    where the fixed scale \(\lambda > 1\) ensures a slope larger than 1 for positive inputs; see Fig. 2.

    SELU was invented for self-normalising neural networks (SNNs), which are meant to 1️⃣ be robust to perturbations, 2️⃣ not have high variance in their training errors.

    SNNs push neuron activations to zero mean and unit variance, leading to the same effect as batch normalisation, which enables robust deep learning.

    SELU is implemented by the PyTorch function SELU.

    Fig. 2: A plot of the response of a SELU.
  • The Gaussian error linear unit (GELU) [HG20] extends ReLU and ELU:

    \[ \operatorname{GELU}(z) = z \, \Phi(z), \]

    where \(\Phi\) is the cumulative distribution function of the standard Gaussian distribution, \(\Phi(z) = \tfrac{1}{2}\bigl[1 + \operatorname{erf}(z / \sqrt{2})\bigr]\), and \(\operatorname{erf}\) is the error function \(\operatorname{erf}(z) = \tfrac{2}{\sqrt{\pi}} \int_0^{z} e^{-t^2}\, dt\).

    Unlike most other activation functions, GELU is not convex or monotonic; the increased curvature and non-monotonicity may allow GELUs to more easily approximate complicated functions than ReLUs or ELUs can.

    ReLU gates the input depending upon its sign, whereas GELU weights its input depending upon how much greater it is than other inputs.

    GELU is a popular choice for implementing transformers; see for example Hugging Face’s implementation of activation functions.

    GELU is implemented by the PyTorch function GELU.

    Fig. 3: A plot of the response of a GELU.
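
The activation functions above can be compared numerically using their PyTorch implementations. The following is a minimal sketch (not taken from any of the cited references); the sample inputs and the parameter values (e.g., the LReLU slope 0.01 and the PReLU initial slope 0.25) are illustrative defaults.

import torch
import torch.nn as nn

z = torch.linspace(-3, 3, steps=7)   # sample outputs of the summing junction

activations = {
    'ReLU': nn.ReLU(),
    'LReLU': nn.LeakyReLU(negative_slope=0.01),
    'PReLU': nn.PReLU(init=0.25),    # the slope of the negative part is learnable
    'ELU': nn.ELU(alpha=1.0),
    'SELU': nn.SELU(),
    'GELU': nn.GELU(),
}

with torch.no_grad():
    for name, fn in activations.items():
        print(f'{name:>5}: {[round(v, 3) for v in fn(z).tolist()]}')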

References

[CUH16] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (ELUs), in ICLR, 2016. Available at https://arxiv.org/abs/1511.07289.
[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[HG20] D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), arXiv preprint arXiv:1606.08415, 2020, first appeared in 2016.
[KUMH17] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, Self-normalizing neural networks, in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), 30, Curran Associates, Inc., 2017. Available at https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
[MHN+13] A. L. Maas, A. Y. Hannun, A. Y. Ng, and others, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 2013. Available at http://robotics.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[NH10] V. Nair and G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, Omnipress, Madison, WI, USA, 2010, p. 807–814.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 17 Feb 2023. Available at https://d2l.ai/.


Active learning

by Yee Wei Law - Wednesday, 25 October 2023, 9:39 AM
 



Adversarial machine learning

by Yee Wei Law - Tuesday, 21 January 2025, 11:31 PM
 

Adversarial machine learning (AML) as a field can be traced back to [HJN+11].

AML is the study of 1️⃣ the capabilities of attackers and their goals, as well as the design of attack methods that exploit the vulnerabilities of ML during the ML life cycle; 2️⃣ the design of ML algorithms that can withstand these security and privacy challenges [OV24].

The impact of adversarial examples on deep learning is well known within the computer vision community, and documented in a body of literature that has been growing exponentially since Szegedy et al.’s discovery [SZS+14].

The field is moving so fast that the taxonomy, terminology and threat models are still being standardised.

See MITRE ATLAS.

References

[HJN+11] L. Huang, A. D. Joseph, B. Nelson, B. I. Rubinstein, and J. D. Tygar, Adversarial machine learning, in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 43 – 58. https://doi.org/10.1145/2046684.2046692.
[OV24] A. Oprea, A. Vassilev, A. Fordyce, and H. Anderson, Adversarial machine learning: A taxonomy and terminology of attacks and mitigations, NIST AI 100-2e2023 ipd, January 2024. https://doi.org/10.6028/NIST.AI.100-2e2023.
[SZS+14] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, in International Conference on Learning Representations, 2014. Available at https://research.google/pubs/pub42503/.


Apache MXNet

by Yee Wei Law - Monday, 1 July 2024, 9:04 AM
 

Deep learning library Apache MXNet reached version 1.9.1 when it was retired in 2023.

Despite its obsolescence, there are MXNet-based projects that have not yet been ported to other libraries.

In the process of porting these projects, it is useful to be able to evaluate their performance in MXNet, and hence it is useful to be able to set up MXNet.

The problem is that the dependencies of MXNet have not been updated for a while, and installation is not as straightforward as the installation guide makes it out to be. The installation guide here is applicable to Ubuntu 24.04 LTS on WSL2 and requires

  • NumPy version 1.23.5 (last version before 1.24, which is incompatible with MXNet),
  • Python 3.10 (as required by NumPy 1.23.5):
    conda install python=3.10 numpy=1.23.5 pip
  • CUDA Toolkit 11.8 (last version before 12),
  • cuDNN v8.9.7 (latest version applicable to CUDA 11.x, and the Ubuntu 22.04 x86_64 variant works for Ubuntu 24.04),
  • NCCL 2.16.5 (latest version supporting CUDA 11.8).

After setting up all the above, do

pip install mxnet-cu112

Some warnings like this will appear but are inconsequential: cuDNN lib mismatch: linked-against version 8907 != compiled-against version 8101. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
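
As a quick sanity check (assuming the environment above is active and that mx.context.num_gpus(), available in MXNet 1.x, behaves as expected), the following one-liner should report at least one GPU:

python -c "import mxnet as mx; print('GPUs visible to MXNet:', mx.context.num_gpus())"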



Artificial neural networks and backpropagation

by Yee Wei Law - Sunday, 4 May 2025, 10:38 PM
 
See 👇 attachment.


Autoencoders

by Yee Wei Law - Sunday, 19 January 2025, 10:54 AM
 

An autoencoder is a neural network trained to reconstruct its input at its output, typically through a lower-dimensional latent representation, so that the encoder part learns a compact representation of the data [Mur22, ZLLS23].

References

[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 17 Feb 2023. Available at https://d2l.ai/.

B


Batch normalisation (BatchNorm)

by Yee Wei Law - Saturday, 24 June 2023, 3:32 PM
 

Watch a high-level explanation of BatchNorm:

Watch a more detailed explanation of BatchNorm by Prof Ng:

Watch the coverage of BatchNorm in the Stanford 2016 course CS231n, Lecture 5 Part 2:

References

[IS15] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proceedings of the 32nd International Conference on Machine Learning (F. Bach and D. Blei, eds.), Proceedings of Machine Learning Research 37, PMLR, Lille, France, 07–09 Jul 2015, pp. 448–456.
[LWS+17] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, Revisiting batch normalization for practical domain adaptation, in ICLR workshop, 2017. Available at https://openreview.net/pdf?id=Hk6dkJQFx.
[STIM18] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, How does batch normalization help optimization?, in Advances in Neural Information Processing Systems (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), 31, Curran Associates, Inc., 2018. Available at https://proceedings.neurips.cc/paper_files/paper/2018/file/905056c1ac1dad141560467e0a99e1cf-Paper.pdf.
[Zha20] X.-D. Zhang, A Matrix Algebra Approach to Artificial Intelligence, Springer, 2020. https://doi.org/10.1007/978-981-15-2770-8.

C


Convolutional neural networks

by Yee Wei Law - Thursday, 9 January 2025, 3:31 PM
 

The convolutional neural network (ConvNet or CNN) is an evolution of the multilayer perceptron that replaces matrix multiplication with convolution in at least one layer [GBC16, Ch. 9].

A deep CNN is a CNN that has more than three hidden layers.

CNNs play an important role in the history of deep learning because [GBC16, §9.11]:

  • They exemplify successful applications of neuroscientific insights [Lin21] to machine learning.
  • CNNs are among the first neural networks to achieve commercial success; for example, AT&T used the LeNet-5 CNN to read bank checks in the 1990s [LBBH98].
  • Deep CNNs are among the first to be successfully trained with backpropagation.
  • Deep CNNs are among the first deep neural networks to achieve state-of-the-art results for challenging problems; for example, the ground-breaking performance of the 8-layer CNN called AlexNet in the 2012 ImageNet challenge [KSH17] is widely considered to be the watershed event that propelled deep learning research.
  • Deep CNNs remain pertinent for contemporary applications; for example, a combination of CNN and LSTM has been applied to predicting aircraft trajectory for assisting air crews in their decision-making during the approach phase of a flight [LPM21].

The following discusses the basic CNN structure and zooms in on the structural elements.

Structure

The core CNN structure is inspired by the visual cortex of cats.

The receptive field of a cell in the visual system may be defined as the region of retina (or visual field) over which one can influence the firing of that cell [HW62]. In a cat’s visual cortex, the majority of receptive fields can be classified as either simple or complex [HW62, LB02, Lin21]:

  • The simple cells respond to bars of light or dark when placed at specific spatial locations.

    For each cell, there is an orientation of the bar at which the cell fires the most, with its response declining as the angle of the bar changes from the optimal/preferred orientation.

    In a nutshell, the simple cells are locally sensitive and orientation-selective.

  • The complex cells have less strict response profiles.

    These cells are also sensitive to the bar’s orientation, but can respond just as strongly to a bar in several different nearby locations.

    These complex cells receive input from several simple cells, all with the same preferred orientation but with slightly different preferred locations. In other words, the response of the complex cells is shift/translation-invariant; see Fig. 1.

Fig. 1: In an experiment conducted by Hubel and Wiesel on complex cells [HW62, p. 119], a dark bar was placed against a bright background. Vigorous firing was observed regardless of the position of the bar, provided the bar was horizontal and within the receptive field (A-C). If the bar was tipped more than 10° in either direction, no firing was observed (D-E). Diagram from [HW62, Text-fig. 7].
Fig. 2: The structure on the left depicts a “neocognitron”, which is a hierarchy of afferent S-cells (simple cells) feeding into C-cells (complex cells). The S-cells have preferred locations (dashed ovals) in the image, where they respond strongly to bars of preferred orientations. The C-cells collect inputs from the S-cells and exhibit more spatially invariant responses. The structure on the right shows the core CNN structure mirroring the neocognitron, which consists of an input layer connected to a convolutional layer connected to a pooling layer. Diagram from [Lin21, Figure 1].

As a predecessor to the CNN, Fukushima’s neural network model “neocognitron” [Fuk80], as shown in Fig. 2, is a hierarchy of alternating layers of “S-cells” (modelling simple cells) and “C-cells” (modelling complex cells).

The neocognitron performs layer-wise unsupervised learning (clustering to be specific) [GBC16, §9.10], such that none of the C-cells in the last layer responds to more than one stimulus pattern [Fuk80]. Furthermore, the response is invariant to the pattern’s position and to small changes in shape or size [Fuk80].

Inspired by the neocognitron, the core CNN structure, as shown in Fig. 2, has convolution layers that mimic the behavior of S-cells, and pooling layers that mimic the behavior of C-cells.

The output of a convolution layer is called a feature map [ZLLS23, §7.2.6].

Not shown in Fig. 2 is a nonlinear activation (e.g., ReLU) layer between the convolution layer and the pooling layer; these three layers implement the three-stage processing that characterizes the CNN [GBC16, §9.3].

The nonlinear activation stage is sometimes called the detector stage [GBC16, §9.3].

Invariance to local translation is useful if detecting the presence of a feature is more important than localizing the feature.

When it was introduced, the CNN brought three architectural innovations to achieve shift/translation-invariance [LB02, GBC16]:

  1. Local receptive fields (or sparse interactions): In traditional neural networks, every output unit is connected to every input unit through matrix multiplication.

    For any element of some layer, its receptive field refers to all the elements (from all the previous layers) that may affect the calculation of that element during forward propagation [ZLLS23, §7.2.6].

    CNNs force the extraction of local features by restricting the receptive fields of hidden units to be local and only as large as the size of a kernel/filter (see next section). In other words, CNNs enforce sparse interactions; see Fig. 3.

    Thus, compared to traditional neural networks, CNNs 1️⃣ need less memory because there are fewer parameters/weights to store, 2️⃣ have better statistical efficiency, 3️⃣ are more computationally efficient.

  2. Shared weights (or tied weights or weight replication or parameter sharing): This refers to using the same parameter/weight for more than one function in a model.

    More concretely, the value of a weight applied to one input is tied to the value of a weight applied elsewhere. This happens because each element of a kernel/filter is applied to every element of the input (every pixel if the input is an image), barring some boundary elements.

    In contrast, for a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer.

  3. Pooling (or subsampling): This is discussed in the last section. That is also where a complete example of a CNN is shown.
Fig. 3: The sparse connectivity of a CNN (top) vs the full connectivity of a traditional neural network. Consider a unit at the top: its receptive field consists of only three adjacent units in the layer below, because the size of the 1D convolution kernel/filter is 3 in this case. Diagram from [GBC16, Figure 9.3].

Structural element: convolution

It has to be emphasised that the convolution in this context is inspired by, but not equivalent to, the convolution in linear system theory; furthermore, it exists in several variants [GBC16, Ch. 9]. It is nevertheless still a linear operation.

In signal processing, if \(x(t)\) denotes a time-dependent input and \(w(t)\) denotes the impulse response function of a linear system, then the output response of the system is given by the convolution (denoted by the symbol \(\ast\)) of \(x\) and \(w\):

\[ y(t) = (x \ast w)(t) = \int_{-\infty}^{\infty} x(\tau)\, w(t - \tau)\, d\tau. \]

In discrete time, the equation above can be written as

\[ y[n] = (x \ast w)[n] = \sum_{m=-\infty}^{\infty} x[m]\, w[n - m]. \]

Above, square brackets are used to distinguish discrete time from continuous time.

In machine learning (ML), \(w\) is called a kernel or filter, and the output \(y\) of the convolution is an example of a feature map.

For two-dimensional (2D) inputs (e.g., images), we use 2D kernels in convolution:

\[ Y[i, j] = (X \ast W)[i, j] = \sum_{m} \sum_{n} X[m, n]\, W[i - m, j - n]. \]

Convolution works similarly to windowed (Gabor) Fourier transforms [BK19, §6.5] and wavelet transforms [Mal16].

Convolution is not to be confused with cross-correlation, but for efficiency, convolution is often implemented as cross-correlation in ML libraries. For example, PyTorch implements 2D convolution (Conv2d) as cross-correlation (denoted by the symbol \(\star\)):

\[ Y[i, j] = (X \star W)[i, j] = \sum_{m} \sum_{n} X[i + m, j + n]\, W[m, n]. \]

Note that above, the indices of \(X\) are \(i + m\) and \(j + n\) instead of \(i - m\) and \(j - n\). The absence of index flipping makes cross-correlation cheaper to implement.

Both convolution and cross-correlation are equivariant to shift/translation.

Neither convolution nor cross-correlation is equivariant to scaling or rotation [GBC16].

Instead of convolution, CNNs typically use cross-correlation because the kernel is learnt anyway: whether or not the kernel is flipped, the learning algorithm simply learns the correspondingly flipped weights, so the distinction does not matter in practice.

Fig. 4 animates an example of a cross-correlation operation using an edge detector filter.

To an \(n \times n\) image, applying an \(m \times m\) filter at a stride of \(s\) produces a feature map of size

\[ \left( \frac{n - m}{s} + 1 \right) \times \left( \frac{n - m}{s} + 1 \right), \]

assuming \(n - m\) is a multiple of \(s\) [Mad21].

Fig. 4: An example of a cross-correlation operation [Mad21], where a 3x3 edge detector filter is applied to a 5x5 image. Starting from the upper left corner, the filter is slid rightwards by 1 pixel or downwards by 1 pixel at a time; in other words, the operation has a stride of 1.
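
To complement Fig. 4, the following is a minimal PyTorch sketch showing that conv2d computes cross-correlation and that a 3x3 filter applied to a 5x5 image at a stride of 1 yields a 3x3 feature map; the image and filter values below are illustrative, not those used in [Mad21].

import torch
import torch.nn.functional as F

# 5x5 input image and a 3x3 vertical-edge-detector filter (illustrative values)
image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)   # (N, C, H, W)
kernel = torch.tensor([[-1., 0., 1.],
                       [-1., 0., 1.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)          # (out_C, in_C, kH, kW)

# F.conv2d slides the kernel over the image without flipping it (cross-correlation),
# here at a stride of 1, producing a (5-3+1) x (5-3+1) = 3x3 feature map.
feature_map = F.conv2d(image, kernel, stride=1)
print(feature_map.shape)   # torch.Size([1, 1, 3, 3])
print(feature_map)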

Structural element: pooling

A pooling function replaces the output of a neural network at a certain location with a summary statistic (e.g., maximum, average) of the nearby outputs [GBC16, §9.3]. The number of these nearby outputs is the pool width. Like a convolution/cross-correlation filter, a pooling window is slid over the input one stride at a time.

Fig. 5 illustrates the maximum-pooling (max-pooling for short) operation [RP99].

Pooling helps make a feature map approximately invariant to small translations of the input. Furthermore, for many tasks, pooling is essential for handling inputs of varying size.

Fig. 6 illustrates the role of max-pooling in downsampling.

To summarise, Fig. 7 shows the structure of an example of an early CNN [LB02]. Note, though, that the last layers of a modern CNN are typically fully connected layers (also known as dense layers).

Fig. 5: (Top) Max-pooling is applied to the detector stage 3 units at a time and at a stride of 1. (Bottom) Even when all the units in the detector stage change in value, only one unit in the pool stage changes in value, exhibiting a high degree of invariance. Diagram from [GBC16, Figure 9.8].
Fig. 6: Like Fig. 5, a pool width of 3 is used (except for the rightmost pool), but here, the stride is 2, halving the size of the feature map. Diagram from [GBC16, Figure 9.10].
Fig. 7: An example of an early CNN using 5x5 convolution filters for recognizing handwriting. The dimension of the first feature map is 24x24 (since 28-5+1=24). For the first convolution layer, 4 different filters are used, resulting in 4 feature maps. The subsampling layers are obtained by applying 2x2 average-pooling. The 26 output units correspond to the 26 letters of the alphabet. Diagram from [LB02, Figure 1].
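
Echoing Figs. 5 and 6, the following is a minimal PyTorch sketch of max-pooling; the input values, the pool width of 3 and the strides are illustrative.

import torch
import torch.nn.functional as F

x = torch.tensor([1., 3., 2., 4., 6., 5.]).reshape(1, 1, 1, 6)   # a 1x6 "detector stage"

# Pool width 3, stride 1: an approximately translation-invariant summary (cf. Fig. 5).
print(F.max_pool2d(x, kernel_size=(1, 3), stride=1))

# Pool width 3, stride 2: additionally downsamples the feature map (cf. Fig. 6).
print(F.max_pool2d(x, kernel_size=(1, 3), stride=2))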

References

[BK19] S.L. Brunton and J.N. Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press, 2019. https://doi.org/10.1017/9781108380690.
[Fuk80] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 no. 4 (1980), 193–202. https://doi.org/10.1007/BF00344251.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at https://www.deeplearningbook.org.
[GK22] P. Grohs and G. Kutyniok, Mathematical Aspects of Deep Learning, Cambridge University Press, 2022. https://doi.org/10.1017/9781009025096.
[HW62] D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, J. Physiol. 160 (1962), 106–154. https://doi.org/10.1113/jphysiol.1962.sp006837.
[KSH17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 no. 6 (2017), 84 – 90, journal version of the paper with the same name that appeared in NIPS 2012. https://doi.org/10.1145/3065386.
[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 no. 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791.
[LB02] Y. LeCun and Y. Bengio, Convolutional networks for images, speech, and time series, in The Handbook of Brain Theory and Neural Networks (M. A. Arbib, ed.), MIT Press, 2nd ed., 2002, 1st edition in 1995, p. 276–279. https://doi.org/10.7551/mitpress/3413.001.0001.
[LPM21] H. Lee, T. G. Puranik, and D. N. Mavris, Deep spatio-temporal neural networks for risk prediction and decision support in aviation operations, Journal of Computing and Information Science in Engineering 21 no. 4 (2021), 041013. https://doi.org/10.1115/1.4049992.
[Lin21] G. W. Lindsay, Convolutional neural networks as a model of the visual system: Past, present, and future, Journal of Cognitive Neuroscience 33 no. 10 (2021), 2017–2031. https://doi.org/10.1162/jocn_a_01544.
[Mad21] S. Madhavan, Introduction to convolutional neural networks: Explore the different steps that go into creating a convolutional neural network, IBM Developer article, 2021. Available at https://developer.ibm.com/articles/introduction-to-convolutional-neural-networks/.
[Mal16] S. Mallat, Understanding deep convolutional networks, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 no. 2065 (2016), 20150203. https://doi.org/10.1098/rsta.2015.0203.
[RP99] M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in cortex, Nature Neuroscience 2 no. 11 (1999), 1019–1025. https://doi.org/10.1038/14819.
[SB18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.


Cross-entropy loss

by Yee Wei Law - Friday, 31 March 2023, 1:40 PM
 

[Cha19, pp. 11-14]
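
For reference (the notation here is generic and not necessarily that of [Cha19]), the cross-entropy loss for a single example with one-hot label \(\mathbf{y}\) and predicted class probabilities \(\hat{\mathbf{y}}\) over \(C\) classes is

\[ \ell(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{c=1}^{C} y_c \log \hat{y}_c, \]

which reduces to \(-\log \hat{y}_{c^\star}\), where \(c^\star\) is the index of the true class.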

References

[Cha19] E. Charniak, Introduction to Deep Learning, MIT Press, 2019. Available at https://ebookcentral.proquest.com/lib/unisa/reader.action?docID=6331506.

D


Domain adaptation

by Yee Wei Law - Wednesday, 14 June 2023, 10:55 AM
 

Domain adaptation is learning a discriminative classifier or other predictor in the presence of a shift of data distribution between the source/training domain and the target/test domain [GUA+16].

References

[GUA+16] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky, Domain-adversarial training of neural networks, Journal of Machine Learning Research 17 no. 59 (2016), 1–35.


Dropout

by Yee Wei Law - Tuesday, 20 June 2023, 2:35 PM
 

Deep neural networks (DNNs) employ a large number of parameters to learn complex dependencies of outputs on inputs, but overfitting often occurs as a result.

Large DNNs are also slow to converge.

The dropout method implements the intuitive idea of randomly dropping units (along with their connections) from a network during training [SHK+14].

Fig. 1: Sample effect of applying dropout to a neural network in (a). The thinned network in (b) has units marked with a cross removed [SHK+14, Figure 1].
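
A minimal PyTorch sketch of dropout (the layer sizes and drop probability below are arbitrary), showing that dropout is active only during training and is switched off at evaluation time:

import torch
import torch.nn as nn

torch.manual_seed(0)

# A small network with dropout applied to the hidden layer (p is the drop probability).
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is dropped with probability 0.5 during training
    nn.Linear(64, 2),
)

x = torch.randn(8, 20)
model.train()   # dropout active: a random "thinned" network is used per forward pass
y_train = model(x)
model.eval()    # dropout inactive: the full network is used at test time
y_test = model(x)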

References

[SHK+14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 no. 56 (2014), 1929–1958. Available at http://jmlr.org/papers/v15/srivastava14a.html.

F


Few-shot learning

by Yee Wei Law - Thursday, 16 February 2023, 3:29 PM
 
Definition 1: Few-shot learning [WYKN20, Definition 2.2]

A type of machine learning problem (specified by experience \(E\), task \(T\) and performance measure \(P\)), where \(E\) contains only a limited number of examples with supervised information for \(T\).

References

[WYKN20] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, Generalizing from a few examples: A survey on few-shot learning, ACM Comput. Surv. 53 no. 3 (2020). https://doi.org/10.1145/3386252.

I


Invariance and equivariance

by Yee Wei Law - Monday, 6 January 2025, 3:50 PM
 

A function \(f\) of an input \(x\) is invariant to a transformation \(t\) if

\[ f(t(x)) = f(x). \]

In other words, the function \(f\) is invariant to the transformation \(t\) if \(f\) produces the same output regardless of whether \(t\) has been applied to its input [Pri23, §10.1].

For example, an image classifier should be invariant to geometric transformations of an image.

A function \(f\) of an input \(x\) is equivariant or covariant to a transformation \(t\) if

\[ f(t(x)) = t(f(x)). \]

In other words, the function \(f\) is equivariant or covariant to the transformation \(t\) if the output of \(f\) is transformed in the same way by \(t\) as the input is [Pri23, §10.1].

For example, when an input image is geometrically transformed in some way, the output of an image segmentation algorithm should be transformed in the same way.

References

[Pri23] S. J. Prince, Understanding Deep Learning, MIT Press, 2023. Available at http://udlbook.com.

K


Kats by Facebook Research

by Yee Wei Law - Saturday, 29 June 2024, 11:22 PM
 

Facebook Research’s Kats has been billed as “one-stop shop for time series analysis in Python”. Kats supports standard time-series analyses, e.g., forecasting, anomaly/outlier detection.

The official installation instructions however do not work out of the box. At the time of writing, the official instructions lead to the error message “python setup.py bdist_wheel did not run successfully” due to incompatibility with the latest version of Python.

Based on community responses to the error message, and based on my personal experience, the following instructions work:

conda install python=3.7 pip setuptools ephem pystan fbprophet
pip install kats
pip install packaging==21.3

The following sample code should run error-free:

import numpy as np
import pandas as pd

from kats.consts import TimeSeriesData
from kats.detectors.cusum_detection import CUSUMDetector

# simulate time series with increase
np.random.seed(10)
df_increase = pd.DataFrame(
    {
        'time': pd.date_range('2019-01-01', '2019-03-01'),
        'increase':np.concatenate([np.random.normal(1,0.2,30), np.random.normal(2,0.2,30)]),
    }
)

# convert to TimeSeriesData object
timeseries = TimeSeriesData(df_increase)

# run detector and find change points
change_points = CUSUMDetector(timeseries).detector()

L


Long short-term memory (LSTM)

by Yee Wei Law - Friday, 31 January 2025, 3:37 PM
 

A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) designed to address the problems of vanishing gradients and exploding gradients using gradient truncation and structures called “constant error carousels” for enforcing constant (as opposed to vanishing or exploding) error flow [HS97].

LSTM solves such a fundamental problem with traditional RNNs that most of the state-of-the-art results achieved through RNNs can be attributed to LSTM [YSHZ19].

An LSTM network replaces the traditional neural network layers with LSTM layers, each of which consists of a set of recurrently connected, differentiable memory blocks [GS05].

Each LSTM block typically contains one recurrently connected memory cell, called an LSTM cell (to be distinguished from a neuron, which is also called a node or unit), but can contain multiple cells.

Fig. 1 illustrates the structure of an LSTM cell, which acts on the current input, \(\mathbf{x}_t\), and the output of the preceding LSTM cell, \(\mathbf{h}_{t-1}\).

The forget gate is a later addition [GSC00] to the original LSTM design; it determines, based on \(\mathbf{x}_t\) and \(\mathbf{h}_{t-1}\), the amount of information to be discarded from the cell state [YSHZ19]:

\[ \mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f), \qquad \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, \]

where \(\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \mathbf{x}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c)\) and \(\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{x}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)\). In the preceding equations,

  • \(\mathbf{W}_f\), \(\mathbf{U}_f\) and \(\mathbf{b}_f\) are weights and bias associated with the forget gate;
  • \(\mathbf{W}_c\), \(\mathbf{U}_c\) and \(\mathbf{b}_c\) are weights and bias associated with the cell;
  • \(\mathbf{W}_i\), \(\mathbf{U}_i\) and \(\mathbf{b}_i\) are weights and bias associated with the input gate.

When the output of the forget gate, \(\mathbf{f}_t\), is 1, all information in \(\mathbf{c}_{t-1}\) is retained, and when the output is 0, all information is discarded.

The cell output, \(\mathbf{h}_t\), is the product:

\[ \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t), \qquad \mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{x}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o), \]

where \(\mathbf{W}_o\), \(\mathbf{U}_o\) and \(\mathbf{b}_o\) are the weights and bias associated with the output gate.

Fig. 1: An LSTM block with one memory cell, which contains a forget gate acting on the current input, \(\mathbf{x}_t\), and the output of the preceding LSTM cell, \(\mathbf{h}_{t-1}\). \(\mathbf{c}_{t-1}\) and \(\mathbf{c}_t\) are the cell states for the preceding cell and current cell respectively. While the forget gate scales the cell state, the input and output gates scale the input and output of the cell respectively. The activation functions \(g\) and \(h\), also called squashing functions, are usually \(\tanh\). The multiplication represented by ⨀ is element-wise. Omitted from the diagram are the weights and biases associated with the 1️⃣ forget gate, 2️⃣ cell, 3️⃣ input gate, and 4️⃣ output gate. Diagram adapted from [VHMN20, Fig. 1], [YSHZ19, Figure 3] and [GS05, Fig. 1].

LSTM networks can be classified into two main types [YSHZ19, VHMN20]:

LSTM-dominated networks

These are neural networks with LSTM cells as the dominant building blocks.

The design of these networks focuses on optimising the interconnections of the LSTM cells.

Examples include bidirectional LSTM networks, which are extensions of bidirectional RNNs.

The original bidirectional LSTM network [GS05] uses a variation of backpropagation through time [Wer90] for training.

Integrated LSTM networks

These are hybrid neural networks consisting of LSTM and non-LSTM layers.

The design of these networks focuses on integrating the strengths of the different types of layers.

For example, convolutional layers and LSTM layers have been integrated in a wide variety of ways.

Among the many possibilities, the CNN-LSTM architecture is widely used. It can for example be used to predict residential energy consumption[KC19]:

  1. Kim and Cho’s design [KC19] consists of two convolutional-pooling layers, an LSTM layer and two fully connected (or dense) layers.
  2. The convolutional-pooling layers extract features among several variables that affect energy consumption prediction.
  3. The output of the convolutional-pooling layers is fed to the LSTM layer, after denoising, to extract temporal features. The LSTM layer can remember irregular trends.
  4. The output of the LSTM layer is fed to two fully connected layers, the second of which generates a predicted time series of energy consumption.
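
The following is a minimal PyTorch sketch of such a CNN-LSTM pipeline. It follows the spirit of Kim and Cho's design [KC19] (convolutional-pooling layers, then an LSTM layer, then fully connected layers), but the layer sizes, kernel sizes and other hyperparameters below are illustrative assumptions, not those of [KC19].

import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Convolutional-pooling layers extract local features from a multivariate time
    series, an LSTM layer extracts temporal features, and fully connected layers
    produce the prediction."""
    def __init__(self, n_features=5, hidden_size=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden_size, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):                  # x: (batch, timesteps, n_features)
        z = self.conv(x.transpose(1, 2))   # Conv1d expects (batch, channels, timesteps)
        z, _ = self.lstm(z.transpose(1, 2))
        return self.fc(z[:, -1, :])        # predict from the last timestep's output

y = CNNLSTM()(torch.randn(8, 60, 5))       # e.g., 8 windows of 60 timesteps, 5 variables
print(y.shape)                             # torch.Size([8, 1])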

References

[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at https://www.deeplearningbook.org.
[Gra12] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Springer Berlin, Heidelberg, 2012. https://doi.org/10.1007/978-3-642-24797-2.
[GSC00] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12 no. 10 (2000), 2451–2471. https://doi.org/10.1162/089976600300015015.
[GS05] A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM networks, in 2005 IEEE International Joint Conference on Neural Networks, 4, 2005, pp. 2047–2052. https://doi.org/10.1109/IJCNN.2005.1556215.
[HS97] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation 9 no. 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
[KC19] T.-Y. Kim and S.-B. Cho, Predicting residential energy consumption using CNN-LSTM neural networks, Energy 182 (2019), 72–81. https://doi.org/10.1016/j.energy.2019.05.230.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[VHMN20] G. Van Houdt, C. Mosquera, and G. Nápoles, A review on the long short-term memory model, Artificial Intelligence Review 53 no. 8 (2020), 5929–5955. https://doi.org/10.1007/s10462-020-09838-1.
[Wer90] P. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78 no. 10 (1990), 1550–1560. https://doi.org/10.1109/5.58337.
[YSHZ19] Y. Yu, X. Si, C. Hu, and J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation 31 no. 7 (2019), 1235–1270. https://doi.org/10.1162/neco_a_01199.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.

M


Machine learning (including deep learning)

by Yee Wei Law - Saturday, 17 May 2025, 5:13 PM
 

For COMP 5075 students, the Tutorial 3 page is more up-to-date.

Since the mid 2010s, advances in machine learning (ML) and particularly deep learning (DL), under the banner of artificial intelligence (AI), have been attracting not only media attention but also major capital investments.

The field of ML is decades old, but it was not until 2012, when deep neural networks (DNNs) emerged triumphant in the ImageNet image classification challenge [KSH17], that the field of ML truly took off.

DL is known to have approached or even exceeded human-level performance in many tasks.

DL techniques, especially DNN algorithms, are our main pursuit in this course, but before diving into them, we should get a clear idea about the differences among ML, DL and AI; starting with the subsequent definitions.

AI has the broadest and yet most elusive definition. There are four main schools of thought [RN22, Sec. 1.1; IBM23], namely 1️⃣ systems that think like humans, 2️⃣ systems that act like humans, 3️⃣ systems that think rationally, 4️⃣ systems that act rationally; but a sensible definition boils down to:

Definition 1: Artificial intelligence (AI) [RN22, Sec. 1.1.4]

The study and construction of rational agents that pursue their predefined objectives.

Above, a rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome [RN22, Sec. 1.1.4].

The definition of AI above is referred to as the standard model of AI [RN22, Sec. 1.1.4].

ML is a subfield of AI:

Definition 2: Machine learning (ML) [Mit97, Sec. 1.1]

A computer program or machine is said to learn from experience \(E\) with respect to some class of tasks \(T\) and performance measure \(P\), if its performance at tasks in \(T\), as measured by \(P\), improves with experience \(E\).

In the preceding definition, “experience”, “task” and “performance measure” require elaboration. Among the most common ML tasks are:

  • Classification: This is usually achieved through supervised learning (see Fig. 1), the aim of which is to learn a mapping \(f\) from the input set \(\mathcal{X}\) to the output set \(\mathcal{Y}\) [Mur22, Sec. 1.2; GBC16, Sec. 5.1.3; G22, Ch. 1], where

    • every member of \(\mathcal{X}\) is a vector of features, attributes, covariates, or predictors;
    • every member of \(\mathcal{Y}\) is a label, target, or response;
    • each pair of input \(\mathbf{x} \in \mathcal{X}\) and associated output \(y \in \mathcal{Y}\) is called an example.

    A dataset containing such input-output examples and used to “train” a model to predict/infer \(y\) given some \(\mathbf{x}\) is called a training set; this corresponds to experience \(E\) in Definition 2.

    When \(\mathcal{Y}\) is a set of unordered and mutually exclusive labels known as classes, the supervised learning task becomes a classification task.

    Classification of only two classes is called binary classification. For example, determining whether an email is spam or not is a binary classification task.

  • Fig. 1: Supervised learning [ZLLS23, Fig. 1.3.1].
    Fig. 2: An example of a regression problem, where given a “new instance”, the target value is to be determined [G22, Figure 1-6].
  • Regression: Continuing from classification, if the output set \(\mathcal{Y}\) is a continuous set of real values, rather than a discrete set, the classification task becomes a regression task.

    For example, given the features (e.g., mileage, age, brand, model) and associated price for many examples of cars, a plausible regression task is to predict the price of a car given its features; see Fig. 2.

    While the term “label” is more common in the classification context, “target” is more common in the regression context. In the earlier example, the target is the car price.

  • Clustering: This is the grouping of similar things together.

    The Euclidean distance between two feature vectors can serve as a similarity measure, but depending on the problem, other similarity measures can be more suitable. In fact, many similarity measures have been proposed in the literature [GMW07, Ch. 6].

    Clustering is a form of unsupervised learning.

    From a probabilistic viewpoint, unsupervised learning is fitting an unconditional model of the form \(p(\mathbf{x})\), which can generate new data \(\mathbf{x}\), whereas supervised learning involves fitting a conditional model, \(p(y \mid \mathbf{x})\), which specifies (a distribution over) outputs given inputs [Mur22, Sec. 1.3].

  • Anomaly detection: This is another form of unsupervised learning, and highly relevant to this course of ours.

    We first encountered anomaly detection in Tutorial 1 on intrusion detection, and we will dive deep into anomaly detection in Tutorial 5 on unsupervised learning.

Common to the aforementioned tasks is the need to measure performance. An example of a performance measure for classification is accuracy, the fraction of predictions that a model gets right [Goo22a]:

\[ \text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}. \]

Other performance measures will be investigated as part of Task 1.
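
As a toy illustration of Definition 2 and of accuracy as a performance measure, the following scikit-learn sketch trains a binary classifier and measures its accuracy on a held-out test set; the dataset and model choices below are arbitrary.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A toy binary classification task: the training set is the experience E,
# classification is the task T, and accuracy is the performance measure P.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))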

DL is in turn a subfield of ML (see Fig. 3):

Definition 3: Deep learning (DL) [RN22, Sec. 1.3.8]

Machine learning using multiple layers of simple, adjustable computing elements.

Simply put, DL is the ever expanding body of ML techniques that leverage deep architectures (algorithmic structures consisting of many levels of nonlinear operations) for learning feature hierarchies, with features from higher levels of the hierarchy formed by composition of lower-level features [Ben09].

Fig. 3: AI → ML → DL [Cop16].

The rest of this tutorial attempts to 1️⃣ shed some light on why DNNs are superior to classical ML algorithms, 2️⃣ provide a brief tutorial on the original/shallow/artificial neural networks (ANNs), and 3️⃣ provide a preview of DNNs.

The good news with the topics of this tutorial is that there is such a vast amount of learning resources in the public domain, that even if the coverage here fails to satisfy your learning needs, there must be some resources out there that can.

References

[Agg18] C. C. Aggarwal, Neural Networks and Deep Learning: A Textbook, Springer Cham, 2018, supplementary material at http://sn.pub/extras. https://doi.org/10.1007/978-3-319-94463-0.
[Ben09] Y. Bengio, Learning Deep Architectures for AI, Foundations and Trends® in Machine Learning 2 no. 1 (2009), 1–127. https://doi.org/10.1561/2200000006.
[Cop16] M. Copeland, What’s the difference between artificial intelligence, machine learning and deep learning?, NVIDIA blog, July 2016. Available at https://blogs.nvidia.com/blog/2016/07/29/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/.
[DG06] J. Davis and M. Goadrich, The Relationship between Precision-Recall and ROC Curves, in Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, Association for Computing Machinery, 2006, p. 233 – 240. https://doi.org/10.1145/1143844.1143874.
[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at http://www.deeplearningbook.org.
[GMW07] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, Society for Industrial and Applied Mathematics, 2007. https://doi.org/10.1137/1.9780898718348.
[Goo22a] Google, Classification: Accuracy, Machine Learning Crash Course, July 2022. Available at https://developers.google.com/machine-learning/crash-course/classification/accuracy.
[Goo22b] Google, Classification: Precision and Recall, Machine Learning Crash Course, July 2022. Available at https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall.
[Goo22c] Google, Classification: ROC Curve and AUC, Machine Learning Crash Course, July 2022. Available at https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc.
[IBM23] IBM, What is artificial intelligence (AI)?, IBM Topics, 2023. Available at https://www.ibm.com/topics/artificial-intelligence.
[KSH17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 no. 6 (2017), 84 – 90, journal version of the paper with the same name that appeared in the 25th International Conference on Neural Information Processing Systems in 2012. https://doi.org/10.1145/3065386.
[LL19] H. Liu and B. Lang, Machine learning and deep learning methods for intrusion detection systems: A survey, Applied Sciences 9 no. 20 (2019). https://doi.org/10.3390/app9204396.
[LXL+22] X. Li, H. Xiong, X. Li, X. Wu, X. Zhang, J. Liu, J. Bian, and D. Dou, Interpretable deep learning: interpretation, interpretability, trustworthiness, and beyond, Knowledge and Information Systems 64 no. 12 (2022), 3197–3234. https://doi.org/10.1007/s10115-022-01756-8.
[Mit97] T. C. Mitchell, Machine Learning, McGraw-Hill, 1997. Available at http://www.cs.cmu.edu/~tom/mlbook.html.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[Mur23] K. P. Murphy, Probabilistic Machine Learning: Advanced Topics, MIT Press, 2023. Available at http://probml.github.io/book2.
[NCS22] NCSC, Principles for the security of machine learning, guidance from the National Cyber Security Centre, August 2022. Available at https://www.ncsc.gov.uk/files/Principles-for-the-security-of-machine-learning.pdf.
[PG17] J. Patterson and A. Gibson, Deep Learning: A Practitioner’s Approach, O’Reilly Media, Inc., August 2017. Available at https://learning.oreilly.com/library/view/deep-learning/9781491924570/.
[RN22] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed., Pearson Education, 2022. Available at https://ebookcentral.proquest.com/lib/unisa/reader.action?docID=6563563.
[TBH+19] E. Tabassi, K. J. Burns, M. Hadjimichael, A. D. Molina-Markham, and J. T. Sexton, A taxonomy and terminology of adversarial machine learning, Draft NISTIR 8269, National Institute of Standards and Technology, 2019. https://doi.org/10.6028/NIST.IR.8269-draft.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2023, interactive online book, accessed 1 Jan 2023. Available at https://d2l.ai/.

P


Problems of vanishing gradients and exploding gradients

by Yee Wei Law - Monday, 27 January 2025, 4:46 PM
 

This knowledge base entry follows discussion of artificial neural networks and backpropagation.

The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight.

Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged, and training fails to converge to a good solution — this is called the vanishing gradients problem [G22, Ch. 11].

The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges — this is the exploding gradients problem [G22, Ch. 11].

Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2].

More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.

Watch Prof Ng’s explanation of the problems:

The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].

  • The causes had been traced to the 1️⃣ usage of sigmoid activation functions, and 2️⃣ initialisation of weights to follow the zero-mean Gaussian distribution with standard deviation 1.
  • A sigmoid function saturates at 0 or 1, and when saturated, the derivative is nearly 0.
  • As a remedy, current best practices include using 1️⃣ a rectifier activation function, and 2️⃣ the weight initialisation algorithm called He initialisation.
  • He initialisation [HZRS15, Sec. 2.2]: at layer \(l\), weights follow the zero-mean Gaussian distribution with variance \(2 / n_l\), where \(n_l\) is the fan-in, or equivalently the number of inputs/weights feeding into layer \(l\).
  • He initialisation is implemented by the PyTorch function kaiming_normal_ and the Tensorflow function HeNormal.
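
A minimal PyTorch sketch of He initialisation using kaiming_normal_ (the network layer sizes below are arbitrary):

import torch
import torch.nn as nn

def init_weights(module):
    # Apply He (Kaiming) initialisation to linear layers; biases are zeroed.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')   # variance 2/fan-in
        nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)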

Watch Prof Ng’s explanation of weight initialisation:

References

[G22] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed., O’Reilly Media, Inc., 2022. Available at https://learning.oreilly.com/library/view/hands-on-machine-learning/9781098125967/.
[HZRS15] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1026–1034. https://doi.org/10.1109/ICCV.2015.123.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.


PyTorch

by Yee Wei Law - Saturday, 31 May 2025, 2:38 PM
 

Installation instructions:

  • Assuming the conda environment called pt (for “PyTorch”) does not yet exist, create and activate it using the commands:

    conda create -n pt python=3.12
    conda activate pt
  • Install the necessary conda packages:

    conda install -c conda-forge jupyterlab jupyterlab-git lightning matplotlib nbdime pandas scikit-learn seaborn
  • Install the latest version of PyTorch (version 2.7 as of writing) through pip.

    Option 1: If you have a CUDA-capable GPU (so far, only NVIDIA GPUs qualify), use the command below, where the string cu128 indicates CUDA version 12.8. If you have a different version of CUDA, change 128 to reflect the version you have:

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

    Install the matching version of CUDA Toolkit from NVIDIA.

    Option 2: If you do not have a CUDA-capable GPU, use the command below:

    pip3 install torch torchvision torchaudio

    Due to the total size of files to be installed, more than one installation attempt may be necessary.

  • Check if CUDA is available through PyTorch by running the command below in the command line:

    python -c "import torch; print(torch.cuda.get_device_name() if torch.cuda.is_available() else 'No CUDA')"

    The command above will print the name of your GPU if the preceding installation went successfully and you do have a CUDA-capable GPU.


R


Recurrent neural networks

by Yee Wei Law - Friday, 31 January 2025, 3:17 PM
 

A recurrent neural network (RNN) is a neural network which maps an input space of sequences to an output space of sequences in a stateful way [RHW86, Mur22].

While convolutional neural networks excel at two-dimensional (2D) data, recurrent neural networks (RNNs) are better suited for one-dimensional (1D), sequential data [GBC16, §9.11].

Unlike early artificial neural networks (ANNs) which have a feedforward structure, RNNs have a cyclic structure, inspired by the cyclical connectivity of neurons; see Fig. 1.

The forward pass of an RNN is the same as that of a multilayer perceptron, except that activations arrive at a hidden layer from both the current external input and the hidden-layer activations from the previous timestep.

Fig. 1 visualises the operation of an RNN by “unfolding” or “unrolling” the network across timesteps, with the same network parameters applied at each timestep.

Note: The term “timestep” should be understood more generally as an index for sequential data.

For the backward pass, two well-known algorithms are applicable: 1️⃣ real-time recurrent learning and the simpler, computationally more efficient 2️⃣ backpropagation through time [Wer90].

Fig. 1: On the left, an RNN is often visualised as a neural network with recurrent connections. The recurrent connections should be understood, through unfolding or unrolling the network across timesteps, as applying the same network parameters to the current input and the previous state at each timestep. On the right, while the recurrent connections (blue arrows) propagate the network state over timesteps, the standard network connections (black arrows) propagate activations from one layer to the next within the same timestep. Diagram adapted from [ZLLS23, Figure 9.1].

Fig. 1 implies information flows in one direction, the direction associated with causality.

However, for many sequence labelling tasks, the correct output depends on the entire input sequence, or at least a sufficiently long input sequence. Examples of these tasks include speech recognition and language translation. Addressing the need of these tasks gave rise to bidirectional RNNs [SP97].

Standard/traditional RNNs suffer from the following deficiencies [Gra12, YSHZ19, MSO24]:

  • They are susceptible to the problems of vanishing gradients and exploding gradients.
  • They cannot store information for long periods of time.
  • Except for bidirectional RNNs, they access context information in only one direction (i.e., typically past information in the time domain).

Due to the drawbacks above, RNNs are typically used with “leaky” units enabling the networks to accumulate information over a long duration [GBC16, §10.10]. The resultant RNNs are called gated RNNs. The most successful gated RNNs are those using long short-term memory (LSTM) or gated recurrent units (GRU).
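
A minimal PyTorch sketch of the stateful sequence-to-sequence mapping described above, using the built-in nn.RNN layer (the sizes below are arbitrary):

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)   # a single recurrent layer

x = torch.randn(2, 10, 4)        # 2 sequences, 10 timesteps, 4 features per timestep
h0 = torch.zeros(1, 2, 8)        # initial hidden state

# At each timestep the same parameters combine the current input with the
# previous hidden state, as in the unrolled view of Fig. 1.
output, hn = rnn(x, h0)
print(output.shape)              # torch.Size([2, 10, 8]): hidden state at every timestep
print(hn.shape)                  # torch.Size([1, 2, 8]): final hidden state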

References

[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at https://www.deeplearningbook.org.
[Gra12] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Springer Berlin, Heidelberg, 2012. https://doi.org/10.1007/978-3-642-24797-2.
[MSO24] I. D. Mienye, T. G. Swart, and G. Obaido, Recurrent neural networks: A comprehensive review of architectures, variants, and applications, Information 15 no. 9 (2024). https://doi.org/10.3390/info15090517.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[RHW86] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning representations by back-propagating errors, Nature 323 (1986), 533–536. https://doi.org/10.1038/323533a0.
[SP97] M. Schuster and K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 no. 11 (1997), 2673–2681. https://doi.org/10.1109/78.650093.
[VHMN20] G. Van Houdt, C. Mosquera, and G. Nápoles, A review on the long short-term memory model, Artificial Intelligence Review 53 no. 8 (2020), 5929–5955. https://doi.org/10.1007/s10462-020-09838-1.
[Wer90] P. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78 no. 10 (1990), 1550–1560. https://doi.org/10.1109/5.58337.
[YSHZ19] Y. Yu, X. Si, C. Hu, and J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation 31 no. 7 (2019), 1235–1270. https://doi.org/10.1162/neco_a_01199.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.


Reinforcement learning

by Yee Wei Law - Tuesday, 18 March 2025, 9:56 AM
 

Work in progress

Reinforcement learning (RL) is a family of algorithms that learn an optimal policy, whose goal is to maximize the expected return when interacting with an environment [Goo25].

RL has existed since the 1950s [BD10], but it was the introduction of high-capacity function approximators, namely deep neural networks, that rejuvenated RL in recent years [LKTF20].

There are three main types of RL [LKTF20, FPMC24]:

  1. Online or on-policy RL: In this classic setting, an agent interacts freely with the environment.
  2. Off-policy RL: In this classic setting, an agent
  3. Offline RL:

References

[BD10] R. Bellman and S. Dreyfus, Dynamic Programming, 33, Princeton University Press, 2010. https://doi.org/10.2307/j.ctv1nxcw0f.
[BK19] S.L. Brunton and J.N. Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press, 2019. https://doi.org/10.1017/9781108380690.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at https://www.deeplearningbook.org.
[Goo25] Google, reinforcement learning (RL), Machine Learning Glossary, 2025, accessed 3 Jan 2025. Available at https://developers.google.com/machine-learning/glossary#reinforcement-learning-rl.
[LKTF20] S. Levine, A. Kumar, G. Tucker, and J. Fu, Offline reinforcement learning: Tutorial, review, and perspectives on open problems, arXiv preprint arXiv:2005.01643, 2020. https://doi.org/10.48550/arXiv.2005.01643.
[FPMC24] R. Figueiredo Prudencio, M.R.O.A. Maximo, and E.L. Colombini, A survey on offline reinforcement learning: Taxonomy, review, and open problems, IEEE Transactions on Neural Networks and Learning Systems 35 no. 8 (2024), 10237–10257. https://doi.org/10.1109/TNNLS.2023.3250269.
[SB18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.

S


Self-supervised learning

by Yee Wei Law - Tuesday, 25 April 2023, 10:20 AM
 

In his 2018 talk at EPFL and his AAAI 2020 keynote speech, Turing award winner Yann LeCun referred to self-supervised learning (SSL, not to be confused with Secure Socket Layer) as an algorithm that predicts any parts of its input for any observed part.

A standard definition of SSL remains as of writing elusive, but it is characterised by [LZH+23, p. 857]:

  • derivation of labels from data through a semi-automatic process;
  • prediction of parts of the data from other parts, where “other parts” could be incomplete, transformed, distorted or corrupted (see Fig. 1).
Fig. 1: Unlike supervised and unsupervised learning, in self-supervised learning, the related, co-occurring information in “Input 2” is used to derive training labels [LZH+23, Fig. 1]. This related information can be a different modality of “Input 1”, or parts of “Input 1”, or another form of “Input 1”.

SSL can be understood as learning to recover parts or some features of the original input, hence it is also called self-supervised representation learning.

SSL has two distinct phases (see Fig. 2):

  1. unsupervised pre-training (which some authors [Mur22, Sec. 19.2.4] refer to as SSL itself), where a series of handcrafted auxiliary optimisation problems — called proxy tasks or pretext tasks — are solved to generate pseudo labels or supervisory signals from unlabelled data [LJP+22, Sec. 1]; and
  2. knowledge transfer, where the pre-trained model is fine-tuned on labelled data — not only for performance improvement but also over-fitting reduction — for downstream tasks.
Fig. 2: The two-stage pipeline of SSL [JT21, Fig. 1]. Here, the convolutional neural network, ConvNet, is only an example of a machine learning algorithm used for the pretext and downstream tasks.

Different authors classify SSL algorithms slightly differently, but based on the pretext tasks, two distinct approaches are identifiable, namely generative and contrastive (see Fig. 2); the other approaches are either a hybrid of these two approaches, namely generative-contrastive / adversarial, or something else entirely.

Fig. 3: Generative, contrastive, as well as a hybrid of generative and contrastive pre-training [LZH+23, Fig. 4]. Generative pre-training does not involve a discriminator. A contrastive discriminator is usually lightweight (e.g., a two/three-layer multilayer perceptron), hence the label “(Light)”.

In Fig. 3, the generative (or generation-based) pre-training pipeline consists of a generator that 1️⃣ uses an encoder to encode input \(\mathbf{x}\) into an explicit vector \(\mathbf{z}\), and a decoder to reconstruct \(\mathbf{x}\) from \(\mathbf{z}\) as \(\hat{\mathbf{x}}\); 2️⃣ is trained to minimise the reconstruction loss, which is a function of the difference between \(\mathbf{x}\) and \(\hat{\mathbf{x}}\).
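
A minimal PyTorch sketch of such a generative pre-training step (the layer sizes and the mean-squared-error reconstruction loss below are illustrative choices, not prescribed by [LZH+23]):

import torch
import torch.nn as nn

# Generator: an encoder maps input x to a latent vector z; a decoder reconstructs x from z.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))

x = torch.rand(32, 784)                      # a batch of unlabelled inputs
z = encoder(x)                               # explicit latent vector z
x_hat = decoder(z)                           # reconstruction of x

# Reconstruction loss: a function of the difference between x and its reconstruction.
loss = nn.functional.mse_loss(x_hat, x)
loss.backward()                              # train the generator to minimise it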

In Fig. 3, the contrastive (or contrast-based) pre-training pipeline consists of two components:

  1. the generator uses an encoder to encode two versions of the input, namely \(\mathbf{x}_1\) and \(\mathbf{x}_2\), which can be related to each other through data augmentation, into two representations;
  2. the discriminator computes the contrastive loss based on the difference between the two representations, so that the generator can be trained to minimise the contrastive loss.

Figs. 4-5 illustrate generative and contrastive pre-training in greater details using graph learning as the context.

Fig. 4: Applying generative SSL to graph learning [LJP+22, Fig. 3(a)]. “Representations” here is equivalent to \(\mathbf{z}\) in Fig. 3.
Fig. 5: Applying contrastive SSL to graph learning [LJP+22, Fig. 3(c)]. The two “Augmented Graphs” here correspond to \(\mathbf{x}_1\) and \(\mathbf{x}_2\) in Fig. 3. “Representations” here is equivalent to \(\mathbf{z}\) in Fig. 3. To resolve discrepancies between Fig. 3 and Fig. 5, consider Fig. 5 to be right.

Fig. 6: Classification of self-supervised learning algorithms [LZH+23, Fig. 3].

An extensive list of references on SSL can be found on GitHub.

References

[JT21] L. Jing and Y. Tian, Self-supervised visual feature learning with deep neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 43 no. 11 (2021), 4037–4058. https://doi.org/10.1109/TPAMI.2020.2992393.
[KNH+22] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, Transformers in vision: A survey, ACM Comput. Surv. 54 no. 10s (2022). https://doi.org/10.1145/3505244.
[LJP+22] Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, Graph self-supervised learning: A survey, IEEE Transactions on Knowledge and Data Engineering (2022), early access. https://doi.org/10.1109/TKDE.2022.3172903.
[LZH+23] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, Self-supervised learning: Generative or contrastive, IEEE Transactions on Knowledge and Data Engineering 35 no. 1 (2023), 857–876. https://doi.org/10.1109/TKDE.2021.3090866.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.


Standardising/standardisation and whitening

by Yee Wei Law - Tuesday, 20 June 2023, 7:34 AM
 

Given a dataset \(\mathbf{X} \in \mathbb{R}^{N \times D}\), where \(N\) denotes the number of samples and \(D\) denotes the number of features, it is a common practice to preprocess \(\mathbf{X}\) so that each column has zero mean and unit variance; this is called standardising the data [Mur22, Sec. 7.4.5].

Standardising forces the variance (per column) to be 1 but does not remove correlation between columns.

Decorrelation necessitates whitening.

Whitening is a linear transformation \(\mathbf{z} = \mathbf{W} \mathbf{x}\) of a measurement \(\mathbf{x}\) that produces a decorrelated \(\mathbf{z}\) such that the covariance matrix \(\operatorname{cov}(\mathbf{z}) = \mathbf{I}\), where \(\mathbf{W}\) is called a whitening matrix [CPSK07, Sec. 2.5.3].

The whitened vector \(\mathbf{z}\) and the whitening matrix \(\mathbf{W}\) have the same number of rows, denoted \(d\), which satisfies \(d \le D\); if \(d < D\), then dimensionality reduction is also achieved besides whitening.

A whitening matrix can be obtained using the eigenvalue decomposition of the covariance matrix of \(\mathbf{x}\):

\[ \operatorname{cov}(\mathbf{x}) = \mathbf{E} \boldsymbol{\Lambda} \mathbf{E}^{\top}, \]

where \(\mathbf{E}\) is an orthogonal matrix containing the covariance matrix’s eigenvectors as its columns, and \(\boldsymbol{\Lambda}\) is the diagonal matrix of the covariance matrix’s eigenvalues [ZX09, p. 74]. Based on the decomposition, the whitening matrix can be defined as

\[ \mathbf{W}_{\mathrm{pca}} = \boldsymbol{\Lambda}^{-1/2} \mathbf{E}^{\top}. \]

\(\mathbf{W}_{\mathrm{pca}}\) above is called the PCA whitening matrix [Mur22, Sec. 7.4.5].
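
A minimal NumPy sketch of PCA whitening as defined above (the toy covariance used to generate the data is illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.2], [1.2, 1.0]], size=1000)  # N x D

Xc = X - X.mean(axis=0)                       # centre the data
C = np.cov(Xc, rowvar=False)                  # D x D covariance matrix

eigvals, E = np.linalg.eigh(C)                # C = E diag(eigvals) E^T
W_pca = np.diag(eigvals ** -0.5) @ E.T        # PCA whitening matrix

Z = Xc @ W_pca.T                              # whitened data
print(np.cov(Z, rowvar=False).round(3))       # approximately the identity matrix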

References

[CPSK07] K. J. Cios, W. Pedrycz, R. W. Swiniarski, and L. A. Kurgan, Data Mining: A Knowledge Discovery Approach, Springer New York, NY, 2007. https://doi.org/10.1007/978-0-387-36795-8.
[Mur22] K. P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press, 2022. Available at http://probml.ai.
[ZX09] N. Zheng and J. Xue, Statistical Learning and Pattern Analysis for Image and Video Processing, Springer London, 2009. https://doi.org/10.1007/978-1-84882-312-9.

T


Transfer learning

by Yee Wei Law - Friday, 16 June 2023, 2:25 PM
 

References

[Mur22] K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[ZQD+21] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 no. 1 (2021), 43–76. https://doi.org/10.1109/JPROC.2020.3004555.


Transformer and attention

by Yee Wei Law - Sunday, 2 February 2025, 3:53 PM
 

References

[BCB15] D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, in ICLR, 2015. https://doi.org/10.48550/arXiv.1409.0473.
[BB24] C. M. Bishop and H. Bishop, Deep Learning: Foundations and Concepts, Springer Cham, 2024. https://doi.org/10.1007/978-3-031-45468-4.
[VSP+17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. U. Kaiser, and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), 30, Curran Associates, Inc., 2017. Available at https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

