A |
---|
Activation function: contemporary options | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
This knowledge base entry follows the discussion of artificial neural networks and backpropagation. Contemporary options for the activation function are the non-saturating activation functions [Mur22, Sec. 13.4.3], although the term is not strictly accurate (ReLU, for example, still saturates for negative inputs). Below, the input to the activation function should be understood as the output of the summing junction.
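As a rough illustration of what "non-saturating" means in practice, the sketch below implements three commonly used activation functions (ReLU, leaky ReLU and ELU) next to the saturating logistic sigmoid. The choice of these three functions and the default values of `alpha` are assumptions made for illustration; they are not taken from the original entry.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: flattens out (saturates) for large |z|."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit: identity for z > 0, zero otherwise."""
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z


def elu(z, alpha=1.0):
    """Exponential linear unit: smooth for z < 0, identity for z > 0."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def slope(f, z, eps=1e-6):
    """Central-difference estimate of df/dz at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


# Compare the local slopes far from the origin.
for z in (-10.0, 10.0):
    slopes = {f.__name__: slope(f, z) for f in (sigmoid, relu, leaky_relu, elu)}
    print(z, slopes)
```

At z = 10 the sigmoid's slope is close to zero while the other three keep a slope of 1. At z = -10 the slope of ReLU is exactly zero and that of ELU is close to zero, which is one way to see why the label "non-saturating" is loose.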
References
| ||||||||||||||||||||
Active learning | ||
---|---|---|
Adversarial machine learning | ||||||||
---|---|---|---|---|---|---|---|---|
Adversarial machine learning (AML) as a field can be traced back to [HJN+11]. AML is the study of 1️⃣ the capabilities of attackers and their goals, as well as the design of attack methods that exploit the vulnerabilities of ML during the ML life cycle; 2️⃣ the design of ML algorithms that can withstand these security and privacy challenges [OV24]. The impact of adversarial examples on deep learning is well known within the computer vision community, and documented in a body of literature that has been growing exponentially since Szegedy et al.’s discovery [SZS+14]. The field is moving so fast that the taxonomy, terminology and threat models are still being standardised. See MITRE ATLAS. References
| ||||||||
Apache MXNet | ||
---|---|---|
Deep learning library Apache MXNet reached version 1.9.1 when it was retired in 2023. Despite its obsolescence, there are MXNet-based projects that have not yet been ported to other libraries. When porting these projects, it is useful to be able to evaluate their performance in MXNet, and hence to be able to set up MXNet. The problem is that MXNet's dependencies have not been updated for a while, so installation is not as straightforward as the official installation guide makes it out to be. The installation guide here is applicable to Ubuntu 24.04 LTS on WSL2 and requires
After setting up all the above, run pip install mxnet-cu112
Some warnings like this will appear but are inconsequential: cuDNN lib mismatch: linked-against version 8907 != compiled-against version 8101. Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning. | ||
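Once the package is installed, a quick sanity check along the following lines may help confirm that MXNet imports and can see a GPU. This is a sketch rather than part of the original instructions: it assumes the `mxnet-cu112` build mentioned above, sets the environment variable named in the warning before importing (on the assumption that the check runs when the library initialises), and simply reports when MXNet is absent.

```python
import os

# Assumption: the variable needs to be in place before MXNet is imported.
os.environ["MXNET_CUDNN_LIB_CHECKING"] = "0"

try:
    import mxnet as mx
except ImportError:
    mx = None
    print("MXNet is not installed in this environment.")

if mx is not None:
    print("MXNet version:", mx.__version__)
    n_gpus = mx.context.num_gpus()
    print("GPUs visible to MXNet:", n_gpus)
    # Fall back to the CPU when no GPU is visible.
    ctx = mx.gpu(0) if n_gpus > 0 else mx.cpu()
    # A tiny computation on the chosen device.
    total = mx.nd.ones((2, 2), ctx=ctx).sum().asscalar()
    print("Sum of a 2x2 array of ones:", total)
```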
Artificial neural networks and backpropagation | |||
---|---|---|---|
See 👇 attachment.
| |||
Autoencoders | ||||||
---|---|---|---|---|---|---|
An autoencoder References
| ||||||
B |
---|
Batch normalisation (BatchNorm) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Watch a high-level explanation of BatchNorm: Watch more detailed explanation of BatchNorm by Prof Ng: Watch coverage of BatchNorm in Stanford 2016 course CS231n Lecture 5 Part 2: References
| ||||||||||
C |
---|
Convolutional neural networks | ||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
The convolutional neural network (ConvNet or CNN) is an evolution of the multilayer perceptron that replaces matrix multiplication with convolution in at least one layer[GBC16, Ch. 9]. A deep CNN is a CNN that has more than three hidden layers. CNNs play an important role in the history of deep learning because[GBC16, §9.11]:
The following discusses the basic CNN structure and zooms in on the structural elements.
Structure
The core CNN structure is inspired by the visual cortex of cats. The receptive field of a cell in the visual system may be defined as the region of retina (or visual field) over which one can influence the firing of that cell [HW62]. In a cat’s visual cortex, the majority of receptive fields can be classified as either simple or complex [HW62, LB02, Lin21]:
As a predecessor to the CNN, Fukushima’s neural network model “neocognitron” [Fuk80], as shown in Fig. 2, is a hierarchy of alternating layers of “S-cells” (modelling simple cells) and “C-cells” (modelling complex cells). The neocognitron performs layer-wise unsupervised learning (clustering, to be specific) [GBC16, §9.10], such that none of the C-cells in the last layer responds to more than one stimulus pattern [Fuk80]. Furthermore, the response is invariant to the pattern’s position and to small changes in its shape or size [Fuk80].
Inspired by the neocognitron, the core CNN structure, as shown in Fig. 2, has convolution layers that mimic the behaviour of S-cells, and pooling layers that mimic the behaviour of C-cells. The output of a convolution layer is called a feature map [ZLLS23, §7.2.6]. Not shown in Fig. 2 is a nonlinear activation (e.g., ReLU) layer between the convolution layer and the pooling layer; these three layers implement the three-stage processing that characterises the CNN [GBC16, §9.3]. The nonlinear activation stage is sometimes called the detector stage [GBC16, §9.3]. Invariance to local translation is useful if detecting the presence of a feature is more important than localising the feature.
When it was introduced, the CNN brought three architectural innovations to achieve shift/translation-invariance [LB02, GBC16]:
Structural element: convolution
It has to be emphasised that the convolution in this context is inspired by, but not equivalent to, the convolution in linear system theory. Furthermore, it exists in several variants [GBC16, Ch. 9], but it is still a linear operation.
In signal processing, if $x(t)$ denotes a time-dependent input and $h(t)$ denotes the impulse response function of a linear system, then the output response $y(t)$ of the system is given by the convolution (denoted by symbol $*$) of $x$ and $h$:
$$y(t) = (x * h)(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau.$$
In discrete time, the equation above can be written as
$$y[n] = (x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n - k].$$
Above, square brackets are used to distinguish discrete time from continuous time. In machine learning (ML), $h$ is called a kernel or filter and the output of convolution is an example of a feature map. For two-dimensional (2D) inputs (e.g., images), we use 2D kernels in convolution:
$$y[i, j] = (x * h)[i, j] = \sum_{m} \sum_{n} x[m, n]\, h[i - m, j - n] = \sum_{m} \sum_{n} x[i - m, j - n]\, h[m, n].$$
Convolution works similarly to windowed (Gabor) Fourier transforms [BK19, §6.5] and wavelet transforms [Mal16].
Convolution is not to be confused with cross-correlation, but for efficiency, convolution is often implemented as cross-correlation in ML libraries. For example, PyTorch implements 2D convolution as cross-correlation:
$$y[i, j] = \sum_{m} \sum_{n} x[i + m, j + n]\, h[m, n].$$
Note above, the indices of $x$ are $i + m$ and $j + n$ instead of $i - m$ and $j - n$. The absence of flipping of the indices makes cross-correlation more efficient. Both convolution and cross-correlation are equivariant to shift/translation: shifting the input shifts the output by the same amount (this is the property that linear system theory calls shift-invariance). Neither convolution nor cross-correlation is scale- or rotation-invariant [GBC16]. Instead of convolution, CNNs typically use cross-correlation because ...
Fig. 4 animates an example of a cross-correlation operation using an edge detector filter. To an $n \times n$ image, applying an $f \times f$ filter at a stride of $s$ produces a feature map of size $\left(\frac{n - f}{s} + 1\right) \times \left(\frac{n - f}{s} + 1\right)$, assuming $n - f$ is a multiple of $s$ [Mad21].
Structural element: pooling
A pooling function replaces the output of a neural network at a certain location with a summary statistic (e.g., maximum, average) of the nearby outputs [GBC16, §9.3]. The number of these nearby outputs is the pool width.
Like a convolution/cross-correlation filter, a pooling window is slid over the input one stride at a time. Fig. 5 illustrates the maximum-pooling (max-pooling for short) operation [RP99]. Pooling helps make a feature map approximately invariant to small translations of the input. Furthermore, for many tasks, pooling is essential for handling inputs of varying size. Fig. 6 illustrates the role of max-pooling in downsampling.
In summary, Fig. 7 shows the structure of an example of an early CNN [LB02]. Note, though, that the last layers of a modern CNN are typically fully connected layers (also known as dense layers).
References
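The cross-correlation, convolution and max-pooling operations described in this entry can be sketched in plain Python as follows. This is a minimal illustration, assuming no padding and nested lists as inputs; the function names and the toy image and filter are hypothetical and are not taken from the figures.

```python
def cross_correlate2d(image, kernel, stride=1):
    """2D cross-correlation without padding: the kernel is NOT flipped."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - k_rows + 1, stride):
        row = []
        for j in range(0, len(image[0]) - k_cols + 1, stride):
            row.append(sum(image[i + m][j + n] * kernel[m][n]
                           for m in range(k_rows) for n in range(k_cols)))
        out.append(row)
    return out


def convolve2d(image, kernel, stride=1):
    """True convolution: cross-correlation with the kernel flipped in both axes."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(image, flipped, stride)


def max_pool2d(feature_map, width=2, stride=2):
    """Replace each width-by-width window with its maximum."""
    out = []
    for i in range(0, len(feature_map) - width + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - width + 1, stride):
            row.append(max(feature_map[i + m][j + n]
                           for m in range(width) for n in range(width)))
        out.append(row)
    return out


# A 4x4 image with a vertical edge, and a 1x2 horizontal-difference filter.
image = [[0, 0, 1, 1] for _ in range(4)]
edge = [[-1, 1]]
xcorr = cross_correlate2d(image, edge)   # responds with +1 at the edge
conv = convolve2d(image, edge)           # flipped kernel: responds with -1

# Feature-map size check: a 5x5 input, 3x3 filter and stride 2 give 2x2.
ones5 = [[1] * 5 for _ in range(5)]
ones3 = [[1] * 3 for _ in range(3)]
strided = cross_correlate2d(ones5, ones3, stride=2)

# Max-pooling with width 2 and stride 2 halves each spatial dimension.
pooled = max_pool2d([[1, 2, 3, 4], [5, 6, 7, 8],
                     [9, 10, 11, 12], [13, 14, 15, 16]])
print(xcorr[0], conv[0], strided, pooled)
```

The sign difference between `xcorr` and `conv` is the effect of kernel flipping; for a learned kernel the distinction is immaterial, because the network can simply learn the flipped weights.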
| ||||||||||||||||||||||||||||||||
Cross-entropy loss | ||||
---|---|---|---|---|
[Cha19, pp. 11-14] References
| ||||
D |
---|
Domain adaptation | |||
---|---|---|---|
Domain adaptation is learning a discriminative classifier or other predictor in the presence of a shift of data distribution between the source/training domain and the target/test domain [GUA+16]. References | |||
Dropout | ||||
---|---|---|---|---|
Deep neural networks (DNNs) employ a large number of parameters to learn complex dependencies of outputs on inputs, but overfitting often occurs as a result. Large DNNs are also slow to converge. The dropout method implements the intuitive idea of randomly dropping units (along with their connections) from a network during training [SHK+14]. References
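The idea of randomly dropping units can be sketched as below. Note the assumption: the sketch uses the "inverted dropout" variant, which rescales the surviving activations during training so that nothing needs to change at test time, whereas [SHK+14] describes scaling the weights at test time. The function name and arguments are illustrative.

```python
import random


def dropout(activations, drop_prob, training=True, rng=None):
    """Inverted dropout on a list of activations (an assumed variant)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At test time every unit is kept and no rescaling is needed.
        return list(activations)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    # Keep each unit with probability keep_prob and rescale the survivors
    # so that the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


hidden = [1.0] * 10
dropped = dropout(hidden, drop_prob=0.5, rng=random.Random(0))
unchanged = dropout(hidden, drop_prob=0.5, training=False)
print(dropped)
print(unchanged)
```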
| ||||
F |
---|
Few-shot learning | ||||||
---|---|---|---|---|---|---|
References
| ||||||
I |
---|
Invariance and equivariance | ||||
---|---|---|---|---|
A function $f$ of an input $\mathbf{x}$ is invariant to a transformation $t$ if
$$f(t(\mathbf{x})) = f(\mathbf{x}).$$
In other words, function $f$ is invariant to transformation $t$ if $f$ produces the same output regardless of whether $t$ has been applied to the input [Pri23, §10.1]. For example, an image classifier should be invariant to geometric transformations of an image.
A function $f$ of an input $\mathbf{x}$ is equivariant or covariant to a transformation $t$ if
$$f(t(\mathbf{x})) = t(f(\mathbf{x})).$$
In other words, function $f$ is equivariant or covariant to transformation $t$ if the output of $f$ changes in the same way under $t$ as the input [Pri23, §10.1]. For example, when an input image is geometrically transformed in some way, the output of an image segmentation algorithm should be transformed in the same way.
References
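A toy check of the two definitions, using a circular shift as the transformation. The functions `total` and `double` are hypothetical stand-ins chosen for illustration: summation is invariant to the shift, while element-wise doubling is equivariant to it.

```python
def shift(xs, k=1):
    """Transformation t: circular shift of a list by k positions."""
    k %= len(xs)
    return xs[-k:] + xs[:-k] if k else list(xs)


def total(xs):
    """Invariant to shifts: the sum ignores element order."""
    return sum(xs)


def double(xs):
    """Equivariant to shifts: acts on each element independently."""
    return [2 * x for x in xs]


x = [1, 2, 3, 4]

# Invariance: f(t(x)) == f(x)
is_invariant = total(shift(x)) == total(x)

# Equivariance: f(t(x)) == t(f(x))
is_equivariant = double(shift(x)) == shift(double(x))

# Doubling is equivariant but not invariant to the shift.
double_is_invariant = double(shift(x)) == double(x)

print(is_invariant, is_equivariant, double_is_invariant)
```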
| ||||
K |
---|
Kats by Facebook Research | ||
---|---|---|
Facebook Research’s Kats has been billed as “one-stop shop for time series analysis in Python”. Kats supports standard time-series analyses, e.g., forecasting, anomaly/outlier detection. The official installation instructions however do not work out of the box. At the time of writing, the official instructions lead to the error message “python setup.py bdist_wheel did not run successfully” due to incompatibility with the latest version of Python. Based on community responses to the error message, and based on my personal experience, the following instructions work:
The following sample code should run error-free:
| ||
L |
---|
Long short-term memory (LSTM) | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) designed to address the problems of vanishing gradients and exploding gradients using gradient truncation and structures called “constant error carousels” for enforcing constant (as opposed to vanishing or exploding) error flow [HS97]. LSTM solves such a fundamental problem with traditional RNNs that most of the state-of-the-art results achieved through RNNs can be attributed to LSTM [YSHZ19].
An LSTM network replaces the traditional neural network layers with LSTM layers, each of which consists of a set of recurrently connected, differentiable memory blocks [GS05]. Each LSTM block typically contains one recurrently connected memory cell, called an LSTM cell (to be distinguished from a neuron, which is also called a node or unit), but can contain multiple cells.
Fig. 1 illustrates the structure of an LSTM cell, which acts on the current input, $\mathbf{x}_t$, and the output of the preceding LSTM cell, $\mathbf{h}_{t-1}$. The forget gate is a later addition [GSC00] to the original LSTM design; it determines based on $\mathbf{x}_t$ and $\mathbf{h}_{t-1}$ the amount of information to be discarded from the cell state [YSHZ19]:
$$\mathbf{f}_t = \sigma(\mathbf{W}_{fh} \mathbf{h}_{t-1} + \mathbf{W}_{fx} \mathbf{x}_t + \mathbf{b}_f),$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,$$
where $\mathbf{i}_t = \sigma(\mathbf{W}_{ih} \mathbf{h}_{t-1} + \mathbf{W}_{ix} \mathbf{x}_t + \mathbf{b}_i)$ and $\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_{ch} \mathbf{h}_{t-1} + \mathbf{W}_{cx} \mathbf{x}_t + \mathbf{b}_c)$. In the preceding equations, $\sigma$ denotes the logistic sigmoid function, $\odot$ denotes element-wise multiplication, $\mathbf{f}_t$ and $\mathbf{i}_t$ are the outputs of the forget gate and input gate, $\tilde{\mathbf{c}}_t$ is the candidate cell state, $\mathbf{c}_t$ is the cell state, and each $\mathbf{W}$ and $\mathbf{b}$ denotes the weights and bias of the component named in its subscript.
When the output of the forget gate, $\mathbf{f}_t$, is 1, all information in $\mathbf{c}_{t-1}$ is retained, and when the output is zero, all information is discarded. The cell output, $\mathbf{h}_t$, is the product:
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t), \qquad \mathbf{o}_t = \sigma(\mathbf{W}_{oh} \mathbf{h}_{t-1} + \mathbf{W}_{ox} \mathbf{x}_t + \mathbf{b}_o),$$
where $\mathbf{W}_{oh}$, $\mathbf{W}_{ox}$ and $\mathbf{b}_o$ are the weights and bias associated with the output gate.
LSTM networks can be classified into two main types [YSHZ19, VHMN20]:
LSTM-dominated networks
These are neural networks with LSTM cells as the dominant building blocks. The design of these networks focuses on optimising the interconnections of the LSTM cells. Examples include bidirectional LSTM networks, which are extensions of bidirectional RNNs. The original bidirectional LSTM network [GS05] uses a variation of backpropagation through time [Wer90] for training.
Integrated LSTM networks
These are hybrid neural networks consisting of LSTM and non-LSTM layers. The design of these networks focuses on integrating the strengths of the different types of layers. For example, convolutional layers and LSTM layers have been integrated in a wide variety of ways. Among the many possibilities, the CNN-LSTM architecture is widely used. It can for example be used to predict residential energy consumption [KC19]:
References
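A single step of a standard LSTM cell with a forget gate can be sketched in NumPy as follows. The parameter names (`W_fh`, `W_fx`, `b_f` and so on), the layer sizes and the random initialisation are assumptions made for illustration, and the sketch covers the forward pass only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One forward step of an LSTM cell; p is a dict of weights and biases."""
    f_t = sigmoid(p["W_fh"] @ h_prev + p["W_fx"] @ x_t + p["b_f"])      # forget gate
    i_t = sigmoid(p["W_ih"] @ h_prev + p["W_ix"] @ x_t + p["b_i"])      # input gate
    c_tilde = np.tanh(p["W_ch"] @ h_prev + p["W_cx"] @ x_t + p["b_c"])  # candidate state
    o_t = sigmoid(p["W_oh"] @ h_prev + p["W_ox"] @ x_t + p["b_o"])      # output gate
    c_t = f_t * c_prev + i_t * c_tilde   # discard old and admit new information
    h_t = o_t * np.tanh(c_t)             # cell output
    return h_t, c_t


def init_params(n_in, n_hidden, rng):
    """Small random weights and zero biases for the four components."""
    p = {}
    for gate in "fico":
        p[f"W_{gate}h"] = 0.1 * rng.standard_normal((n_hidden, n_hidden))
        p[f"W_{gate}x"] = 0.1 * rng.standard_normal((n_hidden, n_in))
        p[f"b_{gate}"] = np.zeros(n_hidden)
    return p


rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hidden=4, rng=rng)

# Run the cell over a sequence of five 3-dimensional inputs.
h, c = np.zeros(4), np.zeros(4)
for x_t in rng.standard_normal((5, 3)):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

With all weights and biases set to zero, every gate outputs 0.5 and the candidate state is zero, so the cell state is simply halved at each step; this makes a convenient sanity check.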
| ||||||||||||||||||||||||
M |
---|
Machine learning (including deep learning) | ||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
For COMP 5075 students, the Tutorial 3 page is more up-to-date.
Since the mid-2010s, advances in machine learning (ML) and particularly deep learning (DL), under the banner of artificial intelligence (AI), have been attracting not only media attention but also major capital investments. The field of ML is decades old, but it was not until 2012, when deep neural networks (DNNs) emerged triumphant in the ImageNet image classification challenge [KSH17], that the field of ML truly took off. DL is known to have approached or even exceeded human-level performance in many tasks. DL techniques, especially DNN algorithms, are our main pursuit in this course, but before diving into them, we should get a clear idea of the differences among ML, DL and AI, starting with the definitions below.
AI has the broadest and yet most elusive definition. There are four main schools of thought [RN22, Sec. 1.1; IBM23], namely 1️⃣ systems that think like humans, 2️⃣ systems that act like humans, 3️⃣ systems that think rationally, 4️⃣ systems that act rationally; but a sensible definition boils down to:
Definition 1: Artificial intelligence (AI) [RN22, Sec. 1.1.4]
The study and construction of rational agents that pursue their predefined objectives.
Above, a rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome [RN22, Sec. 1.1.4]. The definition of AI above is referred to as the standard model of AI [RN22, Sec. 1.1.4].
ML is a subfield of AI:
Definition 2: Machine learning (ML) [Mit97, Sec. 1.1]
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
In the preceding definition, “experience”, “task” and “performance measure” require elaboration. Among the most common ML tasks are:
Common to the aforementioned tasks is the need to measure performance. An example of performance measure is Other performance measures will be investigated as part of Task 1. DL is in turn a subfield of ML (see Fig. 3): Definition 3: Deep learning (DL) [RN22, Sec. 1.3.8]
Machine learning using multiple layers of simple, adjustable computing elements. Simply put, DL is the ever expanding body of ML techniques that leverage deep architectures (algorithmic structures consisting of many levels of nonlinear operations) for learning feature hierarchies, with features from higher levels of the hierarchy formed by composition of lower-level features [Ben09]. The rest of this tutorial attempts to 1️⃣ shed some light on why DNNs are superior to classical ML algorithms, 2️⃣ provide a brief tutorial on the original/shallow/artificial neural networks (ANNs), and 3️⃣ provide a preview of DNNs. The good news with the topics of this tutorial is that there is such a vast amount of learning resources in the public domain, that even if the coverage here fails to satisfy your learning needs, there must be some resources out there that can. References
| ||||||||||||||||||||||||||||||||||||||||||||||
P |
---|
Problems of vanishing gradients and exploding gradients | ||||||||
---|---|---|---|---|---|---|---|---|
This knowledge base entry follows discussion of artificial neural networks and backpropagation. The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight. Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged, and training fails to converge to a good solution — this is called the vanishing gradients problem [G22, Ch. 11]. The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges — this is the exploding gradients problem [G22, Ch. 11]. Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2]. More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds. Watch Prof Ng’s explanation of the problems: The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].
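A back-of-the-envelope sketch of both problems: by the chain rule, the gradient that reaches the first layer of a chain of one-unit layers is a product of per-layer factors (weight times activation slope), so it shrinks or grows geometrically with depth. The scalar chain, the weight values and the depth of 20 are assumptions chosen purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(weight, depth, x=0.5, activation="sigmoid"):
    """d(output)/d(input) for a chain of `depth` one-unit layers
    that all share the same scalar weight."""
    grad, a = 1.0, x
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local_slope = a * (1.0 - a)   # sigmoid'(z), never more than 0.25
        else:                             # linear (identity) activation
            a = z
            local_slope = 1.0
        grad *= weight * local_slope      # chain rule: multiply per-layer factors
    return grad


vanishing = chain_gradient(weight=1.0, depth=20)
exploding = chain_gradient(weight=1.5, depth=20, activation="linear")
print(f"sigmoid chain, w = 1.0: {vanishing:.3e}")
print(f"linear chain,  w = 1.5: {exploding:.3e}")
```

Because the slope of the sigmoid never exceeds 0.25, the first gradient is bounded by 0.25 raised to the power of the depth, whereas the second equals 1.5 raised to the power of the depth.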
Watch Prof Ng’s explanation of weight initialisation: References
| ||||||||
PyTorch | ||
---|---|---|
Installation instructions:
| ||
R |
---|
Recurrent neural networks | ||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A recurrent neural network (RNN) is a neural network which maps an input space of sequences to an output space of sequences in a stateful way[RHW86, Mur22]. While convolutional neural networks excel at two-dimensional (2D) data, recurrent neural networks (RNNs) are better suited for one-dimensional (1D), sequential data[GBC16, §9.11]. Unlike early artificial neural networks (ANNs) which have a feedforward structure, RNNs have a cyclic structure, inspired by the cyclical connectivity of neurons; see Fig. 1. The forward pass of an RNN is the same as that of a multilayer perceptron, except that activations arrive at a hidden layer from both the current external input and the hidden-layer activations from the previous timestep. Fig. 1 visualises the operation of an RNN by “unfolding” or “unrolling” the network across timesteps, with the same network parameters applied at each timestep. Note: The term “timestep” should be understood more generally as an index for sequential data. For the backward pass, two well-known algorithms are applicable: 1️⃣ real-time recurrent learning and the simpler, computationally more efficient 2️⃣ backpropagation through time[Wer90]. Fig. 1 implies information flows in one direction, the direction associated with causality. However, for many sequence labelling tasks, the correct output depends on the entire input sequence, or at least a sufficiently long input sequence. Examples of these tasks include speech recognition and language translation. Addressing the need of these tasks gave rise to bidirectional RNNs[SP97]. Standard/traditional RNNs suffer from the following deficiencies[Gra12, YSHZ19, MSO24]:
Due to the drawbacks above, RNNs are typically used with “leaky” units enabling the networks to accumulate information over a long duration[GBC16, §10.10]. The resultant RNNs are called gated RNNs. The most successful gated RNNs are those using long short-term memory (LSTM) or gated recurrent units (GRU). References
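The forward pass described in this entry, where each hidden state depends on the current input and the previous hidden state, can be sketched in NumPy as follows. The tanh activation, the parameter names and the layer sizes are assumptions made for illustration; the same parameters are reused at every timestep, which is what "unrolling" depicts.

```python
import numpy as np


def rnn_forward(inputs, W_xh, W_hh, b_h, h0=None):
    """Return the hidden state at every timestep of a simple RNN."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0
    states = []
    for x_t in inputs:
        # Activations arrive from the current input and the previous hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 4, 6
W_xh = 0.5 * rng.standard_normal((n_hidden, n_in))
W_hh = 0.5 * rng.standard_normal((n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

inputs = rng.standard_normal((n_steps, n_in))
states = rnn_forward(inputs, W_xh, W_hh, b_h)
print(states.shape)
```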
| ||||||||||||||||||||||
Reinforcement learning | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Work in progress
Reinforcement learning (RL) is a family of algorithms that learn an optimal policy, whose goal is to maximise the expected return when interacting with an environment [Goo25]. RL has existed since the 1950s [BD10], but it was the introduction of high-capacity function approximators, namely deep neural networks, that rejuvenated RL in recent years [LKTF20]. There are three main types of RL [LKTF20, FPMC24]:
References
| ||||||||||||||||||
S |
---|
Self-supervised learning | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
In his 2018 talk at EPFL and his AAAI 2020 keynote speech, Turing Award winner Yann LeCun referred to self-supervised learning (SSL, not to be confused with Secure Sockets Layer) as an algorithm that predicts any part of its input from any observed part. At the time of writing, a standard definition of SSL remains elusive, but it is characterised by [LZH+23, p. 857]:
SSL can be understood as learning to recover parts or some features of the original input, hence it is also called self-supervised representation learning. SSL has two distinct phases (see Fig. 2):
Different authors classify SSL algorithms slightly differently, but based on the pretext tasks, two distinct approaches are identifiable, namely generative and contrastive (see Fig. 2); the other approaches are either a hybrid of these two approaches, namely generative-contrastive / adversarial, or something else entirely. In Fig. 3, the generative (or generation-based) pre-training pipeline consists of a generator that 1️⃣ uses an encoder to encode input $\mathbf{x}$ into an explicit vector $\mathbf{z}$, and a decoder to reconstruct $\mathbf{x}$ from $\mathbf{z}$ as $\hat{\mathbf{x}}$; 2️⃣ is trained to minimise the reconstruction loss, which is a function of the difference between $\mathbf{x}$ and $\hat{\mathbf{x}}$. In Fig. 3, the contrastive (or contrast-based) pre-training pipeline consists of two components:
Figs. 4-5 illustrate generative and contrastive pre-training in greater detail using graph learning as the context. An extensive list of references on SSL can be found on GitHub.
References
| ||||||||||||
Standardising/standardisation and whitening | ||||||||
---|---|---|---|---|---|---|---|---|
Given a dataset $\mathbf{X} \in \mathbb{R}^{N \times D}$, where $N$ denotes the number of samples and $D$ denotes the number of features, it is a common practice to preprocess $\mathbf{X}$ so that each column has zero mean and unit variance; this is called standardising the data [Mur22, Sec. 7.4.5]. Standardising forces the variance (per column) to be 1 but does not remove correlation between columns. Decorrelation necessitates whitening.
Whitening is a linear transformation $\mathbf{z} = \mathbf{W}\mathbf{x}$ of measurement $\mathbf{x}$ that produces decorrelated $\mathbf{z}$ such that the covariance matrix $\operatorname{Cov}[\mathbf{z}] = \mathbf{I}$, where $\mathbf{W}$ is called a whitening matrix [CPSK07, Sec. 2.5.3]. All of $\mathbf{W}$, $\mathbf{z}$ and $\mathbf{I}$ have the same number of rows, denoted $K$, which satisfies $K \le D$; if $K < D$, then dimensionality reduction is also achieved besides whitening.
A whitening matrix can be obtained using eigenvalue decomposition of the covariance matrix of $\mathbf{x}$:
$$\boldsymbol{\Sigma} = \operatorname{Cov}[\mathbf{x}] = \mathbf{E} \boldsymbol{\Lambda} \mathbf{E}^\top,$$
where $\mathbf{E}$ is an orthogonal matrix containing the covariance matrix’s eigenvectors as its columns, and $\boldsymbol{\Lambda}$ is the diagonal matrix of the covariance matrix’s eigenvalues [ZX09, p. 74]. Based on the decomposition, the whitening matrix can be defined as
$$\mathbf{W} = \boldsymbol{\Lambda}^{-1/2} \mathbf{E}^\top.$$
$\mathbf{W}$ above is called the PCA whitening matrix [Mur22, Sec. 7.4.5].
References
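The difference between standardising and PCA whitening can be checked numerically with the NumPy sketch below. The synthetic two-feature dataset and the mixing matrix `A` are assumptions chosen so that the features are strongly correlated; no dimensionality reduction is performed, so the whitening matrix is square.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data: N = 1000 samples, D = 2 features.
A = np.array([[2.0, 0.0],
              [1.5, 0.5]])
X = rng.standard_normal((1000, 2)) @ A.T

# Standardising: zero mean and unit variance per column.
X_centred = X - X.mean(axis=0)
X_std = X_centred / X_centred.std(axis=0, ddof=1)
corr_after_standardising = np.cov(X_std, rowvar=False)[0, 1]

# PCA whitening: eigendecompose the covariance matrix of the centred data.
cov_x = np.cov(X_centred, rowvar=False)
eigvals, E = np.linalg.eigh(cov_x)       # eigenvectors are the columns of E
W = np.diag(eigvals ** -0.5) @ E.T       # whitening matrix
Z = X_centred @ W.T                      # apply z = W x to every row
cov_z = np.cov(Z, rowvar=False)

print("correlation left after standardising:", round(corr_after_standardising, 3))
print("covariance after whitening:\n", np.round(cov_z, 6))
```

Standardising leaves the two columns strongly correlated, whereas the covariance matrix of the whitened data is the identity up to floating-point error.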
| ||||||||
T |
---|
Transfer learning | ||||||
---|---|---|---|---|---|---|
References
| ||||||
Transformer and attention | ||||||||
---|---|---|---|---|---|---|---|---|
References
| ||||||||