## A
### Activation function: contemporary options
This knowledge base entry follows the discussion of artificial neural networks and backpropagation. Contemporary options for the activation function are the non-saturating activation functions [Mur22, Sec. 13.4.3], although the term is not accurate. Below, the argument of the activation function should be understood as the output of the summing junction.
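As a concrete illustration (the function choices and the argument name `v`, standing for the summing-junction output, are this sketch's own conventions rather than anything prescribed by [Mur22]), a few widely used non-saturating activations in NumPy:

```python
import numpy as np

def relu(v):
    """Rectified linear unit: max(0, v)."""
    return np.maximum(0.0, v)

def leaky_relu(v, alpha=0.01):
    """Leaky ReLU: a small slope alpha for negative inputs instead of zero."""
    return np.where(v > 0, v, alpha * v)

def elu(v, alpha=1.0):
    """Exponential linear unit: smooth for v < 0, approaching -alpha."""
    return np.where(v > 0, v, alpha * (np.exp(v) - 1.0))

v = np.linspace(-3.0, 3.0, 7)   # toy summing-junction outputs
print(relu(v))
print(leaky_relu(v))
print(elu(v))
```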
### Active learning
### Adversarial machine learning
Adversarial machine learning (AML) as a field can be traced back to [HJN+11]. The impact of adversarial examples on deep learning is well known within the computer vision community, and documented in a body of literature that has been growing exponentially since Szegedy et al.'s discovery [SZS+14]. The field is moving so fast that the taxonomy, terminology and threat models are still being standardised. See MITRE ATLAS.
### Artificial neural networks and backpropagation
See 👇 attachment or the latest source on Overleaf.
## B
### Batch normalisation (BatchNorm)
- Watch a high-level explanation of BatchNorm:
- Watch a more detailed explanation of BatchNorm by Prof Ng:
- Watch coverage of BatchNorm in the Stanford 2016 course CS231n, Lecture 5, Part 2:
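As a usage sketch only (the toy model and its layer sizes are assumptions, not taken from the videos above), BatchNorm is typically inserted between an affine layer and its non-linearity, e.g. in PyTorch:

```python
import torch
import torch.nn as nn

# Toy block: affine transform -> BatchNorm over the 64 features -> non-linearity.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # normalises each feature over the mini-batch, then scales/shifts
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 20)   # mini-batch of 32 samples with 20 input features
print(model(x).shape)     # torch.Size([32, 10])
```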
## C
### Cross-entropy loss
[Cha19, pp. 11-14]
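For orientation (using this entry's own notation, not necessarily [Cha19]'s): for a single sample with one-hot target $\mathbf{y} = (y_1, \dots, y_K)$ and predicted class probabilities $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_K)$ over $K$ classes, the cross-entropy loss is

$$\mathcal{L}_{\mathrm{CE}}(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k,$$

which reduces to $-\log \hat{y}_c$ for the true class $c$.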
## D
### Domain adaptation
Domain adaptation is learning a discriminative classifier or other predictor in the presence of a shift in data distribution between the source/training domain and the target/test domain [GUA+16].
### Dropout
Deep neural networks (DNNs) employ a large number of parameters to learn complex dependencies of outputs on inputs, but overfitting often occurs as a result. Large DNNs are also slow to converge. The dropout method implements the intuitive idea of randomly dropping units (along with their connections) from a network during training [SHK+14].
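A minimal sketch of the idea in NumPy, using the common "inverted dropout" formulation (an illustration rather than the exact procedure of [SHK+14]): each unit is kept with probability `keep_prob` during training, and the surviving activations are rescaled so their expected value is unchanged.

```python
import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    """Zero each unit with probability 1 - keep_prob during training."""
    if not training:
        return activations                        # no units are dropped at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob         # rescale so the expectation is unchanged

h = np.random.randn(4, 8)                         # a toy batch of hidden activations
print(dropout(h, keep_prob=0.8))
```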
## F
### Few-shot learning
## P
### Problems of vanishing gradients and exploding gradients
This knowledge base entry follows the discussion of artificial neural networks and backpropagation. The backpropagation ("backprop" for short) algorithm calculates gradients to update each weight. Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers' weights remain virtually unchanged, and training fails to converge to a good solution; this is called the vanishing gradients problem [G22, Ch. 11]. The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges; this is the exploding gradients problem [G22, Ch. 11]. Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2]. More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.

Watch Prof Ng's explanation of the problems:

The problems were observed decades ago and were the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].
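A small, self-contained illustration of the vanishing-gradients effect (the depth, width and choice of sigmoid activations below are arbitrary assumptions, not taken from [G22] or [Mur22]): with many stacked sigmoid layers, the gradient norm at the lowest layer comes out far smaller than at the top.

```python
import torch
import torch.nn as nn

# A deliberately deep stack of sigmoid layers to expose the shrinking gradients.
depth, width = 30, 64
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(16, width)
model(x).sum().backward()

first = model[0]                          # lowest linear layer
last = model[2 * (depth - 1)]             # highest linear layer
print(first.weight.grad.norm().item())    # typically orders of magnitude smaller...
print(last.weight.grad.norm().item())     # ...than the gradient norm at the top
```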
Watch Prof Ng's explanation of weight initialisation:
### PyTorch
Installation instructions:
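Once installed, a quick sanity check (an illustrative snippet, not part of any official instructions):

```python
import torch

print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # True if a CUDA-capable GPU can be used
```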
## S
### Self-supervised learning
In his 2018 talk at EPFL and his AAAI 2020 keynote speech, Turing Award winner Yann LeCun referred to self-supervised learning (SSL, not to be confused with Secure Socket Layer) as an algorithm that predicts any part of its input from any observed part. A standard definition of SSL remains elusive as of this writing, but it is characterised by [LZH+23, p. 857]:
SSL can be understood as learning to recover parts or some features of the original input; hence it is also called self-supervised representation learning. SSL has two distinct phases (see Fig. 2):
Different authors classify SSL algorithms slightly differently, but based on the pretext tasks, two distinct approaches are identifiable, namely generative and contrastive (see Fig. 2); the other approaches are either a hybrid of these two approaches, namely generative-contrastive / adversarial, or something else entirely. In Fig. 3, the generative (or generation-based) pre-training pipeline consists of a generator that 1️⃣ uses an encoder to encode the input $x$ into an explicit vector $z$, and a decoder to reconstruct $x$ from $z$ as $\hat{x}$; 2️⃣ is trained to minimise the reconstruction loss, which is a function of the difference between $x$ and $\hat{x}$. In Fig. 3, the contrastive (or contrast-based) pre-training pipeline consists of two components:
Figs. 4-5 illustrate generative and contrastive pre-training in greater detail using graph learning as the context. An extensive list of references on SSL can be found on GitHub.
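To make the generative pre-training pipeline above concrete, here is an illustrative PyTorch sketch of an autoencoder-style pretext task (the layer sizes and the mean-squared-error reconstruction loss are assumptions, not prescribed by [LZH+23]): encode $x$ into $z$, decode $z$ into $\hat{x}$, and minimise the reconstruction loss.

```python
import torch
import torch.nn as nn

# Toy generative pretext task: reconstruct the unlabelled input from a learned code z.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 16))
decoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 784))
optimiser = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

x = torch.rand(32, 784)                    # a batch of unlabelled inputs, e.g. flattened images
optimiser.zero_grad()
z = encoder(x)                             # explicit vector z
x_hat = decoder(z)                         # reconstruction of x
loss = nn.functional.mse_loss(x_hat, x)    # reconstruction loss
loss.backward()
optimiser.step()
print(loss.item())
```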
### Standardising/standardisation and whitening
Given a dataset $\mathbf{X} \in \mathbb{R}^{N \times D}$, where $N$ denotes the number of samples and $D$ denotes the number of features, it is a common practice to preprocess $\mathbf{X}$ so that each column has zero mean and unit variance; this is called standardising the data [Mur22, Sec. 7.4.5]. Standardising forces the variance (per column) to be 1 but does not remove correlation between columns. Decorrelation necessitates whitening.

Whitening is a linear transformation of a measurement $\mathbf{x}$ that produces a decorrelated $\mathbf{z} = \mathbf{W}\mathbf{x}$ such that the covariance matrix $\mathbb{E}[\mathbf{z}\mathbf{z}^\top] = \mathbf{I}$, where $\mathbf{W}$ is called a whitening matrix [CPSK07, Sec. 2.5.3]. The whitened vector $\mathbf{z}$ and the whitening matrix $\mathbf{W}$ have the same number of rows, denoted $m$, which satisfies $m \le n$, where $n$ is the dimension of $\mathbf{x}$; if $m < n$, then dimensionality reduction is also achieved besides whitening. A whitening matrix can be obtained using eigenvalue decomposition: $\mathbb{E}[\mathbf{x}\mathbf{x}^\top] = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^\top$, where $\mathbf{E}$ is an orthogonal matrix containing the covariance matrix's eigenvectors as its columns, and $\boldsymbol{\Lambda}$ is the diagonal matrix of the covariance matrix's eigenvalues [ZX09, p. 74]. Based on the decomposition, the whitening matrix can be defined as $\mathbf{W} = \boldsymbol{\Lambda}^{-1/2}\mathbf{E}^\top$; the $\mathbf{W}$ above is called the PCA whitening matrix [Mur22, Sec. 7.4.5].
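An illustrative NumPy sketch of standardising followed by PCA whitening as defined above (the toy data and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.7]])
X = rng.normal(size=(500, 3)) @ A          # toy dataset with correlated columns

# Standardising: zero mean and unit variance per column (does not decorrelate).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA whitening: W = Lambda^{-1/2} E^T from the eigendecomposition of the covariance.
C = np.cov(X_std, rowvar=False)
eigvals, E = np.linalg.eigh(C)             # columns of E are eigenvectors
W = np.diag(eigvals ** -0.5) @ E.T
Z = X_std @ W.T                            # whitened data: covariance is (close to) identity

print(np.round(np.cov(Z, rowvar=False), 2))
```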
## T
### Transfer learning
### Transformer