Convolutional neural networks

by Yee Wei Law - Thursday, 9 January 2025, 3:31 PM
 

The convolutional neural network (ConvNet or CNN) is an evolution of the multilayer perceptron that replaces matrix multiplication with convolution in at least one layer[GBC16, Ch. 9].

A deep CNN is a CNN that has more than three hidden layers.

CNNs play an important role in the history of deep learning because[GBC16, §9.11]:

  • They exemplify successful applications of neuroscientific insights[Lin21] to machine learning.
  • CNNs are among the first neural networks to achieve commercial success; for example, AT&T used the LeNet-5 CNN to read bank checks in the 1990s[LBBH98].
  • Deep CNNs are among the first to be successfully trained with backpropagation.
  • Deep CNNs are among the first deep neural networks to achieve state-of-the-art results for challenging problems; for example, the ground-breaking performance of the 8-layer CNN called AlexNet in the 2012 ImageNet challenge[KSH17] is widely considered to be the watershed event that propelled deep learning research.
  • Deep CNNs remain pertinent for contemporary applications; for example, a combination of CNN and LSTM has been applied to predicting aircraft trajectory for assisting air crews in their decision-making during the approach phase of a flight[LPM21].

The following discusses the basic CNN structure and zooms in on the structural elements.

Structure

The core CNN structure is inspired by the visual cortex of cats.

The receptive field of a cell in the visual system may be defined as the region of retina (or visual field) over which one can influence the firing of that cell[HW62]. In a cat’s visual cortex, the majority of receptive fields can be classified as either simple or complex[HW62, LB02, Lin21]:

  • The simple cells respond to bars of light or dark when placed at specific spatial locations.

    For each cell, there is an orientation of the bar at which the cell fires the most, with its response declining as the angle of the bar changes from the optimal/preferred orientation.

    In a nutshell, the simple cells are locally sensitive and orientation-selective.

  • The complex cells have less strict response profiles.

    These cells are also sensitive to the bar’s orientation, but can respond just as strongly to a bar in several different nearby locations.

    These complex cells receive input from several simple cells, all with the same preferred orientation but with slightly different preferred locations. In other words, the response of the complex cells is shift/translation-invariant; see Fig. 1.

Fig. 1: In an experiment conducted by Hubel and Wiesel on complex cells[HW62, p. 119], a dark bar was placed against a bright background. Vigorous firing was observed regardless of the position of the bar, provided the bar was horizontal and within the receptive field (A-C). If the bar was tipped more than 10° in either direction, no firing was observed (D-E). Diagram from [HW62, Text-fig. 7].
Fig. 2: The structure on the left depicts a “neocognitron”, which is a hierarchy of afferent S-cells (simple cells) feeding into C-cells (complex cells). The S-cells have preferred locations (dashed ovals) in the image, where they respond strongly to bars of preferred orientations. The C-cells collect inputs from the S-cells and exhibit more spatially invariant responses. The structure on the right shows the core CNN structure mirroring the neocognitron, which consists of an input layer connected to a convolutional layer connected to a pooling layer. Diagram from [Lin21, Figure 1].

As a predecessor to the CNN, Fukushima’s neural network model “neocognitron”[Fuk80], as shown in Fig. 2, is a hierarchy of alternating layers of “S-cells” (modelling simple cells) and “C-cells” (modelling complex cells).

The neocognitron performs layer-wise unsupervised learning (clustering to be specific)[GBC16, §9.10], such that none of the C-cells in the last layer responds to more than one stimulus pattern[Fuk80]. Furthermore, the response is invariant to the pattern's position and to small changes in shape or size[Fuk80].

Inspired by the neocognitron, the core CNN structure, as shown in Fig. 2, has convolution layers that mimic the behavior of S-cells, and pooling layers that mimic the behavior of C-cells.

The output of a convolution layer is called a feature map[ZLLS23, §7.2.6].

Not shown in Fig. 2 is a nonlinear activation (e.g., ReLU) layer between the convolution layer and the pooling layer; these three layers implement the three-stage processing that characterizes the CNN[GBC16, §9.3].

The nonlinear activation stage is sometimes called the detector stage[GBC16, §9.3].
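
As an illustrative sketch (assuming PyTorch is available; the channel counts and kernel sizes below are arbitrary), the three-stage processing can be expressed as:

    import torch
    import torch.nn as nn

    # Three-stage processing of a CNN [GBC16, §9.3]:
    # convolution -> detector (nonlinear activation) -> pooling.
    three_stage = nn.Sequential(
        nn.Conv2d(in_channels=1, out_channels=4, kernel_size=5),  # convolution stage
        nn.ReLU(),                                                 # detector stage
        nn.MaxPool2d(kernel_size=2),                               # pooling stage
    )

    x = torch.randn(1, 1, 28, 28)   # a dummy batch of one 28x28 single-channel image
    print(three_stage(x).shape)     # torch.Size([1, 4, 12, 12])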

Invariance to local translation is useful if detecting the presence of a feature is more important than localizing the feature.

When it was introduced, the CNN brought three architectural innovations to achieve shift/translation-invariance[LB02, GBC16]:

  1. Local receptive fields (or sparse interactions): In traditional neural networks, every output unit is connected to every input unit through matrix multiplication.

    For any element of some layer, its receptive field refers to all the elements (from all the previous layers) that may affect the calculation of that element during forward propagation[ZLLS23, §7.2.6].

    CNNs force the extraction of local features by restricting the receptive fields of hidden units to be local and only as large as the size of a kernel/filter (see next section). In other words, CNNs enforce sparse interactions; see Fig. 3.

    Thus, compared to traditional neural networks, CNNs 1️⃣ need less memory because there are fewer parameters/weights to store, 2️⃣ have better statistical efficiency, and 3️⃣ are more computationally efficient; see the parameter-count sketch after Fig. 3.

  2. Shared weights (or tied weights or weight replication or parameter sharing): This refers to using the same parameter/weight for more than one function in a model.

    More concretely, the value of a weight applied to one input is tied to the value of a weight applied elsewhere. This happens because each element of a kernel/filter is applied to every element of the input (every pixel if the input is an image), barring some boundary elements.

    In contrast, for a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer.

  3. Pooling (or subsampling): This is discussed in the last section. That is also where a complete example of a CNN is shown.
Fig. 3: The sparse connectivity of a CNN (top) vs the full connectivity of a traditional neural network (bottom). Consider the highlighted output unit $s_3$ at the top: its receptive field consists of only $x_2$, $x_3$ and $x_4$. The size of the 1D convolution kernel/filter is 3 in this case. Diagram from [GBC16, Figure 9.3].
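
To make the memory saving of points 1 and 2 concrete, the following sketch (assuming PyTorch; the 28×28 input size is illustrative) compares the parameter count of a fully connected layer with that of a convolution layer acting on the same input:

    import torch.nn as nn

    # Fully connected: every one of the 784 outputs is connected to every one of the 784 inputs.
    dense = nn.Linear(in_features=28 * 28, out_features=28 * 28)

    # Convolutional: a single 3x3 kernel is shared across the whole 28x28 input.
    conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)

    count = lambda module: sum(p.numel() for p in module.parameters())
    print(count(dense))  # 615440 (784*784 weights + 784 biases)
    print(count(conv))   # 10 (3*3 shared weights + 1 bias)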

Structural element: convolution

It has to be emphasised that convolution in this context is inspired by, but not equivalent to, convolution in linear system theory, and furthermore it exists in several variants[GBC16, Ch. 9]; it is nevertheless still a linear operation.

In signal processing, if $x(t)$ denotes a time-dependent input and $w(t)$ denotes the impulse response function of a linear system, then the output response $s(t)$ of the system is given by the convolution (denoted by the symbol $*$) of $x$ and $w$:

$$s(t) = (x * w)(t) = \int_{-\infty}^{\infty} x(a)\, w(t-a)\, \mathrm{d}a.$$

In discrete time, the equation above can be written as

$$s[n] = (x * w)[n] = \sum_{a=-\infty}^{\infty} x[a]\, w[n-a].$$

Above, square brackets are used to distinguish discrete time from continuous time.
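
As a quick numerical check of the discrete-time formula (a minimal sketch assuming NumPy; the signal and kernel values are arbitrary):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])   # discrete-time input x[n]
    w = np.array([0.5, 1.0, 0.25])       # kernel/filter w[n]

    # s[n] = sum_a x[a] * w[n - a], evaluated wherever the summand is defined
    s = np.array([sum(x[a] * w[n - a]
                      for a in range(len(x)) if 0 <= n - a < len(w))
                  for n in range(len(x) + len(w) - 1)])

    print(np.allclose(s, np.convolve(x, w)))  # True: NumPy's convolve computes the same sum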

In machine learning (ML), $w$ is called a kernel or filter, and the output $s$ of the convolution is an example of a feature map.

For two-dimensional (2D) inputs (e.g., images), we use 2D kernels in convolution:

$$S(i,j) = (X * W)(i,j) = \sum_m \sum_n X(m,n)\, W(i-m,\, j-n) = \sum_m \sum_n X(i-m,\, j-n)\, W(m,n).$$
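
A minimal NumPy sketch of the 2D formula above, restricted to the 'valid' region (no padding); the image and kernel here are arbitrary:

    import numpy as np

    def conv2d_valid(X, W):
        """2D convolution of image X with kernel W (kernel flipped), 'valid' region only."""
        kh, kw = W.shape
        out_h, out_w = X.shape[0] - kh + 1, X.shape[1] - kw + 1
        W_flipped = W[::-1, ::-1]          # flipping implements the (i-m, j-n) indexing
        S = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                S[i, j] = np.sum(X[i:i + kh, j:j + kw] * W_flipped)
        return S

    X = np.arange(25.0).reshape(5, 5)     # a toy 5x5 "image"
    W = np.ones((3, 3)) / 9.0             # a 3x3 averaging kernel
    print(conv2d_valid(X, W))             # a 3x3 feature map, cf. the formula above
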
Convolution works similarly to windowed (Gabor) Fourier transforms[BK19, §6.5] and wavelet transforms[Mal16].

Convolution is not to be confused with cross-correlation, but for efficiency, convolution is often implemented as cross-correlation in ML libraries. For example, PyTorch implements 2D convolution (Conv2d) as cross-correlation (denoted by the symbol $\star$):

$$S(i,j) = (X \star W)(i,j) = \sum_m \sum_n X(i+m,\, j+n)\, W(m,n).$$

Note above, the indices of $X$ are $i+m$ and $j+n$ instead of $i-m$ and $j-n$. The absence of index flipping makes cross-correlation more efficient to implement.
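
A quick check of the above, assuming both SciPy and PyTorch are installed (both are used only for illustration here):

    import numpy as np
    import torch
    import torch.nn.functional as F
    from scipy.signal import convolve2d, correlate2d

    X = np.random.randn(5, 5)
    W = np.random.randn(3, 3)

    # Cross-correlation with W equals convolution with the flipped kernel.
    xcorr = correlate2d(X, W, mode='valid')
    print(np.allclose(xcorr, convolve2d(X, W[::-1, ::-1], mode='valid')))  # True

    # PyTorch's conv2d computes cross-correlation (no kernel flipping).
    out = F.conv2d(torch.tensor(X).reshape(1, 1, 5, 5),
                   torch.tensor(W).reshape(1, 1, 3, 3))
    print(np.allclose(out.numpy().squeeze(), xcorr))                       # True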

Both convolution and cross-correlation are equivariant to shift/translation: shifting the input shifts the output by the same amount.

Neither convolution nor cross-correlation is equivariant to changes in scale or rotation[GBC16].
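
A one-dimensional illustration of translation equivariance (a minimal sketch assuming NumPy; the values are arbitrary):

    import numpy as np

    x = np.array([1.0, 4.0, 2.0, 3.0])   # input signal
    w = np.array([1.0, -1.0, 2.0])       # kernel

    x_shifted = np.concatenate(([0.0], x))        # shift the input right by one sample

    y         = np.convolve(x, w)                 # output for the original input
    y_shifted = np.convolve(x_shifted, w)         # output for the shifted input

    # Shifting the input shifts the output by the same amount (equivariance).
    print(np.allclose(y_shifted, np.concatenate(([0.0], y))))  # True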

Instead of convolution, CNNs typically use cross-correlation because the kernel values are learned anyway: whether or not the kernel is flipped makes no difference to what the network can learn, so the flipping is simply omitted[GBC16, §9.1].

Fig. 4 animates an example of a cross-correlation operation using an edge detector filter.

To an $n \times n$ image, applying an $f \times f$ filter at a stride of $s$ produces a feature map of size

$$\left( \frac{n-f}{s} + 1 \right) \times \left( \frac{n-f}{s} + 1 \right),$$

assuming $n-f$ is a multiple of $s$[Mad21].
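
The formula can be checked against the examples in this entry (a plain-Python sketch):

    def feature_map_size(n, f, s=1):
        """Side length of the feature map from an n x n image, an f x f filter and stride s."""
        assert (n - f) % s == 0, "n - f must be a multiple of the stride s"
        return (n - f) // s + 1

    print(feature_map_size(n=5, f=3, s=1))    # 3  -> the 3x3 feature map of Fig. 4
    print(feature_map_size(n=28, f=5, s=1))   # 24 -> the 24x24 feature map of Fig. 7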

Fig. 4: An example of a cross-correlation operation [Mad21], where a 3x3 edge detector filter is applied to a 5x5 image. Starting from the upper left corner, the filter is slid rightwards by 1 pixel or downwards by 1 pixel at a time; in other words, the operation has a stride of 1.

Structural element: pooling

A pooling function replaces the output of a neural network at a certain location with a summary statistic (e.g., maximum, average) of the nearby outputs[GBC16, §9.3]. The number of these nearby outputs is the pool width. Like a convolution/cross-correlation filter, a pooling window is slid over the input one stride at a time.

Fig. 5 illustrates the maximum-pooling (max-pooling for short) operation[RP99].

Pooling helps make a feature map approximately invariant to small translations of the input. Furthermore, for many tasks, pooling is essential for handling inputs of varying size.

Fig. 6 illustrates the role of max-pooling in downsampling.
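
A minimal NumPy sketch of 1D max-pooling in the spirit of Fig. 5 (pool width 3, stride 1) and Fig. 6 (pool width 3, stride 2); the detector-stage values are arbitrary, and incomplete windows at the right edge are simply dropped:

    import numpy as np

    def max_pool_1d(x, width=3, stride=1):
        """Slide a window of the given width over x and keep the maximum of each window."""
        return np.array([x[i:i + width].max()
                         for i in range(0, len(x) - width + 1, stride)])

    detector_stage = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.7])
    print(max_pool_1d(detector_stage, width=3, stride=1))  # [1.  1.  0.2 0.7] (cf. Fig. 5)
    print(max_pool_1d(detector_stage, width=3, stride=2))  # [1.  0.2]         (cf. Fig. 6)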

As a summary, Fig. 7 shows the structure of an example of an early CNN[LB02]. Note, though, that the last layers of a modern CNN are typically fully connected layers (also known as dense layers).

Fig. 5: (Top) Max-pooling is applied to the detector stage 3 units at a time and at a stride of 1. (Bottom) Even when all the units in the detector stage change in value, only one unit in the pool stage changes in value, exhibiting a high degree of invariance. Diagram from [GBC16, Figure 9.8].
Fig. 6: Like Fig. 5, a pool width of 3 is used (except for the rightmost pool), but here, the stride is 2, halving the size of the feature map. Diagram from [GBC16, Figure 9.10].
Fig. 7: An example of an early CNN using 5x5 convolution filters for recognizing handwriting. The dimension of the first feature map is (28−5+1)×(28−5+1) = 24×24. For the first convolution layer, 4 different filters are used, resulting in 4 feature maps. The subsampling layers are obtained by applying 2x2 average-pooling. The 26 output units correspond to the 26 letters of the alphabet. Diagram from [LB02, Figure 1].
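
To make Fig. 7 concrete, the following is a minimal PyTorch sketch of a CNN in the same spirit (the layer sizes loosely follow Fig. 7 and are otherwise illustrative; as noted above, a modern CNN would end in fully connected/dense layers):

    import torch
    import torch.nn as nn

    # A small LeNet-style CNN: alternating convolution/detector/pooling stages,
    # followed by a fully connected (dense) output layer.
    model = nn.Sequential(
        nn.Conv2d(1, 4, kernel_size=5),    # 28x28 input -> 4 feature maps of 24x24
        nn.ReLU(),
        nn.AvgPool2d(2),                   # subsampling: 24x24 -> 12x12
        nn.Conv2d(4, 12, kernel_size=5),   # 12x12 -> 12 feature maps of 8x8
        nn.ReLU(),
        nn.AvgPool2d(2),                   # subsampling: 8x8 -> 4x4
        nn.Flatten(),
        nn.Linear(12 * 4 * 4, 26),         # 26 output units, one per letter
    )

    x = torch.randn(1, 1, 28, 28)          # a dummy 28x28 handwritten character
    print(model(x).shape)                  # torch.Size([1, 26])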

References

[BK19] S.L. Brunton and J.N. Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press, 2019. https://doi.org/10.1017/9781108380690.
[Fuk80] K. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics 36 no. 4 (1980), 193–202. https://doi.org/10.1007/BF00344251.
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016. Available at https://www.deeplearningbook.org.
[GK22] P. Grohs and G. Kutyniok, Mathematical Aspects of Deep Learning, Cambridge University Press, 2022. https://doi.org/10.1017/9781009025096.
[HW62] D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, J. Physiol. 160 (1962), 106–154. https://doi.org/10.1113/jphysiol.1962.sp006837.
[KSH17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 no. 6 (2017), 84–90, journal version of the paper with the same name that appeared in NIPS 2012. https://doi.org/10.1145/3065386.
[LBBH98] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 no. 11 (1998), 2278–2324. https://doi.org/10.1109/5.726791.
[LB02] Y. LeCun and Y. Bengio, Convolutional networks for images, speech, and time series, in The Handbook of Brain Theory and Neural Networks (M. A. Arbib, ed.), MIT Press, 2nd ed., 2002, 1st edition in 1995, pp. 276–279. https://doi.org/10.7551/mitpress/3413.001.0001.
[LPM21] H. Lee, T. G. Puranik, and D. N. Mavris, Deep spatio-temporal neural networks for risk prediction and decision support in aviation operations, Journal of Computing and Information Science in Engineering 21 no. 4 (2021), 041013. https://doi.org/10.1115/1.4049992.
[Lin21] G. W. Lindsay, Convolutional neural networks as a model of the visual system: Past, present, and future, Journal of Cognitive Neuroscience 33 no. 10 (2021), 2017–2031. https://doi.org/10.1162/jocn_a_01544.
[Mad21] S. Madhavan, Introduction to convolutional neural networks: Explore the different steps that go into creating a convolutional neural network, IBM Developer article, 2021. Available at https://developer.ibm.com/articles/introduction-to-convolutional-neural-networks/.
[Mal16] S. Mallat, Understanding deep convolutional networks, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374 no. 2065 (2016), 20150203. https://doi.org/10.1098/rsta.2015.0203.
[RP99] M. Riesenhuber and T. Poggio, Hierarchical models of object recognition in cortex, Nature Neuroscience 2 no. 11 (1999), 1019–1025. https://doi.org/10.1038/14819.
[SB18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[ZLLS23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.


Cross-entropy loss

by Yee Wei Law - Friday, 31 March 2023, 1:40 PM
 

[Cha19, pp. 11-14]

References

[Cha19] E. Charniak, Introduction to Deep Learning, MIT Press, 2019. Available at https://ebookcentral.proquest.com/lib/unisa/reader.action?docID=6331506.