A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) designed to address the problems of vanishing gradients and exploding gradients using gradient truncation and structures called “constant error carousels” for enforcing constant (as opposed to vanishing or exploding) error flow[HS97].
LSTM solves such a fundamental problem with traditional RNNs that most of the state-of-the-art results achieved through RNNs can be attributed to LSTM[YSHZ19].
An LSTM network replaces the traditional neural network layers with LSTM layers, each of which consists of a set of recurrently connected, differentiable memory blocks[GS05].
Each LSTM block typically contains one recurrently connected memory cell, called an LSTM cell (to be distinguished from a neuron, which is also called a node or unit), but can contain multiple cells.
Fig. 1 illustrates the structure of an LSTM cell, which acts on the current input, $x_t$, and the output of the preceding LSTM cell, $h_{t-1}$.
The forget gate is a later addition[GSC00] to the original LSTM design; it determines, based on $x_t$ and $h_{t-1}$, the amount of information to be discarded from the cell state[YSHZ19]:
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
where $\sigma$ is the logistic sigmoid function and $\odot$ denotes element-wise multiplication. In the preceding equations,
$W_f$, $U_f$ and $b_f$ are weights and bias associated with the forget gate;
$W_c$, $U_c$ and $b_c$ are weights and bias associated with the cell;
$W_i$, $U_i$ and $b_i$ are weights and bias associated with the input gate.
When the output of the forget gate, $f_t$, is 1, all information in $c_{t-1}$ is retained, and when the output is 0, all information is discarded.
The cell output, $h_t$, is the product:
$$h_t = o_t \odot \tanh(c_t), \qquad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
where $W_o$, $U_o$ and $b_o$ are the weights and bias associated with the output gate.
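To make these equations concrete, the following is a minimal NumPy sketch of a single LSTM forward step. It is an illustrative sketch rather than any particular library's implementation; the function name lstm_step, the dictionary layout of the weights and the layer sizes are assumptions made here for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM forward step.

    W, U and b are dicts keyed by 'f', 'i', 'c', 'o', holding the input
    weights, recurrent weights and bias of the forget gate, input gate,
    cell candidate and output gate respectively.
    """
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])      # forget gate
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])      # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state (element-wise)
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])      # output gate
    h_t = o_t * np.tanh(c_t)                                    # cell output
    return h_t, c_t

# Toy example: 3 input features, 4 hidden units (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W = {k: rng.standard_normal((n_hidden, n_in)) for k in 'fico'}
U = {k: rng.standard_normal((n_hidden, n_hidden)) for k in 'fico'}
b = {k: np.zeros(n_hidden) for k in 'fico'}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, W, U, b)
```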
Fig. 1: An LSTM block with one memory cell, which contains a forget gate acting on the current input, $x_t$, and the output of the preceding LSTM cell, $h_{t-1}$. $c_{t-1}$ and $c_t$ are the cell states for the preceding cell and the current cell respectively. While the forget gate scales the cell state, the input and output gates scale the input and output of the cell respectively. The activation functions applied to the cell input and output, also called squashing functions, are usually $\tanh$. The multiplication represented by $\odot$ is element-wise. Omitted from the diagram are the weights and bias associated with the (1) forget gate, (2) cell, (3) input gate, and (4) output gate. Diagram adapted from [VHMN20, Fig. 1], [YSHZ19, Figure 3] and [GS05, Fig. 1].
LSTM networks can be classified into two main types[YSHZ19, VHMN20]:
LSTM-dominated networks
These are neural networks with LSTM cells as the dominant building blocks.
The design of these networks focuses on optimising the interconnections of the LSTM cells.
Examples include bidirectional LSTM networks, which are extensions of bidirectional RNNs.
The original bidirectional LSTM network[GS05] uses a variation of backpropagation through time[Wer90] for training.
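As an illustration of an LSTM-dominated network, the sketch below runs a bidirectional LSTM over a toy batch of sequences using PyTorch's torch.nn.LSTM. The layer sizes and input data are arbitrary assumptions, and the sketch is not tied to the training procedure of [GS05].

```python
import torch
import torch.nn as nn

# A bidirectional LSTM processes each sequence forwards and backwards
# and concatenates the two hidden states at every time step.
bilstm = nn.LSTM(input_size=8, hidden_size=16,
                 num_layers=1, batch_first=True, bidirectional=True)

x = torch.randn(2, 50, 8)    # (batch, time steps, features)
outputs, (h_n, c_n) = bilstm(x)

print(outputs.shape)         # torch.Size([2, 50, 32]): 16 forward + 16 backward units
print(h_n.shape)             # torch.Size([2, 2, 16]): final hidden state per direction
```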
Integrated LSTM networks
These are hybrid neural networks consisting of LSTM and non-LSTM layers.
The design of these networks focuses on integrating the strengths of the different types of layers.
For example, convolutional layers and LSTM layers have been integrated in a wide variety of ways.
Among the many possibilities, the CNN-LSTM architecture is widely used. For example, it can be used to predict residential energy consumption[KC19]:
Kim and Cho’s design[KC19] consists of two convolutional-pooling layers, an LSTM layer and two fully connected (or dense) layers.
The convolutional-pooling layers extract features among several variables that affect energy consumption prediction.
The output of the convolutional-pooling layers is fed to the LSTM layer, after denoising, to extract temporal features. The LSTM layer can remember irregular trends.
The output of the LSTM layer is fed to two fully connected layers, the second of which generates a predicted time series of energy consumption (see the code sketch after this list).
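The PyTorch sketch below mirrors the layer sequence described above (two convolutional-pooling layers, an LSTM layer, two fully connected layers). All layer widths, kernel sizes and the single-step prediction head are illustrative assumptions, not the exact configuration of [KC19].

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch of a CNN-LSTM: convolutional-pooling layers extract features
    across the input variables, an LSTM layer models temporal structure,
    and two fully connected layers produce the prediction."""

    def __init__(self, n_features: int = 8, horizon: int = 1):
        super().__init__()
        self.conv = nn.Sequential(          # two convolutional-pooling layers
            nn.Conv1d(n_features, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        self.head = nn.Sequential(          # two fully connected (dense) layers
            nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, horizon),
        )

    def forward(self, x):                   # x: (batch, time steps, features)
        z = self.conv(x.transpose(1, 2))    # Conv1d expects (batch, channels, time)
        z, _ = self.lstm(z.transpose(1, 2)) # back to (batch, time, channels) for the LSTM
        return self.head(z[:, -1, :])       # predict from the last time step

model = CNNLSTM()
y = model(torch.randn(4, 24, 8))            # e.g. 24 time steps of 8 variables
print(y.shape)                              # torch.Size([4, 1])
```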
[GSC00]
F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to forget: Continual prediction with LSTM, Neural Computation 12 no. 10 (2000), 2451–2471. https://doi.org/10.1162/089976600300015015.
[GS05]
A. Graves and J. Schmidhuber, Framewise phoneme classification with bidirectional LSTM networks, in 2005 IEEE International Joint Conference on Neural Networks, 4, 2005, pp. 2047–2052. https://doi.org/10.1109/IJCNN.2005.1556215.
[HS97]
S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation 9 no. 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
[KC19]
T. Y. Kim and S. B. Cho, Predicting residential energy consumption using CNN-LSTM neural networks, Energy 182 (2019), 72–81.
[Mur22]
K. P. Murphy, Probabilistic Machine Learning: An Introduction, MIT Press, 2022. Available at http://probml.ai.
[VHMN20]
G. Van Houdt, C. Mosquera, and G. Nápoles, A review on the long short-term memory model, Artificial Intelligence Review 53 no. 8 (2020), 5929–5955. https://doi.org/10.1007/s10462-020-09838-1.
[Wer90]
P. Werbos, Backpropagation through time: what it does and how to do it, Proceedings of the IEEE 78 no. 10 (1990), 1550–1560. https://doi.org/10.1109/5.58337.
[YSHZ19]
Y. Yu, X. Si, C. Hu, and J. Zhang, A review of recurrent neural networks: LSTM cells and network architectures, Neural Computation 31 no. 7 (2019), 1235–1270. https://doi.org/10.1162/neco_a_01199.
[ZLLS23]
A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, Cambridge University Press, 2023. Available at https://d2l.ai/.