Problems of vanishing gradients and exploding gradients
This knowledge base entry follows the discussion of artificial neural networks and backpropagation. The backpropagation (“backprop” for short) algorithm calculates gradients to update each weight. Unfortunately, gradients often shrink as the algorithm progresses down to the lower layers, with the result that the lower layers’ weights remain virtually unchanged and training fails to converge to a good solution; this is called the vanishing gradients problem [G22, Ch. 11]. The opposite can also happen: the gradients can keep growing until the layers get excessively large weight updates and the algorithm diverges; this is the exploding gradients problem [G22, Ch. 11]. Both problems plague deep neural networks (DNNs) and recurrent neural networks (RNNs) over very long sequences [Mur22, Sec. 13.4.2]. More generally, deep neural networks suffer from unstable gradients, and different layers may learn at widely different speeds.

Watch Prof Ng’s explanation of the problems:

These problems were observed decades ago and were among the reasons why DNNs were mostly abandoned in the early 2000s [G22, Ch. 11].
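As a concrete illustration, the NumPy sketch below (an illustrative toy, not taken from the references; the layer count, width, weight scales and activations are arbitrary choices) backpropagates a gradient through a stack of randomly initialised dense layers and prints the gradient norm at every fifth layer. With saturating sigmoid units the norms shrink geometrically towards the lower layers (vanishing gradients), while ReLU units with larger random weights make them grow geometrically instead (exploding gradients).

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_gradient_norms(n_layers=20, width=50, weight_std=1.0, activation="sigmoid"):
    """Backpropagate a unit upstream gradient through a stack of randomly
    initialised dense layers and return the gradient norm at each layer,
    ordered from the top (output) layer down to the bottom (input) layer."""
    weights = [rng.normal(0.0, weight_std, (width, width)) for _ in range(n_layers)]

    # Forward pass, keeping each layer's activation for the backward pass.
    a = rng.normal(size=(width, 1))
    activations = []
    for W in weights:
        z = W @ a
        a = 1.0 / (1.0 + np.exp(-z)) if activation == "sigmoid" else np.maximum(z, 0.0)
        activations.append(a)

    # Backward pass (chain rule): the gradient w.r.t. a layer's input is
    #   W^T @ (gradient w.r.t. its output * activation'(z)).
    grad = np.ones((width, 1))
    norms = []
    for W, a in zip(reversed(weights), reversed(activations)):
        local_deriv = a * (1.0 - a) if activation == "sigmoid" else (a > 0).astype(float)
        grad = W.T @ (grad * local_deriv)
        norms.append(float(np.linalg.norm(grad)))
    return norms

# Saturating sigmoid units with smallish random weights: the gradient norm
# shrinks by roughly a constant factor per layer and is vanishingly small
# by the time it reaches the lowest layers.
print("sigmoid:", " ".join(f"{g:.1e}" for g in
                           layer_gradient_norms(weight_std=0.1, activation="sigmoid")[::5]))

# ReLU units with larger random weights: the same norms grow geometrically
# instead, i.e. exploding gradients.
print("relu:   ", " ".join(f"{g:.1e}" for g in
                           layer_gradient_norms(weight_std=1.0, activation="relu")[::5]))
```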
Watch Prof Ng’s explanation of weight initialisation:

References