


# Vanishing and Exploding Gradients in Neural Network Models: Debugging, Monitoring, and Fixing

## Common problems with backpropagation

The Backpropagation algorithm is the heart of all modern-day Machine Learning applications, and it is ingrained more deeply than you think.

Backpropagation calculates the gradients of the cost function w.r.t. the weights and biases in the network. It tells you all the changes you need to make to your weights to minimize the cost function (strictly speaking, it is -∇C that points in the direction of steepest decrease, while +∇C would give the steepest increase in the cost function). Pretty cool, because now you get to adjust all the weights and biases according to the training data you have.

What about deeper networks, like Deep Recurrent Networks?

In such networks, the effect that a change in the cost function C has on a weight in an early layer, i.e. the norm of the gradient, becomes so small (due to the increased model complexity that comes with more hidden units) that it is effectively zero after a certain point. This is what we call vanishing gradients. Those weights can no longer contribute to reducing the cost function C and go unchanged, which affects the network in the forward pass and eventually stalls the model.

On the other hand, the exploding gradients problem refers to a large increase in the norm of the gradient during training. Such events are caused by an explosion of the long-term components of the gradient, which can grow exponentially faster than the short-term ones. This results in an unstable network that at best cannot learn from the training data and at worst makes the gradient descent step impossible to execute. As arXiv:1211.5063 puts it: "The objective function for highly nonlinear deep neural networks or for recurrent neural networks often contains sharp nonlinearities in parameter space resulting from the multiplication of several parameters. These nonlinearities give rise to very high derivatives in some places. When the parameters get close to such a cliff region, a gradient descent update can catapult the parameters very far, possibly losing most of the optimization work that had been done."
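Since both problems show up first in the size of the gradients, a practical way to catch them early is to log gradient norms during training. Below is a minimal sketch of such monitoring, assuming a PyTorch model; the `model`, `loss_fn`, `optimizer`, `inputs`, and `targets` objects are placeholders for the example, not something defined in the article.

```python
# Minimal sketch: logging per-parameter gradient norms to spot
# vanishing gradients (norms collapsing toward 0) or exploding
# gradients (norms blowing up). Assumes PyTorch; all objects are placeholders.
import torch


def gradient_norms(model: torch.nn.Module) -> dict:
    """Map each parameter name to the L2 norm of its current gradient."""
    return {
        name: param.grad.norm(2).item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }


def train_step(model, loss_fn, optimizer, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Log the norms before the update; watch for values drifting
    # toward 0 or growing by orders of magnitude across steps.
    norms = gradient_norms(model)
    print({name: f"{norm:.3e}" for name, norm in norms.items()})

    optimizer.step()
    return loss.item()
```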
## Deep dive into the exploding gradients problem

To calculate gradients in deep recurrent networks, we use something called Backpropagation Through Time (BPTT): the recurrent model is represented as a deep multi-layer network (with an unbounded number of layers), and backpropagation is applied to the unrolled model. In other words, it is backpropagation on an unrolled RNN. Unrolling a recurrent neural network in time means creating a copy of the model for each time step.

We denote by a_t the hidden state of the network at time t, by x_t the input and ŷ_t the output of the network at time t, and by C_t the error obtained from the output at time t.

Let's work through some equations and see where it all starts. We will diverge from the classical BPTT equations at this point and rewrite the gradients in order to better highlight the exploding gradients problem. These equations are obtained by writing the gradients in a sum-of-products form.
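As a sketch of what that sum-of-products form looks like (following arXiv:1211.5063, quoted earlier; here $C = \sum_t C_t$ is the total cost, $a_t$ the hidden state as defined above, and $W$ the recurrent weight matrix, where $W$ and the exact summation bounds are assumptions for illustration, not taken from the article):

$$
\frac{\partial C}{\partial W} = \sum_{t=1}^{T} \frac{\partial C_t}{\partial W},
\qquad
\frac{\partial C_t}{\partial W}
= \sum_{k=1}^{t}
\frac{\partial C_t}{\partial \hat{y}_t}\,
\frac{\partial \hat{y}_t}{\partial a_t}
\left( \prod_{i=k+1}^{t} \frac{\partial a_i}{\partial a_{i-1}} \right)
\frac{\partial a_k}{\partial W},
$$

where $\partial a_k / \partial W$ is the immediate partial derivative, taken with $a_{k-1}$ held fixed. The product of Jacobians $\partial a_i / \partial a_{i-1}$ is the long-term component mentioned earlier: when its factors have norms larger than 1 it can grow exponentially with $t - k$ (exploding gradients), and when they are smaller than 1 it shrinks exponentially (vanishing gradients).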
We will also break down our flow into two parts:

### The gradient update

The idea behind clipping-by-value is simple: we pick a threshold, and whenever a component of the gradient exceeds it in absolute value, we clip that component back to the threshold before the gradient update is applied.
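A minimal sketch of what clipping-by-value can look like in a PyTorch training step is shown below; the threshold of 1.0 and the `model`, `loss_fn`, and `optimizer` objects are assumptions for the example. PyTorch also provides a ready-made helper, `torch.nn.utils.clip_grad_value_`, that performs the same elementwise clipping.

```python
# Minimal sketch of clipping-by-value before the gradient update.
# Assumes PyTorch; the objects and the threshold below are placeholders.
import torch

CLIP_VALUE = 1.0  # assumed threshold; tune per model


def clipped_train_step(model, loss_fn, optimizer, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Clip every gradient component into [-CLIP_VALUE, CLIP_VALUE].
    for param in model.parameters():
        if param.grad is not None:
            param.grad.clamp_(-CLIP_VALUE, CLIP_VALUE)

    # Built-in equivalent:
    # torch.nn.utils.clip_grad_value_(model.parameters(), CLIP_VALUE)

    optimizer.step()
    return loss.item()
```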
