An Old Problem - Ep. 5 (Deep Learning SIMPLIFIED)

So now you’re probably thinking – wow, deep nets are really great! But why did it
take so long for them to become popular? Well as it turns out, when you try to train them
with a method called backpropagation, you run into a fundamental problem called the
vanishing gradient, or sometimes the exploding gradient. When that happens, training takes
too long and the accuracy really suffers. Let’s take a closer look.
When you’re training a neural net, you’re constantly calculating a cost value. The cost
is typically the difference between the net’s predicted output and the actual output from
a set of labelled training data. The cost is then lowered by making slight adjustments
to the weights and biases over and over throughout the training process, until the lowest possible
value is obtained. Here is that forward prop again; and here are the example weights and
biases. The training process utilizes something called a gradient, which measures the rate
at which the cost will change with respect to a change in a weight or a bias.
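To make the idea of a gradient concrete, here is a minimal sketch in Python. It assumes a single neuron with one weight and one bias, a squared-error cost, and made-up numbers for the input, target, and learning rate; none of these come from the video. It estimates how fast the cost changes when the weight or bias is nudged, then takes one small step downhill.

```python
def predict(w, b, x):
    # A one-input "net": output = weight * input + bias (illustrative only).
    return w * x + b

def cost(w, b, x, y):
    # Squared difference between the predicted output and the labelled output.
    return (predict(w, b, x) - y) ** 2

def numerical_gradient(f, value, eps=1e-6):
    # Central difference: how much f changes per unit change in its argument.
    return (f(value + eps) - f(value - eps)) / (2 * eps)

# Hypothetical training example and starting parameters.
x, y = 2.0, 1.0
w, b = 0.5, 0.1
learning_rate = 0.1

# Gradient of the cost with respect to the weight and the bias.
grad_w = numerical_gradient(lambda w_: cost(w_, b, x, y), w)
grad_b = numerical_gradient(lambda b_: cost(w, b_, x, y), b)

# One slight adjustment against the gradient lowers the cost.
cost_before = cost(w, b, x, y)
new_w = w - learning_rate * grad_w
new_b = b - learning_rate * grad_b
cost_after = cost(new_w, new_b, x, y)

print(f"grad_w={grad_w:.4f} grad_b={grad_b:.4f}")
print(f"cost before={cost_before:.6f} after={cost_after:.6f}")
```

Real training repeats this adjustment many times over many examples, and backprop computes the gradients analytically rather than by nudging each parameter.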
Deep architectures are your best and sometimes your only choice for complex machine learning
problems such as facial recognition. But up until 2006, there was no way to accurately
train deep nets due to a fundamental problem with the training process: the vanishing gradient.
Let’s think of a gradient like a slope, and the training process like a rock rolling
down that slope. A rock will roll quickly down a steep slope but will barely move at
all on a flat surface. The same is true with the gradient of a deep net. When the gradient
is large, the net will train quickly. When the gradient is small, the net will train
slowly. Here's that deep net again. And here is how the gradient could potentially vanish
or decay back through the net. As you can see, the gradients are much smaller in the
earlier layers. As a result, the early layers of the network are the slowest to train. But
this is a fundamental problem! The early layers are responsible for detecting the simple patterns
and the building blocks – when it came to facial recognition, the early layers detected
the edges which were combined to form facial features later in the network. And if the
early layers get it wrong, the result built up by the net will be wrong as well. It could
mean that instead of a face like this, your net looks for this.
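The decay of the gradient toward the early layers can be sketched with a toy chain of layers. This is only an illustration under assumptions the video does not state: one unit per layer, a sigmoid activation, every weight set to 0.5, and no biases. The function name layer_gradients is made up for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_gradients(weights, x):
    """Gradient of the final output with respect to each layer's weight,
    for a chain of one-unit sigmoid layers (a toy stand-in for a deep net)."""
    # Forward pass: keep every activation, starting with the input.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: walk from the last layer to the first.
    grads = [0.0] * len(weights)
    upstream = 1.0  # gradient of the output with respect to itself
    for i in reversed(range(len(weights))):
        a_out = activations[i + 1]
        local = a_out * (1.0 - a_out)           # sigmoid derivative, at most 0.25
        grads[i] = upstream * local * activations[i]
        upstream = upstream * local * weights[i]  # passed back to the earlier layer
    return grads

weights = [0.5] * 6
grads = layer_gradients(weights, x=1.0)
for layer, g in enumerate(grads, start=1):
    print(f"layer {layer}: gradient {g:.2e}")
```

With these assumed values, the printed gradients shrink steadily from the last layer back to the first, so the first layer receives by far the smallest update.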
The process used for training a neural net is called back-propagation or back-prop. We
saw before that forward prop starts with the inputs and works forward; back-prop does the
reverse, calculating the gradient from right to left. For example, here are 5 gradients,
4 weight and 1 bias. It starts at the right and works back through the layers, like so.
Each time it calculates a gradient, it uses all the previous gradients up to that point.
So, let's start with that node. That edge uses the gradient at that node. And the next. So
far things are simple. As you keep going back, things get a bit more complex - that one for
example uses a lot of gradients, even though this is a relatively simple net. If your net
gets larger and deeper, like this one, it gets even worse. But why is that? Well, a
gradient at any point is the product of the previous gradients up to that point. And the
product of two numbers between 0 and 1 gives you a smaller number. Say this rectangle is
a one. Also, say there are two gradients - a fourth - like that - and a third. If you multiply
them, you get a fourth of a third which is a twelfth. A fourth of a twelfth is a forty-eighth.
You can see that numbers keep getting smaller the more you multiply.
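The arithmetic above can be checked with exact fractions. The sketch below uses Python's fractions module; the ten-layer case at the end is an added assumption to show how quickly the product shrinks, not a figure from the video.

```python
from fractions import Fraction

def running_products(factors):
    # Multiply the factors one at a time and record each intermediate product.
    product = Fraction(1)
    history = []
    for f in factors:
        product *= f
        history.append(product)
    return history

# A third, then a fourth of that, then a fourth again.
history = running_products([Fraction(1, 3), Fraction(1, 4), Fraction(1, 4)])
print([str(p) for p in history])  # ['1/3', '1/12', '1/48']

# Hypothetical deeper case: ten gradients of one fourth each.
deep = running_products([Fraction(1, 4)] * 10)[-1]
print(deep)  # 1/1048576
```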
Have you ever had this issue while training a neural network with backpropagation? If
so, please comment and let me know your thoughts.
As a result of all this, backprop ends up taking a lot of time to train the net, and
the accuracy is often very low.
Up until 2006, deep nets were still underperforming shallow nets and other machine learning algorithms.
But everything changed after three breakthrough papers published by Hinton, LeCun, and Bengio
in 2006 and 2007. In the next video, we’ll begin taking a closer look at these breakthroughs,
starting with the Restricted Boltzmann Machine.