17.學習。提高 (17. Learning: Boosting) - VoiceTube 看影片學英語

字幕列表影片播放

PATRICK WINSTON: We've now almost completed our journey.
This will be it for talking about
several kinds of learning--
the venerable kind, that's the nearest neighbors and
identification tree types of learning.
Still useful, still the right thing to do if there's no
reason not to do the simple thing.
Then we have the biologically-inspired
approaches.
Neural nets.
All kinds of problems with local maxima and overfitting
and oscillation, if you get the rate constant too big.
Genetic algorithms.
Like neural nets, both are very naive in their attempt to
mimic nature.
So maybe they work on a class of problems.
They surely do each have a class of problems for which
they're good.
But as a general purpose first resort, I don't recommend it.
But now the theorists have come out and done some things
are very remarkable.
And in the end, you have to say, wow, these are such
powerful ideas.
I wonder if nature has discovered them, too?
Is there good engineering in the brain,
based on good science?
Or given the nature of evolution, is it just random
junk that is the best ways for doing anything?
Who knows?
But today, we're going to talk about an idea that I'll bet is
in there somewhere, because it's easy to implement, and
it's extremely powerful in what it does, and it's the
essential item in anybody's repertoire of learning
mechanisms.
It's also a mechanism which, if you understand only by
formula, you will never be able to work the problems on
the quiz, that's for sure.
Because on the surface, it looks like it'd be very
complicated to simulate this approach.
But once you understand how it works and look at a little bit
of the math and let it sing songs to you, it turns out to
be extremely easy.
So it's about letting multiple methods work in your behalf.
So far, we've been talking about using just one method to
do something.
And what we're going to do now is we're looking to see if a
crowd can be smarter than the individuals in the crowd.
But before we get too far down that abstract path, let me
just say that the whole works has to do with classification,
and binary classification.
Am I holding a piece of chalk in my hand, or a hand grenade?
Is that a cup of coffee or tea?
Those are binary classification problems.
And so we're going to be talking today strictly about
binary classification.
We're not going to be talking about finding the right letter
in the alphabet that's written on the page.
That's a 26-way choice.
We're talking about binary choices.
So we assume that there's a set of classifiers
that we can draw on.
Here's one--
h.
And it produces either a minus 1 or a plus 1.
So that's how the classification is done.
If it's coffee, plus 1.
If it's tea, minus 1.
Is this chalk, plus one.
If it's a hand grenade, minus 1.
So that's how the classification works.
Now, too bad for us, normally the world doesn't give us very
good classifiers.
So if we look at the error rate of this classifier or any
other classifier, that error rate will range from 0 to 1 in
terms of the fraction of the cases got
wrong on a sample set.
So you'd like your error rate to be way down here.
You're dead if it's over there.
But what about in the middle?
What if it's, say, right there.
Just a little bit better than flipping a coin.
If it's just a little bit better than flipping a coin,
that's a weak classifier.
And the question is, can you make a classifier that's way
over here, like there, a strong classifier, by
combining several of these weak classifiers, and
letting them vote?
So how would you do that?
You might say, well, let us make a big classifier capital
H, that works on some sample x, and has its output produces
something that depends on the sum of the outputs of the
individual classifiers.
So we have H1 working on x.
We have H2 working on x.
And we have H3 also working on x.
Let's say three of them, just to start us off.
And now let's add those guys up, and take
the sign of the output.
So if two out of the three of those guys agree, then we'll
get an either plus 1 or minus 1.
If all three agree, we'll get plus 1 or minus 1.
Because we're just taking the sign.
We're just taking the sign of the sum of these guys.
So this means that one guy can be wrong, as long as the other
two guys are right.
But I think it's easier to see how this all works if you
think of some space of samples, you say, well, let's
let that area here be where H1 is wrong, and this area over
here is where H2 is wrong.
And then this area over here is where H3 is wrong.
So if the situation is like that, then this formula always
gives you the right answers on the samples.
I'm going to stop saying that right now, because I want to
be kind of a background thing on the samples set.
We're talking about wrapping this stuff
over the sample set.
Later on, we'll ask, OK, given that you trained this thing on
a sample set, how well does it do on some new examples?
Because we want to ask ourselves
about overfitting questions.
But for now, we just want to look and see if we believe
that this arrangement, where each of these H's is producing
plus 1 or minus 1, we're adding them up and taking the
sign, is that going to give us a better result than the tests
individually?
And if they look like this when draped over a sample set,
then it's clear that we're going to get the right answer
every time, because there's no area here where any two of
those tests are giving us the wrong answer.
So the two that are getting the right answer, in this
little circle here for H1, these other two are getting
the right answer.
So they'll outvote it, and you'll get the right answer
every time.
But it doesn't have to be that simple.
It could look like this.
There could be a situation where this
is H1, wrong answer.
This is H2, wrong answer.
And this is H3, wrong answer.
And now the situation gets a little bit more murky, because
we have to ask ourselves whether that area where three
out of the three get it wrong is sufficiently big so as to
be worse than 1 of the individual tests.
So if you look at that Venn diagram, and stare at it long
enough, and try some things, you can say, well, there is no
case where this will give a worse answer.
Or, you might end up with the conclusion that there are
cases where we can arrange those circles such that the
voting scheme will give an answer that's worst than an
individual test, but I'm not going to tell you the answer,
because I think we'll make that a quiz question.
Good idea?
OK.
So we'll make that a quiz question.
So that looks like a good idea.
And we can construct a little algorithm that will help us
pick the particular weak classifiers to plug in here.
We've got a whole bag of classifiers.
We've got H1, we've got H2, we've got H55.
We've got a lot of them we can choose from.
So what we're going to do is we're going to use the data,
undisturbed, to produce H1.
We're just going to try all the tests on the data and see
which one gives us the smallest error rate.
And that's the good guy, so we're going to use that.
Then we're going to use the data with an
exaggeration of H1 errors.
In other words--
this is a critical idea.
What we're going to do is we're going to run this
algorithm again, but instead of just looking at the number
of samples that are got wrong, what we're going to do is
we're going to look at a distorted set of samples,
where the ones we're not doing well on has exaggerated effect
on the result.
So we're going to weight them or multiply them, or do
something so that we're going to pay more attention to the
samples on which H1 produces an error, and that's going to
give us H2.
And then we're going to do it one more time, because we've
got three things to go with here in this particular little
exploratory scheme.
And this time, we're going to have an
exaggeration of those samples--
which samples are we going to exaggerate now?
We might as well look for the ones where H1 gives us a
different answer from H2, because we want to be on the
good guy's side.
So we can say we're going to exaggerate those samples four
which H1 gives us a different result from H2.
And that's going to give us H3.
All right.
So we can think of this whole works here as part one of a
multi-part idea.
So let's see.
I don't know, what might be step two?
Well, this is a good idea.
Then what we've got that we can easily derive from that is
a little tree looked like this.
And we can say that H of x depends on H1, H2, and H3.
But now, if that that's a good idea, and that gives a better
answer than any of the individual tests, maybe we can
make this idea a little bit recursive, and say, well,
maybe H1 is actually not an atomic test.
But maybe it's the vote of three other tests.
So you can make a tree structure
that looks like this.
So this is H11, H12, H13, and then 3 here.
And then this will be H31, H32, H33.
And so that's a sort of get out the vote idea.
We're trying to get a whole bunch of individual
tests into the act.
So I guess the reason this wasn't discovered until about
'10 years ago was because you've got to get so many of
these desks all lined up before the idea gets through
that long filter of ideas.
So that's the only idea number two of quite a few.
Well, next thing we might think is, well, we keep
talking about these classifiers.
What kind of classifiers are we talking about?
I've got--
oh, shoot, I've spent my last nickel.
I don't have a coin to flip.
But that's one classifier, right?
The trouble with that classifier is it's a weak
classifier, because it gives me a 50/50
chance of being right.
I guess there are conditions in which a coin flip
is better than a--
it is a weak classifier.
If the two outcomes are not equally probable, than a coin
flip is a perfectly good weak classifier.
But what we're going to do is we're going to think in terms
of a different set of classifiers.
And we're going to call them decision tree.
Now, you remember decision trees, right?
But we're not going to build decision trees.
We're going to use decision tree stumps.
So if we have a two-dimensional space that
looks like this, then a decision tree stump is a
single test.
It's not a complete tree that will divide up the samples
into homogeneous groups.
It's just what you can do with one test.
So each possible test is a classifier.
How many tests do we get out of that?
12, right?
Yeah.
It doesn't look like 12 to me, either.
But here's how you get to 12.
One decision tree test you can stick in there would be that
test right there.
And that would be a complete decision tree stump.
But, of course, you can also put in this one.
That would be another decision tree stump.
Now, for this one on the right, I could say, everything
on the right is a minus.
Or, I could say, everything on the right is a plus.
It would happen to be wrong, but it's a valid test with a
valid outcome.
So that's how we double the number of test that
we have lines for.
And you know what?
can even have a kind of test out here that says everything
is plus, or everything is wrong.
So for each dimension, the number of decision tree stumps
is the number of lines I can put in times 2.
And then I've got two dimensions here, that's how I
got to twelve.
So there are three lines.
I can have the pluses on either the left
or the right side.
So that's six.
And then I've got two dimensions, so
that gives me 12.
So that's the decision tree stump idea.
And here are the other decision tree boundaries,
obviously just like that.
So that's one way can generate a batch of tests to try out
with this idea of using a lot of tests to help
you get the job done.
STUDENT: Couldn't you also have a decision tree on the
right side?
PATRICK WINSTON: The question is, can you also have a test
on the right side?
See, this is just a stand-in for saying, everything's plus
or everything's minus.
So it doesn't matter where you put the line.
It can be on the right side, or the left side, or the
bottom, or the top.
Or you don't have to put the line anywhere.
It's just an extra test, an additional to the ones you put
between the samples.
So this whole idea of boosting, the
main idea of the day.
Does it depend on using decision tree stumps?
The answer is no.
Do not be confused.
You can use boosting with any kind of classifier.
so why do I use decision tree stumps today?
Because it makes my life easy.
We can look at it, we can see what it's doing.
But we could put bunch of neural nets in there.
We could put a bunch of real decision trees in there.
We could put a bunch of nearest
neighbor things in there.
The boosting idea doesn't care.
I just used these decision tree stumps because I and
everybody else use them for illustration.
All right.
We're making progress.
Now, what's the error rate for any these tests
and lines we drew?
Well, I guess it'll be the error rate is equal to the sum
of 1 over n--
That's the total number of points,
the number of samples--
summed over the cases where we are wrong.
So gee, we're going to work on combining some of these ideas.
And we've got this notion of exaggeration.
At some stage in what we're doing here, we're going to
want to be able to exaggerate the effect of some errors
relative to other errors.
So one thing we can do is we can assume, or we can
stipulate, or we can assert that each of these samples has
a weight associated with it.
That's W1, this is W2, and that's W3.
And in the beginning, there's no reason to suppose that any
one of these is more or less important
than any of the other.
So in the beginning, W sub i at time [? stub ?] one is
equal to 1 over n.
So the error is just adding up the number of samples that
were got wrong.
And that'll be the fraction of samples to that
you didn't get right.
And that will be the error rate.
So what we want to do is we want to say, instead of using
this as the error rate for all time, what we want to do is we
want to move that over, and say that the error rate is
equal to the sum over the things you got wrong in the
current step, times the weights of those
that were got wrong.
So in step one, everything's got the same weight, it
doesn't matter.
But if we find a way to change their weights going
downstream--
so as to, for example, highly exaggerate that third sample,
then W3 will go up relative to W1 and W2.
The one thing we want to be sure of is there is no matter
how we adjust the weights, that the sum of the weights
over the whole space is equal to 1.
So in other words, we want to choose the weights so that
they emphasize some of the samples, but we also want to
put a constraint on the weights such that all of them
added together is summing to one.
And we'll say that that enforces a distribution.
A distribution is a set of weights that sum to one.
Well, that's just a nice idea.
So we're make a little progress.
We've got this idea that we can add some plus/minus 1
classifiers together, you get a better classifier.
We got some idea about how to do that.
It occurs to us that maybe we want to get a lot of
classifiers into the act somehow or another.
And maybe we want to think about using decision tree
stumps so as to ground out thinking about all this stuff.
So the next step is to say, well, how actually should we
combine this stuff?
And you will find, in the literature libraries, full of
papers that do stuff like that.
And that was state of the art for quite a few years.
But then people began to say, well, maybe we can build up
this classifier, H of x, in multiple steps and get a lot
of classifiers into the act.
So maybe we can say that the classifier is the sign of H--
that's the one we picked first.
That's the classifier we picked first.
That's looking at samples.
And then we've got H2.
And then we've got H3.
And then we've got how many other classifiers we might
want, or how many classifiers we might need in order to
correctly classify everything in our sample set.
So people began to think about whether there might be an
algorithm that would develop a classifier that way,
one step at a time.
That's why I put that step number in the exponent,
because we're picking this one at first, then we're expanding
it to have two, and then we're expanding it to have
three, and so on.
And each of those individual classifiers are separately
looking at the sample.
But of course, it would be natural to suppose that just
adding things up wouldn't be enough.
And it's not.
So it isn't too hard to invent the next idea, which is to
modify this thing just a little bit by doing what?
It looks almost like a scoring polynomial, doesn't it?
So what would we do to tart this up a little bit?
STUDENT: [INAUDIBLE].
PATRICK WINSTON: Come again?
Do what?
STUDENT: [INAUDIBLE].
PATRICK WINSTON: Somewhere out there someone's murmuring.
STUDENT: Add--
PATRICK WINSTON: Add weights!
STUDENT: --weights.
Yeah.
PATRICK WINSTON: Excellent.
Good idea.
So what we're going to do is we're going to have alphas
associated with each of these classifiers, and we're going
to determine if somebody can build that kind
formula to do the job.
So maybe I ought to modify this gold star idea before I
get too far downstream.
And we're not going to treat everybody in a crowd equally.
We're going to wait some of the opinions more than others.
And by the way, they're all going to make errors in
different parts of the space.
So maybe it's not the wisdom of even a weighted crowd, but
a crowd of experts.
Each of which is good at different parts of the space.
So anyhow, we've got this formula, and there are a few
things that one can say turn out.
But first, let's write down the an algorithm for what this
ought to look like.
Before I run out of space, I think I'll exploit the right
hand board here, and put the overall algorithm right here.
So we're going to start out by letting of all the weights at
time 1 be equal to 1 over n.
That's just saying that they're all equal in the
beginning, and they're equal to 1 over n.
And n is the number of samples.
And then, when I've got that, I want to
compute alpha, somehow.
Let's see.
No, I don't want to do that.
I want to
I want to pick a classifier the minimizes the error rate.
And then m, i, zes, error at time t.
And that's going to be at time t.
And we're going to come back in here.
That's why we put a step index in there.
So once we've picked a classifier that produces an
error rate, then we can use the error rate to
determine the alpha.
So I want the alpha over here.
That'll be sort of a byproduct of picking that test.
And with all that stuff in hand, maybe that will be
enough to calculate Wt plus 1.
So we're going to use that classifier that we just picked
to get some revised weights, and then we're going to go
around that loop until this classifier produces a perfect
set of conclusions on all the sample data.
So that's going to be our overall strategy.
Maybe we've got, if we're going to number these things,
that's the fourth big idea.
And this arrangement here is the fifth big idea.
Then we've got the sixth big idea.
And the sixth big idea says this.
Suppose that the weight on it ith sample at time t plus 1 is
equal to the weight at time t on that same sample, divided
by some normalizing factor, times e to the minus alpha at
time t, times h at time t, times some function y which is
a function of x, But not a function of time.
Now you say, where did this come from?
And the answer is, it did not spring from the heart of
mathematician in the first 10 minutes that he
looked at this problem.
In fact, when I asked [INAUDIBLE]
how this worked, he said, well, he was thinking about
this on the couch every Saturday for about a year, and
his wife was getting pretty sore, but he finally found it
and saved their marriage.
So where does stuff like this come from?
Really, it comes from knowing a lot of mathematics, and
seeing a lot of situations, and knowing that something
like this might be mathematically convenient.
Something like this might be mathematically convenient.
But we've got to back up a little and let it sing to us.
What's y?
We saw y last time.
The support vector machines.
That's just a function.
That's plus 1 or minus 1, depending on whether the
output ought to be plus 1 or minus 1.
So if this guy is giving the correct answer, and the
correct answer is plus, and then this guy will be plus 1
too, because it always gives you the correct answer.
So in that case, where this guy is giving the right
answer, these will have the same sign, so that will be a
plus 1 combination.
On the other hand, if that guy's giving the wrong answer,
you're going to get a minus 1 out of that combination.
So it's true even if the right answer should be minus, right?
So if the right answer should be minus, and this is plus,
then this will be minus 1, and the whole combination well
give you minus 1 again.
In other words, the y just flips the sign if you've got
the wrong answer, no matter whether the wrong answer is
plus 1 or minus 1.
These alphas--
shoot, those are the same alphas that are in this
formula up here, somehow.
And then that z, what's that for?
Well, if you just look at the previous weights, and its
exponential function to produce these W's for the next
generation, that's not going to be a distribution, because
they won't sum up to 1.
So what this thing here, this z is, that's a sort of
normalizer.
And that makes that whole combination of new
weights add up to 1.
So it's whatever you got by adding up all those guys, and
then dividing by that number.
Well, phew.
I don't know.
Now there's some it-turns-out-thats.
We're going to imagine that somebody's done the same sort
of thing we did to the support vector machines.
We're going to find a way to minimize the error.
And the error we're going to minimize is the error produced
by that whole thing up there in 4.
We're going to minimize the error of that entire
expression as we go along.
And what we discover when we do the appropriate
differentiations and stuff--
you know, that's what we do in calculus--
what we discover is that you get minimum error for the
whole thing if alpha is equal to 1 minus the error rate at
time t, divided by the error rate at time t.
Now let's take the logarithm of that, and
multiply it by half.
And that's what [INAUDIBLE]
was struggling to find.
But we haven't quite got it right.
And so let me add this in separate chunks, so we don't
get confused about this.
It's a bound on that expression up there.
It's a bound on the error rate produced by that expression.
So interestingly enough, this means that the error rate can
actually go up as you add terms to this formula.
all you know is that the error rate is going to be bounded by
an exponentially decaying function.
So it's eventually guaranteed to converge on zero.
So it's a minimal error bound.
It turns out to be exponential.
Well, there it is.
We're done.
Would you like to see a demonstration?
Yeah, OK.
Because you look at that, and you say, well, how could
anything like that possibly work?
And the answer is, surprisingly enough, here's
what happens.
There's a simple little example.
So that's the first test chosen.
the greens are pluses and the reds are minuses, so it's
still got an error.
Still got an error-- boom.
There, in two steps.
It now has--
we can look in the upper right hand corner--
we see its used three classifiers, and we see that
one of those classifiers says that everybody belongs to a
particular class, three different weights.
And the error rate has converged to 0.
So let's look at a couple of other ones.
Here is the one I use for debugging this thing.
We'll let that run.
See how fast it is?
Boom.
It converges to getting all the samples right very fast.
Here's another one.
This is one we gave on an exam a few years back.
First test.
Oh, I let it run, so it got everything
instantaneously right.
Let's take that through step at a time.
There's the first one, second one.
Still got a lot of errors.
Ah, the error rate's dropping.
And then flattened, flattened, and it goes to 0.
Cool, don't you think?
But you say to me, bah, who cares about that stuff?
Let's try something more interesting.
There's one.
That was pretty fast, too.
Well, there's not too many samples here.
So we can try this.
So there's an array of pluses and minuses.
Boom.
You can see how that error rate is bounded by an
exponential?
So in a bottom graph, you've got the number of classifiers
involved, and that goes up to a total, eventually, of 10.
You can see how positive or negative each of the
classifiers that's added is by looking at
this particular tab.
And this just shows how they evolve over time.
But the progress thing here is the most interesting.
And now you say to me, well, how did the machine do that?
And it's all right here.
We use an alpha that looks like this.
And that allows us to compute the new weights.
It says we've got a preliminary calculation.
We've got to find a z that does the normalization.
And we sure better bring our calculator, because we've got,
first of all, to calculate the error rate.
Then we've got to take its logarithm, divide by 2, plug
it into that formula, take the exponent, and that gives us
the new weight.
And that's how the program works.
And if you try that, I guarantee you
will flunk the exam.
Now, I don't care about my computer.
I really don't.
It's a slave, and it can calculate these logarithm and
exponentials till it turns blue, and I don't care.
Because I've got four cores or something, and who cares.
Might as well do this, than sit around
just burning up heat.
But you don't want to do that.
So what you want to do is you want to know how to do this
sort of thing more expeditiously.
So we're going to have to let them the math sing to us a
little bit, with a view towards finding better ways of
doing this sort of thing.
So let's do that.
And we're going to run out of space here before long, so let
me reclaim as much of this board as I can.
So what I'm going to do is I'm going to say, well, now that
we've got this formula for alpha that relates alpha t to
the error, then I can plug that into this formula up
here, number 6.
And what I'll get is that the weight of t plus 1 is equal to
the weight at t divided by that normalizing factor,
multiplied times something that depends on whether it's
categorized correctly or not.
That's what that y's in their for, right?
So we've got a logarithm here, and we got a sign flipper up
there in terms of that H of x and y combination.
So if the sign of that whole thing at minus alpha and that
y H combination turns out to be negative, then we're going
to have to flip the numerator and denominator here in this
logarithm, right?
And oh, by the way, since we've got a half out here,
that turns out to be the square root of that term
inside the logarithm.
So when we carefully do that, what we discover is that it
depends on whether it's the right thing or not.
But what it turns out to be is something like a multiplier of
the square root.
Better be careful, here.
The square root of what?
STUDENT: [INAUDIBLE].
PATRICK WINSTON: Well, let's see.
But we have to be careful.
So let's suppose that this is 4 things that we get correct.
So if we get it correct, then we're going to get the same
sign out of H of x and y.
We've get a minus sign out there, so we're going to flip
the numerator and denominator.
So we're going to get the square root of e of t over 1
minus epsilon of t if that's correct.
If it's wrong, it'll just be the flip of that.
So it'll be the square root of 1 minus the error rate over
the error rate.
Everybody with me on that?
I think that's right.
If it's wrong, I'll have to hang myself and wear a paper
bag over my head like I did last year.
But let's see if we can make this go correctly this time.
So now, we've got this guy here, we've got everything
plugged in all right, and we know that now this z ought to
be selected so that it's equal to the sum of this guy
multiplied by these things as appropriate for whether it's
correct or not.
Because we want, in the end, for all of these w's
to add up to 1.
So let's see what they add up to without the z there.
So what we know is that it must be the case that if we
add over the correct ones, we get the square root of the
error rate over 1 minus the rate of the Wt plus 1.
Plus now we've got the sum of 1 minus the error rate over
the error rate, times the sum of the Wi at time t for wrong.
So that's what we get if we added all these
up without the z.
So since everything has to add up to 1, then z ought to be
equal to this sum.
That looks pretty horrible, until we realize that if we
add these guys up over the weights that are wrong, that
is the error rate.
This is e.
So therefore, z is equal the square root of the error rate
times 1 minus the error rate.
That's the contribution of this term.
Now, let's see.
What is the sum of the weights over the
ones that are correct?
Well, that must be 1 minus the error rate.
Ah, so this thing gives you the same result as this one.
So z is equal to 2 times that.
And that's a good thing.
Now we are getting somewhere.
Because now, it becomes a little bit easier to write
some things down.
Well, we're way past this, so let's get rid of this.
And now we can put some things together.
Let me point out what I'm putting together.
I've got an expression for z right here.
And I've got an expression for the new w's here.
So let's put those together and say that w of t plus 1 is
equal to w of t.
I guess we're going to divide that by 2.
And then we've got this square root times that expression.
So if we take that correct one, and divide by that one,
then the [INAUDIBLE]
cancel out, and I get 1 over 1 minus the error rate.
That's it.
That's correct.
And if it's not correct, then it's Wt over 2--
and working through the math--
1 over epsilon, if wrong.
Do we feel like we're making any progress?
No.
Because we haven't let it sing to us enough yet.
So I want to draw your attention to what happens to
amateur rock climbers when they're halfway
up a difficult cliff.
They're usually [INAUDIBLE], sometimes they're not.
If they're not, they're scared to death.
And every once in a while, as they're just about to fall,
they find some little tiny hole to stick a fingernail in,
and that keeps them from falling.
That's called a thank-god hole.
So what I'm about to introduce is the analog of those little
places where you can stick your fingernail in.
It's the thank-god hole for dealing
with boosting problems.
So what happens if I add all these [? Wi ?]
up for the ones that the classifier where produces a
correct answer on?
Well, it'll be 1 over 2, and 1 over 1 minus epsilon, times
the sum of the Wt for which the answer was correct.
What's this sum?
Oh!
My goddess.
1 minus epsilon.
So what I've just discovered is that if I sum new w's over
those samples for which I got a correct answer,
it's equal to 1/2.
And guess what?
That means that if I sum them over wrong, it's equal to 1/2
half as well.
So that means that I take all of the weight for which I got
the right answer with the previous test, and those ways
will add up to something.
And to get the weights for the next generation, all I have to
do is scale them so that they equal half.
This was not noticed by the people who
developed this stuff.
This was noticed by Luis Ortiz, who was a 6.034
instructor a few years ago.
The sum of those weights is going to be a scaled version
of what they were before.
So you take all the weights for which this new
classifier--
this one you selected to give you the minimum weight on the
re-weighted stuff--
you take the ones that it gives a correct answer for,
and you take all of those weights, and you just scale
them so they add up to 1/2.
So do you have to compute any logarithms?
No.
Do you have to compute any exponentials?
No.
Do you have to calculate z?
No.
Do you have to calculate alpha to get the new weights?
No.
All you have to do is scale them.
And that's a pretty good thank-god hole.
So that's thank-god hole number one.
Now, for thank-god hole number two, we need to go back and
think about the fact that were going to give you problems in
probability that involve decision tree stumps.
And there are a lot of decision tree stumps that you
might have to pick from.
So we need a thank-god hole for deciding how
to deal with that.
Where can I find some room?
How about right here.
Suppose you've got a space that looks like this.
I'm just makings this up at random.
So how many--
let's see.
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.
How many tests do I have to consider in that dimension?
11.
It's 1 plus the number of samples.
That would be horrible.
I don't know.
Do I have actually calculate this one?
How could that possibly be better than that one?
It's got one more thing wrong.
So that one makes sense.
The other one doesn't make sense.
So in the end, no test that lies between two correctly
classified samples will ever be any good.
So that one's a good guy, and that one's a good guy.
And this one's a bad guy.
Bad guy, bad guy bad guy, bad guy.
Bad guy, bad guy, bad buy.
So the actual number of tests you've got is three.
And likewise, in the other dimension--
well, I haven't drawn it so well here, but would this test
be a good one?
No.
That one?
No.
Actually, I'd better look over here on the right and see what
I've got before I draw too many conclusions.
Let's look over this, since I don't want to think too hard
about what's going on in the other dimension.
But the idea is that very few of those
tests actually matter.
Now, you say to me, there's one last thing.
What about overfitting?
Because all this does is drape a solution over the samples.
And like support vector machines overfit, neural maps
overfit, identification trees overfit.
Guess what?
This doesn't seem to overfit.
That's an experimental result for which the
literature is confused.
It goes back to providing an explanation.
So this stuff is tried on all sorts of problems, like
handwriting recognition, understanding speech, all
sorts of stuff uses boosting.
And unlike other methods, for some reason as yet imperfectly
understood, it doesn't seem to overfit.
But in the end, they leave no stone unturned in 6.034.
Every time we do this, we do some additional experiments.
So here's a sample that I'll leave you with.
Here's a situation in which we have a 10-dimensional space.
We've made a fake distribution, and then we put
in that boxed outlier.
That was just put into the space at random, so it can be
viewed as an error point.
So now what we're going to do is we're going to see what
happens when we run that guy.
And sure enough, in 17 steps, it finds a solution.
But maybe it's overfit that little guy who's an error.
But one thing you can do is you can say, well, all of
these classifiers are dividing this space up into chunks, and
we can compute the size of the space occupied by any sample.
So one thing we can do--
alas, I'll have to get up a new demonstration.
One thing we can do, now that this guy's over here, we can
switch the volume tab and watch how the volume occupied
by that error point evolves as we solve the problem.
So look what happens.
This is, of course, randomly generated.
I'm counting on this working.
Never failed before.
So it originally starts out as occupying 26%
of the total volume.
It ends up occupying 1.4 times 10 to the
minus 3rd% of the volume.
So what tends to happen is that these decision tree
stumps tend to wrap themselves so tightly around the error
points, there's no room for overfitting, because nothing
else will fit in that same volume.
So that's why I think that this thing tends to produce
solutions which don't overfit.
So in conclusion, this is magic.
You always want to use it.
It'll work with any kind of [? speed ?] of
classifiers you want.
And you should understand it very thoroughly, because of
anything is useful in the subject in dimension learning,
this is it.