  • PATRICK WINSTON: We've now almost completed our journey.

  • This will be it for talking about

  • several kinds of learning--

  • the venerable kind, that's the nearest neighbors and

  • identification tree types of learning.

  • Still useful, still the right thing to do if there's no

  • reason not to do the simple thing.

  • Then we have the biologically-inspired

  • approaches.

  • Neural nets.

  • All kinds of problems with local maxima and overfitting

  • and oscillation, if you get the rate constant too big.

  • Genetic algorithms.

  • Like neural nets, both are very naive in their attempt to

  • mimic nature.

  • So maybe they work on a class of problems.

  • They surely do each have a class of problems for which

  • they're good.

  • But as a general purpose first resort, I don't recommend it.

  • But now the theorists have come out and done some things

  • that are very remarkable.

  • And in the end, you have to say, wow, these are such

  • powerful ideas.

  • I wonder if nature has discovered them, too?

  • Is there good engineering in the brain,

  • based on good science?

  • Or given the nature of evolution, is it just random

  • junk that is the best ways for doing anything?

  • Who knows?

  • But today, we're going to talk about an idea that I'll bet is

  • in there somewhere, because it's easy to implement, and

  • it's extremely powerful in what it does, and it's the

  • essential item in anybody's repertoire of learning

  • mechanisms.

  • It's also a mechanism which, if you understand only by

  • formula, you will never be able to work the problems on

  • the quiz, that's for sure.

  • Because on the surface, it looks like it'd be very

  • complicated to simulate this approach.

  • But once you understand how it works and look at a little bit

  • of the math and let it sing songs to you, it turns out to

  • be extremely easy.

  • So it's about letting multiple methods work on your behalf.

  • So far, we've been talking about using just one method to

  • do something.

  • And what we're going to do now is we're looking to see if a

  • crowd can be smarter than the individuals in the crowd.

  • But before we get too far down that abstract path, let me

  • just say that the whole works has to do with classification,

  • and binary classification.

  • Am I holding a piece of chalk in my hand, or a hand grenade?

  • Is that a cup of coffee or tea?

  • Those are binary classification problems.

  • And so we're going to be talking today strictly about

  • binary classification.

  • We're not going to be talking about finding the right letter

  • in the alphabet that's written on the page.

  • That's a 26-way choice.

  • We're talking about binary choices.

  • So we assume that there's a set of classifiers

  • that we can draw on.

  • Here's one--

  • h.

  • And it produces either a minus 1 or a plus 1.

  • So that's how the classification is done.

  • If it's coffee, plus 1.

  • If it's tea, minus 1.

  • If it's chalk, plus 1.

  • If it's a hand grenade, minus 1.

  • So that's how the classification works.

  • Now, too bad for us, normally the world doesn't give us very

  • good classifiers.

  • So if we look at the error rate of this classifier or any

  • other classifier, that error rate will range from 0 to 1 in

  • terms of the fraction of the cases got

  • wrong on a sample set.

  • So you'd like your error rate to be way down here.

  • You're dead if it's over there.

  • But what about in the middle?

  • What if it's, say, right there.

  • Just a little bit better than flipping a coin.

  • If it's just a little bit better than flipping a coin,

  • that's a weak classifier.

  • And the question is, can you make a classifier that's way

  • over here, like there, a strong classifier, by

  • combining several of these weak classifiers, and

  • letting them vote?

  • So how would you do that?

  • You might say, well, let us make a big classifier capital

  • H, that works on some sample x, and whose output is

  • something that depends on the sum of the outputs of the

  • individual classifiers.

  • So we have H1 working on x.

  • We have H2 working on x.

  • And we have H3 also working on x.

  • Let's say three of them, just to start us off.

  • And now let's add those guys up, and take

  • the sign of the output.

  • So if two out of the three of those guys agree, then we'll

  • get an either plus 1 or minus 1.

  • If all three agree, we'll get plus 1 or minus 1.

  • Because we're just taking the sign.

  • We're just taking the sign of the sum of these guys.

  • So this means that one guy can be wrong, as long as the other

  • two guys are right.
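
The vote described above can be sketched in a few lines of Python. This is only an illustration: the three threshold tests `h1`, `h2`, `h3` and the sample values are made up for the example and do not come from the lecture.

```python
def sign(value):
    """Return +1 for positive input, -1 for negative input (0 stays 0)."""
    return (value > 0) - (value < 0)


def majority_vote(classifiers, x):
    """Combine +1/-1 classifiers by taking the sign of the sum of their outputs."""
    return sign(sum(h(x) for h in classifiers))


# Three hypothetical weak classifiers on a number x (assumed for illustration).
h1 = lambda x: 1 if x > 2 else -1
h2 = lambda x: 1 if x > 5 else -1
h3 = lambda x: 1 if x > 8 else -1

# Each sample is classified by whichever answer at least two of the three give.
votes = [majority_vote([h1, h2, h3], x) for x in (1, 4, 6, 9)]
print(votes)
```

With an odd number of plus/minus 1 voters the sum is never zero, so the sign is always defined.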

  • But I think it's easier to see how this all works if you

  • think of some space of samples, you say, well, let's

  • let that area here be where H1 is wrong, and this area over

  • here is where H2 is wrong.

  • And then this area over here is where H3 is wrong.

  • So if the situation is like that, then this formula always

  • gives you the right answers on the samples.

  • I'm going to stop saying that right now, because I want that to

  • be kind of a background thing: on the sample set.

  • We're talking about wrapping this stuff

  • over the sample set.

  • Later on, we'll ask, OK, given that you trained this thing on

  • a sample set, how well does it do on some new examples?

  • Because we want to ask ourselves

  • about overfitting questions.

  • But for now, we just want to look and see if we believe

  • that this arrangement, where each of these H's is producing

  • plus 1 or minus 1, we're adding them up and taking the

  • sign, is that going to give us a better result than the tests

  • individually?

  • And if they look like this when draped over a sample set,

  • then it's clear that we're going to get the right answer

  • every time, because there's no area here where any two of

  • those tests are giving us the wrong answer.

  • So the two that are getting the right answer, in this

  • little circle here for H1, these other two are getting

  • the right answer.

  • So they'll outvote it, and you'll get the right answer

  • every time.

  • But it doesn't have to be that simple.

  • It could look like this.

  • There could be a situation where this

  • is H1, wrong answer.

  • This is H2, wrong answer.

  • And this is H3, wrong answer.

  • And now the situation gets a little bit more murky, because

  • we have to ask ourselves whether that area where three

  • out of the three get it wrong is sufficiently big so as to

  • be worse than 1 of the individual tests.

  • So if you look at that Venn diagram, and stare at it long

  • enough, and try some things, you can say, well, there is no

  • case where this will give a worse answer.

  • Or, you might end up with the conclusion that there are

  • cases where we can arrange those circles such that the

  • voting scheme will give an answer that's worse than an

  • individual test, but I'm not going to tell you the answer,

  • because I think we'll make that a quiz question.

  • Good idea?

  • OK.

  • So we'll make that a quiz question.

  • So that looks like a good idea.

  • And we can construct a little algorithm that will help us

  • pick the particular weak classifiers to plug in here.

  • We've got a whole bag of classifiers.

  • We've got H1, we've got H2, we've got H55.

  • We've got a lot of them we can choose from.

  • So what we're going to do is we're going to use the data,

  • undisturbed, to produce H1.

  • We're just going to try all the tests on the data and see

  • which one gives us the smallest error rate.

  • And that's the good guy, so we're going to use that.

  • Then we're going to use the data with an

  • exaggeration of H1 errors.

  • In other words--

  • this is a critical idea.

  • What we're going to do is we're going to run this

  • algorithm again, but instead of just looking at the number

  • of samples that are got wrong, what we're going to do is

  • we're going to look at a distorted set of samples,

  • where the ones we're not doing well on have an exaggerated effect

  • on the result.

  • So we're going to weight them or multiply them, or do

  • something so that we're going to pay more attention to the

  • samples on which H1 produces an error, and that's going to

  • give us H2.

  • And then we're going to do it one more time, because we've

  • got three things to go with here in this particular little

  • exploratory scheme.

  • And this time, we're going to have an

  • exaggeration of those samples--

  • which samples are we going to exaggerate now?

  • We might as well look for the ones where H1 gives us a

  • different answer from H2, because we want to be on the

  • good guy's side.

  • So we can say we're going to exaggerate those samples for

  • which H1 gives us a different result from H2.

  • And that's going to give us H3.

  • All right.

  • So we can think of this whole works here as part one of a

  • multi-part idea.

  • So let's see.

  • I don't know, what might be step two?

  • Well, this is a good idea.

  • Then what we've got that we can easily derive from that is

  • a little tree that looks like this.

  • And we can say that H of x depends on H1, H2, and H3.

  • But now, if that's a good idea, and that gives a better

  • answer than any of the individual tests, maybe we can

  • make this idea a little bit recursive, and say, well,

  • maybe H1 is actually not an atomic test.

  • But maybe it's the vote of three other tests.

  • So you can make a tree structure

  • that looks like this.

  • So this is H11, H12, H13, and then 3 here.

  • And then this will be H31, H32, H33.

  • And so that's a sort of get out the vote idea.

  • We're trying to get a whole bunch of individual

  • tests into the act.

  • So I guess the reason this wasn't discovered until about

  • 10 years ago was because you've got to get so many of

  • these ducks all lined up before the idea gets through

  • that long filter of ideas.

  • So that's only idea number two of quite a few.

  • Well, next thing we might think is, well, we keep

  • talking about these classifiers.

  • What kind of classifiers are we talking about?

  • I've got--

  • oh, shoot, I've spent my last nickel.

  • I don't have a coin to flip.

  • But that's one classifier, right?

  • The trouble with that classifier is it's a weak

  • classifier, because it gives me a 50/50

  • chance of being right.

  • I guess there are conditions in which a coin flip

  • is better than a--

  • it is a weak classifier.

  • If the two outcomes are not equally probable, then a coin

  • flip is a perfectly good weak classifier.

  • But what we're going to do is we're going to think in terms

  • of a different set of classifiers.

  • And we're going to call them decision trees.

  • Now, you remember decision trees, right?

  • But we're not going to build decision trees.

  • We're going to use decision tree stumps.

  • So if we have a two-dimensional space that

  • looks like this, then a decision tree stump is a

  • single test.

  • It's not a complete tree that will divide up the samples

  • into homogeneous groups.

  • It's just what you can do with one test.

  • So each possible test is a classifier.

  • How many tests do we get out of that?

  • 12, right?

  • Yeah.

  • It doesn't look like 12 to me, either.

  • But here's how you get to 12.

  • One decision tree test you can stick in there would be that

  • test right there.

  • And that would be a complete decision tree stump.

  • But, of course, you can also put in this one.

  • That would be another decision tree stump.

  • Now, for this one on the right, I could say, everything

  • on the right is a minus.

  • Or, I could say, everything on the right is a plus.

  • It would happen to be wrong, but it's a valid test with a

  • valid outcome.

  • So that's how we double the number of tests that

  • we have lines for.

  • And you know what?

  • You can even have a kind of test out here that says everything

  • is plus, or everything is minus.

  • So for each dimension, the number of decision tree stumps

  • is the number of lines I can put in times 2.

  • And then I've got two dimensions here, that's how I

  • got to twelve.

  • So there are three lines.

  • I can have the pluses on either the left

  • or the right side.

  • So that's six.

  • And then I've got two dimensions, so

  • that gives me 12.

  • So that's the decision tree stump idea.

  • And here are the other decision tree boundaries,

  • obviously just like that.

  • So that's one way you can generate a batch of tests to try out

  • with this idea of using a lot of tests to help

  • you get the job done.
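
The counting argument can be checked with a short Python sketch. The sample coordinates below are hypothetical (the board drawing is not part of the transcript); they are chosen so that each dimension has three distinct values, which gives two lines between samples plus one line outside the data, that is, three lines per dimension.

```python
def enumerate_stumps(points):
    """List every decision tree stump as (dimension, threshold, sign_above).

    A stump predicts sign_above for samples whose coordinate in `dimension`
    exceeds `threshold`, and -sign_above otherwise.
    """
    stumps = []
    n_dims = len(points[0])
    for dim in range(n_dims):
        values = sorted({p[dim] for p in points})
        # One line between each pair of adjacent distinct values ...
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        # ... plus one line outside the data: "everything plus" or "everything minus".
        thresholds.append(values[0] - 1)
        for threshold in thresholds:
            for sign_above in (+1, -1):
                stumps.append((dim, threshold, sign_above))
    return stumps


def stump_predict(stump, point):
    """Apply one stump to one sample and return +1 or -1."""
    dim, threshold, sign_above = stump
    return sign_above if point[dim] > threshold else -sign_above


# Hypothetical samples with three distinct values along each axis.
samples = [(1, 5), (3, 1), (5, 3)]
stumps = enumerate_stumps(samples)
print(len(stumps))  # 3 lines x 2 polarities x 2 dimensions = 12
```

Placing the extra line below the smallest value is an arbitrary choice; as the answer to the student's question points out, where that line sits does not matter.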

  • STUDENT: Couldn't you also have a decision tree on the

  • right side?

  • PATRICK WINSTON: The question is, can you also have a test

  • on the right side?

  • See, this is just a stand-in for saying, everything's plus

  • or everything's minus.

  • So it doesn't matter where you put the line.

  • It can be on the right side, or the left side, or the

  • bottom, or the top.

  • Or you don't have to put the line anywhere.

  • It's just an extra test, in addition to the ones you put

  • between the samples.

  • So this whole idea of boosting, the

  • main idea of the day.

  • Does it depend on using decision tree stumps?

  • The answer is no.

  • Do not be confused.

  • You can use boosting with any kind of classifier.

  • So why do I use decision tree stumps today?

  • Because it makes my life easy.

  • We can look at it, we can see what it's doing.

  • But we could put a bunch of neural nets in there.

  • We could put a bunch of real decision trees in there.

  • We could put a bunch of nearest

  • neighbor things in there.

  • The boosting idea doesn't care.

  • I just used these decision tree stumps because I and

  • everybody else use them for illustration.

  • All right.

  • We're making progress.

  • Now, what's the error rate for any of these tests

  • and lines we drew?

  • Well, I guess it'll be the error rate is equal to the sum

  • of 1 over n--

  • That's the total number of points,

  • the number of samples--

  • summed over the cases where we are wrong.

  • So gee, we're going to work on combining some of these ideas.

  • And we've got this notion of exaggeration.

  • At some stage in what we're doing here, we're going to

  • want to be able to exaggerate the effect of some errors

  • relative to other errors.

  • So one thing we can do is we can assume, or we can

  • stipulate, or we can assert that each of these samples has

  • a weight associated with it.

  • That's W1, this is W2, and that's W3.

  • And in the beginning, there's no reason to suppose that any

  • one of these is more or less important

  • than any of the other.

  • So in the beginning, W sub i at time step one is

  • equal to 1 over n.

  • So the error is just adding up the number of samples that

  • were got wrong.

  • And that'll be the fraction of samples that

  • you didn't get right.

  • And that will be the error rate.

  • So what we want to do is we want to say, instead of using

  • this as the error rate for all time, what we want to do is we

  • want to move that over, and say that the error rate is

  • equal to the sum over the things you got wrong in the

  • current step, times the weights of those

  • that were got wrong.

  • So in step one, everything's got the same weight, it

  • doesn't matter.

  • But if we find a way to change their weights going

  • downstream--

  • so as to, for example, highly exaggerate that third sample,

  • then W3 will go up relative to W1 and W2.

  • The one thing we want to be sure of is that, no matter

  • how we adjust the weights, the sum of the weights

  • over the whole space is equal to 1.

  • So in other words, we want to choose the weights so that

  • they emphasize some of the samples, but we also want to

  • put a constraint on the weights such that all of them

  • added together is summing to one.

  • And we'll say that that enforces a distribution.

  • A distribution is a set of weights that sum to one.
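
A minimal sketch of the weighted error rate, assuming the labels and predictions are stored as plain lists of plus/minus 1. The four samples and the "skewed" weights are invented for illustration; only the rule (sum the weights of the samples that were got wrong, with all weights summing to 1) comes from the lecture.

```python
def weighted_error(weights, predictions, labels):
    """Sum the weights of the samples whose prediction disagrees with the label."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)


labels      = [+1, +1, -1, -1]
predictions = [+1, -1, -1, -1]   # the second sample is misclassified

uniform = [1 / 4] * 4            # step one: every weight is 1/n
skewed  = [0.1, 0.7, 0.1, 0.1]   # a later step that exaggerates sample two

error_uniform = weighted_error(uniform, predictions, labels)
error_skewed = weighted_error(skewed, predictions, labels)
print(error_uniform, error_skewed)
```

With uniform weights the result is just the fraction of samples got wrong (one in four here). With the skewed weights the same single mistake costs far more, which is the exaggeration effect.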

  • Well, that's just a nice idea.

  • So we're making a little progress.

  • We've got this idea that we can add some plus/minus 1

  • classifiers together, you get a better classifier.

  • We got some idea about how to do that.

  • It occurs to us that maybe we want to get a lot of

  • classifiers into the act somehow or another.

  • And maybe we want to think about using decision tree

  • stumps so as to ground out thinking about all this stuff.

  • So the next step is to say, well, how actually should we

  • combine this stuff?

  • And you will find, in the literature, libraries full of

  • papers that do stuff like that.

  • And that was state of the art for quite a few years.

  • But then people began to say, well, maybe we can build up

  • this classifier, H of x, in multiple steps and get a lot

  • of classifiers into the act.

  • So maybe we can say that the classifier is the sign of H--

  • that's the one we picked first.

  • That's the classifier we picked first.

  • That's looking at samples.

  • And then we've got H2.

  • And then we've got H3.

  • And then we've got how many other classifiers we might

  • want, or how many classifiers we might need in order to

  • correctly classify everything in our sample set.

  • So people began to think about whether there might be an

  • algorithm that would develop a classifier that way,

  • one step at a time.

  • That's why I put that step number in the exponent,

  • because we're picking this one at first, then we're expanding

  • it to have two, and then we're expanding it to have

  • three, and so on.

  • And each of those individual classifiers are separately

  • looking at the sample.

  • But of course, it would be natural to suppose that just

  • adding things up wouldn't be enough.

  • And it's not.

  • So it isn't too hard to invent the next idea, which is to

  • modify this thing just a little bit by doing what?

  • It looks almost like a scoring polynomial, doesn't it?

  • So what would we do to tart this up a little bit?

  • STUDENT: [INAUDIBLE].

  • PATRICK WINSTON: Come again?

  • Do what?

  • STUDENT: [INAUDIBLE].

  • PATRICK WINSTON: Somewhere out there someone's murmuring.

  • STUDENT: Add--

  • PATRICK WINSTON: Add weights!

  • STUDENT: --weights.

  • Yeah.

  • PATRICK WINSTON: Excellent.

  • Good idea.

  • So what we're going to do is we're going to have alphas

  • associated with each of these classifiers, and we're going

  • to determine if somebody can build that kind of

  • formula to do the job.
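
Written out, the weighted vote has the following form. The superscripts are step indices, not powers, following the board convention mentioned earlier; the lowercase h for the individual classifiers is a notational assumption made here to keep them distinct from the combined classifier H.

```latex
H(x) = \operatorname{sign}\!\left( \alpha^{1} h^{1}(x) + \alpha^{2} h^{2}(x) + \alpha^{3} h^{3}(x) + \cdots \right)
```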

  • So maybe I ought to modify this gold star idea before I

  • get too far downstream.

  • And we're not going to treat everybody in a crowd equally.

  • We're going to weight some of the opinions more than others.

  • And by the way, they're all going to make errors in

  • different parts of the space.

  • So maybe it's not the wisdom of even a weighted crowd, but

  • a crowd of experts.

  • Each of which is good at different parts of the space.

  • So anyhow, we've got this formula, and there are a few

  • things that one can say turn out.

  • But first, let's write down an algorithm for what this

  • ought to look like.

  • Before I run out of space, I think I'll exploit the right

  • hand board here, and put the overall algorithm right here.

  • So we're going to start out by letting all of the weights at

  • time 1 be equal to 1 over n.

  • That's just saying that they're all equal in the

  • beginning, and they're equal to 1 over n.

  • And n is the number of samples.

  • And then, when I've got that, I want to

  • compute alpha, somehow.

  • Let's see.

  • No, I don't want to do that.

  • I want to pick a classifier that minimizes the error rate.

  • It minimizes the error at time t.

  • And that's going to be at time t.

  • And we're going to come back in here.

  • That's why we put a step index in there.

  • So once we've picked a classifier that produces an

  • error rate, then we can use the error rate to

  • determine the alpha.

  • So I want the alpha over here.

  • That'll be sort of a byproduct of picking that test.

  • And with all that stuff in hand, maybe that will be

  • enough to calculate Wt plus 1.

  • So we're going to use that classifier that we just picked

  • to get some revised weights, and then we're going to go

  • around that loop until this classifier produces a perfect

  • set of conclusions on all the sample data.

  • So that's going to be our overall strategy.

  • Maybe we've got, if we're going to number these things,

  • that's the fourth big idea.

  • And this arrangement here is the fifth big idea.

  • Then we've got the sixth big idea.

  • And the sixth big idea says this.

  • Suppose that the weight on the ith sample at time t plus 1 is

  • equal to the weight at time t on that same sample, divided

  • by some normalizing factor, times e to the minus alpha at

  • time t, times h at time t, times some function y which is

  • a function of x, but not a function of time.
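
In symbols, the update just described reads as follows, with Z standing for the normalizing factor and the superscript t for the step index. The subscript i on x is an assumption made here to tie each weight to its own sample.

```latex
w_i^{t+1} = \frac{w_i^{t}}{Z} \, e^{-\alpha^{t} \, h^{t}(x_i) \, y(x_i)}
```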

  • Now you say, where did this come from?

  • And the answer is, it did not spring from the heart of

  • a mathematician in the first 10 minutes that he

  • looked at this problem.

  • In fact, when I asked [INAUDIBLE]

  • how this worked, he said, well, he was thinking about

  • this on the couch every Saturday for about a year, and

  • his wife was getting pretty sore, but he finally found it

  • and saved their marriage.

  • So where does stuff like this come from?

  • Really, it comes from knowing a lot of mathematics, and

  • seeing a lot of situations, and knowing that something

  • like this might be mathematically convenient.

  • Something like this might be mathematically convenient.

  • But we've got to back up a little and let it sing to us.

  • What's y?

  • We saw y last time.

  • The support vector machines.

  • That's just a function.

  • That's plus 1 or minus 1, depending on whether the

  • output ought to be plus 1 or minus 1.

  • So if this guy is giving the correct answer, and the

  • correct answer is plus, and then this guy will be plus 1

  • too, because it always gives you the correct answer.

  • So in that case, where this guy is giving the right

  • answer, these will have the same sign, so that will be a

  • plus 1 combination.

  • On the other hand, if that guy's giving the wrong answer,

  • you're going to get a minus 1 out of that combination.

  • So it's true even if the right answer should be minus, right?

  • So if the right answer should be minus, and this is plus,

  • then this will be minus 1, and the whole combination will

  • give you minus 1 again.

  • In other words, the y just flips the sign if you've got

  • the wrong answer, no matter whether the wrong answer is

  • plus 1 or minus 1.

  • These alphas--

  • shoot, those are the same alphas that are in this

  • formula up here, somehow.

  • And then that z, what's that for?

  • Well, if you just look at the previous weights, and its

  • exponential function to produce these W's for the next

  • generation, that's not going to be a distribution, because

  • they won't sum up to 1.

  • So what this thing here, this z is, that's a sort of

  • normalizer.

  • And that makes that whole combination of new

  • weights add up to 1.

  • So it's whatever you got by adding up all those guys, and

  • then dividing by that number.

  • Well, phew.

  • I don't know.

  • Now there's some it-turns-out-thats.

  • We're going to imagine that somebody's done the same sort

  • of thing we did to the support vector machines.

  • We're going to find a way to minimize the error.

  • And the error we're going to minimize is the error produced

  • by that whole thing up there in 4.

  • We're going to minimize the error of that entire

  • expression as we go along.

  • And what we discover when we do the appropriate

  • differentiations and stuff--

  • you know, that's what we do in calculus--

  • what we discover is that you get minimum error for the

  • whole thing if alpha is equal to 1 minus the error rate at

  • time t, divided by the error rate at time t.

  • Now let's take the logarithm of that, and

  • multiply it by half.

  • And that's what [INAUDIBLE]

  • was struggling to find.
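
To get a feel for that formula, here is a small Python sketch that evaluates it at a few error rates. The function name `alpha_for` and the chosen error rates are illustrative assumptions.

```python
import math


def alpha_for(error_rate):
    """alpha = 1/2 * ln((1 - error) / error), for 0 < error < 1."""
    return 0.5 * math.log((1 - error_rate) / error_rate)


# Evaluate alpha at a few hypothetical error rates.
alphas = {eps: alpha_for(eps) for eps in (0.1, 0.25, 0.5, 0.75)}
for eps, alpha in alphas.items():
    print(eps, round(alpha, 4))
```

A small error rate gives a large alpha, an error rate of exactly one half gives an alpha of zero, and an error rate above one half gives a negative alpha, which amounts to counting that classifier's vote with its sign flipped.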

  • But we haven't quite got it right.

  • And so let me add this in separate chunks, so we don't

  • get confused about this.

  • It's a bound on that expression up there.

  • It's a bound on the error rate produced by that expression.

  • So interestingly enough, this means that the error rate can

  • actually go up as you add terms to this formula.

  • All you know is that the error rate is going to be bounded by

  • an exponentially decaying function.

  • So it's eventually guaranteed to converge on zero.

  • So it's a minimal error bound.

  • It turns out to be exponential.

  • Well, there it is.

  • We're done.
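
Putting the pieces together, the loop can be sketched in Python as below. This is a sketch under stated assumptions, not the demonstration program used in the lecture: the three one-dimensional samples, the `max_rounds` cap, the vote of 1.0 given to a perfect test, and the early exit when no test beats a coin flip are all choices made for this example.

```python
import math


def stump_predict(stump, point):
    """A stump is (dimension, threshold, sign_above); it returns +1 or -1."""
    dim, threshold, sign_above = stump
    return sign_above if point[dim] > threshold else -sign_above


def make_stumps(points):
    """All axis-aligned single-threshold tests, in both polarities."""
    stumps = []
    for dim in range(len(points[0])):
        values = sorted({p[dim] for p in points})
        thresholds = [values[0] - 1] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        stumps += [(dim, t, s) for t in thresholds for s in (+1, -1)]
    return stumps


def ensemble_predict(ensemble, point):
    """H(x): the sign of the alpha-weighted sum of the chosen classifiers."""
    total = sum(alpha * stump_predict(stump, point) for alpha, stump in ensemble)
    return 1 if total > 0 else -1


def boost(points, labels, max_rounds=50):
    n = len(points)
    weights = [1 / n] * n                      # start with w_i = 1/n
    stumps = make_stumps(points)
    ensemble = []
    for _ in range(max_rounds):                # cap added for safety (assumption)
        # Pick the classifier with the smallest weighted error at this step.
        def error(stump):
            return sum(w for w, p, y in zip(weights, points, labels)
                       if stump_predict(stump, p) != y)
        best = min(stumps, key=error)
        eps = error(best)
        if eps == 0:                           # a perfect single test: avoid log(1/0)
            ensemble.append((1.0, best))
            break
        if eps >= 0.5:                         # nothing better than a coin flip is left
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, best))
        # Reweight: shrink correct samples, grow misclassified ones, then normalize.
        raw = [w * math.exp(-alpha * stump_predict(best, p) * y)
               for w, p, y in zip(weights, points, labels)]
        z = sum(raw)
        weights = [r / z for r in raw]
        if all(ensemble_predict(ensemble, p) == y for p, y in zip(points, labels)):
            break                              # perfect on the sample set
    return ensemble


# Hypothetical 1-D sample set that no single stump classifies correctly.
points = [(1,), (2,), (3,)]
labels = [+1, -1, +1]
ensemble = boost(points, labels)
print(len(ensemble), [ensemble_predict(ensemble, p) for p in points])
```

On this toy set no single stump gets all three samples right, but the weighted vote of three stumps does, and the first stump picked is the one that calls everything plus.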

  • Would you like to see a demonstration?

  • Yeah, OK.

  • Because you look at that, and you say, well, how could

  • anything like that possibly work?

  • And the answer is, surprisingly enough, here's

  • what happens.

  • There's a simple little example.

  • So that's the first test chosen.

  • The greens are pluses and the reds are minuses, so it's

  • still got an error.

  • Still got an error-- boom.

  • There, in two steps.

  • It now has--

  • we can look in the upper right hand corner--

  • we see it's used three classifiers, and we see that

  • one of those classifiers says that everybody belongs to a

  • particular class, three different weights.

  • And the error rate has converged to 0.

  • So let's look at a couple of other ones.

  • Here is the one I use for debugging this thing.

  • We'll let that run.

  • See how fast it is?

  • Boom.

  • It converges to getting all the samples right very fast.

  • Here's another one.

  • This is one we gave on an exam a few years back.

  • First test.

  • Oh, I let it run, so it got everything

  • instantaneously right.

  • Let's take that through step at a time.

  • There's the first one, second one.

  • Still got a lot of errors.

  • Ah, the error rate's dropping.

  • And then flattened, flattened, and it goes to 0.

  • Cool, don't you think?

  • But you say to me, bah, who cares about that stuff?

  • Let's try something more interesting.

  • There's one.

  • That was pretty fast, too.

  • Well, there's not too many samples here.

  • So we can try this.

  • So there's an array of pluses and minuses.

  • Boom.

  • You can see how that error rate is bounded by an

  • exponential?

  • So in the bottom graph, you've got the number of classifiers

  • involved, and that goes up to a total, eventually, of 10.

  • You can see how positive or negative each of the

  • classifiers that's added is by looking at

  • this particular tab.

  • And this just shows how they evolve over time.

  • But the progress thing here is the most interesting.

  • And now you say to me, well, how did the machine do that?

  • And it's all right here.

  • We use an alpha that looks like this.

  • And that allows us to compute the new weights.

  • It says we've got a preliminary calculation.

  • We've got to find a z that does the normalization.

  • And we sure better bring our calculator, because we've got,

  • first of all, to calculate the error rate.

  • Then we've got to take its logarithm, divide by 2, plug

  • it into that formula, take the exponent, and that gives us

  • the new weight.

  • And that's how the program works.

  • And if you try that, I guarantee you

  • will flunk the exam.

  • Now, I don't care about my computer.

  • I really don't.

  • It's a slave, and it can calculate these logarithms and

  • exponentials till it turns blue, and I don't care.

  • Because I've got four cores or something, and who cares.

  • Might as well do this, than sit around

  • just burning up heat.

  • But you don't want to do that.

  • So what you want to do is you want to know how to do this

  • sort of thing more expeditiously.

  • So we're going to have to let the math sing to us a

  • little bit, with a view towards finding better ways of

  • doing this sort of thing.

  • So let's do that.

  • And we're going to run out of space here before long, so let

  • me reclaim as much of this board as I can.

  • So what I'm going to do is I'm going to say, well, now that

  • we've got this formula for alpha that relates alpha t to

  • the error, then I can plug that into this formula up

  • here, number 6.

  • And what I'll get is that the weight of t plus 1 is equal to

  • the weight at t divided by that normalizing factor,

  • multiplied times something that depends on whether it's

  • categorized correctly or not.

  • That's what that y's in there for, right?

  • So we've got a logarithm here, and we got a sign flipper up

  • there in terms of that H of x and y combination.

  • So if the sign of that whole thing at minus alpha and that

  • y H combination turns out to be negative, then we're going

  • to have to flip the numerator and denominator here in this

  • logarithm, right?

  • And oh, by the way, since we've got a half out here,

  • that turns out to be the square root of that term

  • inside the logarithm.

  • So when we carefully do that, what we discover is that it

  • depends on whether it's the right thing or not.

  • But what it turns out to be is something like a multiplier of

  • the square root.

  • Better be careful, here.

  • The square root of what?

  • STUDENT: [INAUDIBLE].

  • PATRICK WINSTON: Well, let's see.

  • But we have to be careful.

  • So let's suppose that this is for things that we get correct.

  • So if we get it correct, then we're going to get the same

  • sign out of H of x and y.

  • We've got a minus sign out there, so we're going to flip

  • the numerator and denominator.

  • So we're going to get the square root of epsilon of t over 1

  • minus epsilon of t if that's correct.

  • If it's wrong, it'll just be the flip of that.

  • So it'll be the square root of 1 minus the error rate over

  • the error rate.
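
The two cases follow from substituting the formula for alpha into the exponential. As a worked step, writing epsilon for the error rate at time t and using the fact that the product of h and y is plus 1 for a correct sample and minus 1 for a wrong one:

```latex
e^{-\alpha^{t} h^{t}(x_i) y(x_i)} =
\begin{cases}
  e^{-\frac{1}{2}\ln\frac{1-\epsilon^{t}}{\epsilon^{t}}} = \sqrt{\dfrac{\epsilon^{t}}{1-\epsilon^{t}}} & \text{if correct, } h^{t}(x_i)\,y(x_i) = +1 \\[2ex]
  e^{+\frac{1}{2}\ln\frac{1-\epsilon^{t}}{\epsilon^{t}}} = \sqrt{\dfrac{1-\epsilon^{t}}{\epsilon^{t}}} & \text{if wrong, } h^{t}(x_i)\,y(x_i) = -1
\end{cases}
```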

  • Everybody with me on that?

  • I think that's right.

  • If it's wrong, I'll have to hang myself and wear a paper

  • bag over my head like I did last year.

  • But let's see if we can make this go correctly this time.

  • So now, we've got this guy here, we've got everything

  • plugged in all right, and we know that now this z ought to

  • be selected so that it's equal to the sum of this guy

  • multiplied by these things as appropriate for whether it's

  • correct or not.

  • Because we want, in the end, for all of these w's

  • to add up to 1.

  • So let's see what they add up to without the z there.

  • So what we know is that it must be the case that if we

  • add over the correct ones, we get the square root of the

  • error rate over 1 minus the error rate, times the sum of the Wi at time t for correct.

  • Plus now we've got the square root of 1 minus the error rate over

  • the error rate, times the sum of the Wi at time t for wrong.

  • So that's what we get if we added all these

  • up without the z.

  • So since everything has to add up to 1, then z ought to be

  • equal to this sum.

  • That looks pretty horrible, until we realize that if we

  • add these guys up over the weights that are wrong, that

  • is the error rate.

  • This is e.

  • So therefore, z is equal to the square root of the error rate

  • times 1 minus the error rate.

  • That's the contribution of this term.

  • Now, let's see.

  • What is the sum of the weights over the

  • ones that are correct?

  • Well, that must be 1 minus the error rate.

  • Ah, so this thing gives you the same result as this one.

  • So z is equal to 2 times that.
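
The same step as a worked equation, assuming the old weights add up to 1, so that the weights on the wrong samples sum to the error rate and the weights on the correct samples sum to 1 minus the error rate:

```latex
Z = \sqrt{\frac{\epsilon}{1-\epsilon}} \sum_{\text{correct}} w_i^{t}
  + \sqrt{\frac{1-\epsilon}{\epsilon}} \sum_{\text{wrong}} w_i^{t}
  = \sqrt{\frac{\epsilon}{1-\epsilon}}\,(1-\epsilon)
  + \sqrt{\frac{1-\epsilon}{\epsilon}}\,\epsilon
  = 2\sqrt{\epsilon\,(1-\epsilon)}
```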

  • And that's a good thing.

  • Now we are getting somewhere.

  • Because now, it becomes a little bit easier to write

  • some things down.

  • Well, we're way past this, so let's get rid of this.

  • And now we can put some things together.

  • Let me point out what I'm putting together.

  • I've got an expression for z right here.

  • And I've got an expression for the new w's here.

  • So let's put those together and say that w of t plus 1 is

  • equal to w of t.

  • I guess we're going to divide that by 2.

  • And then we've got this square root times that expression.

  • So if we take that correct one, and divide by that one,

  • then the [INAUDIBLE]

  • cancel out, and I get 1 over 1 minus the error rate.

  • That's it.

  • That's correct.

  • And if it's not correct, then it's Wt over 2--

  • and working through the math--

  • 1 over epsilon, if wrong.
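
A quick numerical check of that simplification, as a sketch rather than anything from the lecture: the function names and the four sample weights are made up, and alpha is assumed to be the usual half log of 1 minus the error rate over the error rate. It compares the update done the long way (exponential, then divide by z) with the short form (Wt over 2, times 1 over 1 minus epsilon or 1 over epsilon).

```python
import math


def exponential_update(weights, correct, eps):
    """Long way: multiply by exp(-alpha) if correct, exp(+alpha) if wrong, then divide by z."""
    alpha = 0.5 * math.log((1 - eps) / eps)  # assumed AdaBoost alpha
    raw = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    z = sum(raw)  # normalizer that makes the new weights add up to 1
    return [r / z for r in raw]


def closed_form_update(weights, correct, eps):
    """Short way: w/2 * 1/(1 - eps) if correct, w/2 * 1/eps if wrong."""
    return [w / 2 / (1 - eps) if c else w / 2 / eps for w, c in zip(weights, correct)]


# Hypothetical example: four samples, the second one misclassified.
weights = [0.1, 0.2, 0.3, 0.4]
correct = [True, False, True, True]
eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error rate

new_exp = exponential_update(weights, correct, eps)
new_closed = closed_form_update(weights, correct, eps)
print(new_exp)
print(new_closed)
```

The two printed lists should agree up to floating-point rounding.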

  • Do we feel like we're making any progress?

  • No.

  • Because we haven't let it sing to us enough yet.

  • So I want to draw your attention to what happens to

  • amateur rock climbers when they're halfway

  • up a difficult cliff.

  • They're usually [INAUDIBLE], sometimes they're not.

  • If they're not, they're scared to death.

  • And every once in a while, as they're just about to fall,

  • they find some little tiny hole to stick a fingernail in,

  • and that keeps them from falling.

  • That's called a thank-god hole.

  • So what I'm about to introduce is the analog of those little

  • places where you can stick your fingernail in.

  • It's the thank-god hole for dealing

  • with boosting problems.

  • So what happens if I add all these [? Wi ?]

  • up for the ones that the classifier produces a

  • correct answer on?

  • Well, it'll be 1 over 2, times 1 over 1 minus epsilon, times

  • the sum of the Wt for which the answer was correct.

  • What's this sum?

  • Oh!

  • My goodness.

  • 1 minus epsilon.

  • So what I've just discovered is that if I sum new w's over

  • those samples for which I got a correct answer,

  • it's equal to 1/2.

  • And guess what?

  • That means that if I sum them over wrong, it's equal to 1/2

  • as well.

  • So that means that I take all of the weight for which I got

  • the right answer with the previous test, and those weights

  • will add up to something.

  • And to get the weights for the next generation, all I have to

  • do is scale them so that they add up to 1/2.

  • This was not noticed by the people who

  • developed this stuff.

  • This was noticed by Luis Ortiz, who was a 6.034

  • instructor a few years ago.

  • The sum of those weights is going to be a scaled version

  • of what they were before.

  • So you take all the weights for which this new

  • classifier--

  • this one you selected to give you the minimum error on the

  • re-weighted stuff--

  • you take the ones that it gives a correct answer for,

  • and you take all of those weights, and you just scale

  • them so they add up to 1/2.

  • So do you have to compute any logarithms?

  • No.

  • Do you have to compute any exponentials?

  • No.

  • Do you have to calculate z?

  • No.

  • Do you have to calculate alpha to get the new weights?

  • No.

  • All you have to do is scale them.

  • And that's a pretty good thank-god hole.

  • So that's thank-god hole number one.
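
A minimal Python sketch of that scaling rule. It assumes the classifier gets at least one sample right and at least one wrong; the function name `rescale_weights` and the five-sample example are made up for illustration.

```python
from fractions import Fraction


def rescale_weights(weights, correct):
    """Scale the correctly classified weights to sum to 1/2, and the wrong ones to sum to 1/2.

    weights: current sample weights
    correct: parallel list of booleans, True where the chosen classifier was right
    """
    sum_correct = sum(w for w, c in zip(weights, correct) if c)
    sum_wrong = sum(w for w, c in zip(weights, correct) if not c)
    if sum_correct == 0 or sum_wrong == 0:
        raise ValueError("need at least one correct and one wrong sample")
    half = Fraction(1, 2)
    return [half * w / (sum_correct if c else sum_wrong) for w, c in zip(weights, correct)]


# Hypothetical example: five equally weighted samples, the last one misclassified.
old = [Fraction(1, 5)] * 5
right = [True, True, True, True, False]
new = rescale_weights(old, right)
print(new)  # four weights of 1/8 and one weight of 1/2
```

No logarithm, exponential, z, or alpha appears anywhere in the function, which is the point of the shortcut. Exact fractions are used only so the example is easy to check by hand.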

  • Now, for thank-god hole number two, we need to go back and

  • think about the fact that we're going to give you problems in

  • probability that involve decision tree stumps.

  • And there are a lot of decision tree stumps that you

  • might have to pick from.

  • So we need a thank-god hole for deciding how

  • to deal with that.

  • Where can I find some room?

  • How about right here.

  • Suppose you've got a space that looks like this.

  • I'm just making this up at random.

  • So how many--

  • let's see.

  • 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.

  • How many tests do I have to consider in that dimension?

  • 11.

  • It's 1 plus the number of samples.

  • That would be horrible.

  • I don't know.

  • Do I actually have to calculate this one?

  • How could that possibly be better than that one?

  • It's got one more thing wrong.

  • So that one makes sense.

  • The other one doesn't make sense.

  • So in the end, no test that lies between two correctly

  • classified samples will ever be any good.

  • So that one's a good guy, and that one's a good guy.

  • And this one's a bad guy.

  • Bad guy, bad guy, bad guy, bad guy.

  • Bad guy, bad guy, bad guy.

  • So the actual number of tests you've got is three.
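
One way to sketch that pruning in Python, reading the argument as: along one dimension, a threshold between two neighboring samples of the same class can always be beaten by sliding it past one of them, so only thresholds between neighbors of different classes are worth scoring. The function name and the data are hypothetical, the sample positions are assumed to be distinct, and the trivial tests outside all the samples are left out.

```python
def candidate_thresholds(xs, labels):
    """Midpoints between adjacent samples (sorted by position) whose labels differ."""
    points = sorted(zip(xs, labels))
    return [
        (a + b) / 2
        for (a, label_a), (b, label_b) in zip(points, points[1:])
        if label_a != label_b
    ]


# Hypothetical example: six samples along one axis, labels +1 / -1.
xs = [1, 2, 3, 4, 5, 6]
labels = [+1, +1, -1, -1, +1, +1]
thresholds = candidate_thresholds(xs, labels)
print(thresholds)  # [2.5, 4.5]
```

Here the five gaps between neighboring samples shrink to two thresholds worth scoring.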

  • And likewise, in the other dimension--

  • well, I haven't drawn it so well here, but would this test

  • be a good one?

  • No.

  • That one?

  • No.

  • Actually, I'd better look over here on the right and see what

  • I've got before I draw too many conclusions.

  • Let's look over this, since I don't want to think too hard

  • about what's going on in the other dimension.

  • But the idea is that very few of those

  • tests actually matter.

  • Now, you say to me, there's one last thing.

  • What about overfitting?

  • Because all this does is drape a solution over the samples.

  • And like support vector machines overfit, neural nets

  • overfit, identification trees overfit.

  • Guess what?

  • This doesn't seem to overfit.

  • That's an experimental result for which the

  • literature is confused.

  • It goes back to providing an explanation.

  • So this stuff is tried on all sorts of problems, like

  • handwriting recognition, understanding speech, all

  • sorts of stuff uses boosting.

  • And unlike other methods, for some reason as yet imperfectly

  • understood, it doesn't seem to overfit.

  • But in the end, we leave no stone unturned in 6.034.

  • Every time we do this, we do some additional experiments.

  • So here's a sample that I'll leave you with.

  • Here's a situation in which we have a 10-dimensional space.

  • We've made a fake distribution, and then we put

  • in that boxed outlier.

  • That was just put into the space at random, so it can be

  • viewed as an error point.

  • So now what we're going to do is we're going to see what

  • happens when we run that guy.

  • And sure enough, in 17 steps, it finds a solution.

  • But maybe it's overfit that little guy who's an error.

  • But one thing you can do is you can say, well, all of

  • these classifiers are dividing this space up into chunks, and

  • we can compute the size of the space occupied by any sample.

  • So one thing we can do--

  • alas, I'll have to get up a new demonstration.

  • One thing we can do, now that this guy's over here, we can

  • switch the volume tab and watch how the volume occupied

  • by that error point evolves as we solve the problem.

  • So look what happens.

  • This is, of course, randomly generated.

  • I'm counting on this working.

  • Never failed before.

  • So it originally starts out as occupying 26%

  • of the total volume.

  • It ends up occupying 1.4 times 10 to the

  • minus 3 percent of the volume.

  • So what tends to happen is that these decision tree

  • stumps tend to wrap themselves so tightly around the error

  • points, there's no room for overfitting, because nothing

  • else will fit in that same volume.

  • So that's why I think that this thing tends to produce

  • solutions which don't overfit.

  • So in conclusion, this is magic.

  • You always want to use it.

  • It'll work with any kind of [? speed ?] of

  • classifiers you want.

  • And you should understand it very thoroughly, because if

  • anything is useful in the subject in the dimension of learning,

  • this is it.


17. Learning: Boosting

Posted by Ren Hau Gu on January 14, 2021