回放2｜影像分類 (Lecture 2 | Image Classification)

字幕列表影片播放

Okay, so welcome to lecture two of CS231N.
On Tuesday we, just recall, we, sort of, gave you
the big picture view of what is computer vision,
what is the history,
and a little bit of the overview of the class.
And today, we're really going to dive in, for the first time,
into the details.
And we'll start to see, in much more depth,
exactly how some of these learning algorithms
actually work in practice.
So, the first lecture of the class
is probably, sort of, the largest big picture vision.
And the majority of the lectures in this class
will be much more detail orientated,
much more focused on the specific mechanics,
of these different algorithms.
So, today we'll see our first learning algorithm
and that'll be really exciting, I think.
But, before we get to that,
I wanted to talk about a couple of administrative issues.
One, is Piazza.
So, I saw it when I checked yesterday,
it seemed like we had maybe 500 students
signed up on Piazza.
Which means that there are several hundred of you
who are not yet there.
So, we really want Piazza to be the main source
of communication between the students and the core staff.
So, we've gotten a lot of questions to the staff list
about project ideas or questions about midterm attendance
or poster session attendance.
And, any, sort of, questions like that
should really go to Piazza.
You'll probably get answers to your questions faster
on Piazza, because all the TAs are knowing to check that.
And it's, sort of, easy for emails to get lost
in the shuffle if you just send to the course list.
It's also come to my attention that some SCPD students
are having a bit of a hard time signing up for Piazza.
SCPD students are supposed to receive a
@stanford.edu email address.
So, once you get that email address,
then you can use the Stanford email to sign into Piazza.
Probably that doesn't affect those of you who are
sitting in the room right now,
but, for those students listening on SCPD.
The next administrative issue is about assignment one.
Assignment one will be up later today,
probably sometime this afternoon,
but I promise, before I go to sleep tonight,
it'll be up.
But, if you're getting a little bit antsy
and really want to start working on it right now,
then you can look at last year's version
of assignment one.
It'll be pretty much the same content.
We're just reshuffling it a little bit to make it,
like, for example, upgrading to work with Python 3,
rather than Python 2.7.
And some of these minor cosmetic changes,
but the content of the assignment will still be the same
as last year.
So, in this assignment you'll be implementing your own
k-nearest neighbor classifier,
which we're going to talk about in this lecture.
You'll also implement several different linear classifiers,
including the SVM and Softmax,
as well as a simple two-layer neural network.
And we'll cover all this content
over the next couple of lectures.
So, all of our assignments are using Python and NumPy.
If you aren't familiar with Python or NumPy,
then we have written a tutorial that you can find
on the course website to try and get you up to speed.
But, this is, actually, pretty important.
NumPy lets you write these very efficient vectorized
operations that let you do quite a lot of computation
in just a couple lines of code.
So this is super important for pretty much
all aspects of numerical computing and machine learning
and everything like that,
is efficiently implementing these vectorized operations.
And you'll get a lot of practice with this
on the first assignment.
So, for those of you who don't have a lot of experience
with Matlab or NumPy or other types of vectorized
tensor computation, I recommend that you start looking
at this assignment pretty early
and also, read carefully through the tutorial.
The other thing I wanted to talk about
is that we're happy to announce that
we're officially supported through Google Cloud
for this class.
So, Google Cloud is somewhat similar to Amazon AWS.
You can go and start virtual machines up in the cloud.
These virtual machines can have GPUs.
We're working on the tutorial for exactly how to use
Google Cloud and get it to work for the assignments.
But our intention is that you'll be able to just download
some image, and it'll be very seamless
for you to work on the assignments
on one of these instances on the cloud.
And because Google has, very generously,
supported this course,
we'll be able to distribute to each of you
coupons that let you use Google Cloud credits for free
for the class.
So you can feel free to use these for the assignments
and also for the course projects
when you want to start using GPUs and larger machines
and whatnot.
So, we'll post more details about that,
probably, on Piazza later today.
But, I just wanted to mention,
because I know there had been a couple of questions
about, can I use my laptop?
Do I have to run on corn?
Do I have to, whatever?
And the answer is that, you'll be able to run on
Google Cloud and we'll provide you some coupons for that.
Yeah, so,
those are, kind of, the major administrative issues
I wanted to talk about today.
And then, let's dive into the content.
So, the last lecture we talked a little bit
about this task of image classification,
which is really a core task in computer vision.
And this is something that we'll really focus on
throughout the course of the class.
Is, exactly,
how do we work on this image classification task?
So, a little bit more concretely,
when you're doing image classification,
your system receives some input image,
which is this cute cat in this example,
and the system is aware of some predetermined set
of categories or labels.
So, these might be, like, a dog or a cat or a truck
or a plane, and there's some fixed set of category labels,
and the job of the computer is to look at the picture
and assign it one of these fixed category labels.
This seems like a really easy problem,
because so much of your own visual system in your brain
is hardwired to doing these, sort of,
visual recognition tasks.
But this is actually a really, really hard problem
for a machine.
So, if you dig in and think about, actually,
what does a computer see when it looks at this image,
it definitely doesn't get this holistic idea of a cat
that you see when you look at it.
And the computer really is representing the image
as this gigantic grid of numbers.
So, the image might be something like 800 by 600 pixels.
And each pixel is represented by three numbers,
giving the red, green, and blue values for that pixel.
So, to the computer,
this is just a gigantic grid of numbers.
And it's very difficult to distill the cat-ness
out of this, like, giant array of thousands, or whatever,
very many different numbers.
So, we refer to this problem as the semantic gap.
This idea of a cat, or this label of a cat,
is a semantic label that we're assigning to this image,
and there's this huge gap between the semantic idea of a cat
and these pixel values that the computer is actually seeing.
And this is a really hard problem because
you can change the picture in very small, subtle ways
that will cause this pixel grid to change entirely.
So, for example, if we took this same cat,
and if the cat happened to sit still
and not even twitch, not move a muscle,
which is never going to happen,
but we moved the camera to the other side,
then every single grid, every single pixel,
in this giant grid of numbers
would be completely different.
But, somehow, it's still representing the same cat.
And our algorithms need to be robust to this.
But, not only viewpoint is one problem,
another is illumination.
There can be different lighting conditions going on
in the scene.
Whether the cat is appearing in this very dark, moody scene,
or like is this very bright, sunlit scene, it's still a cat,
and our algorithms need to be robust to that.
Objects can also deform.
I think cats are, maybe, among the more deformable
of animals that you might see out there.
And cats can really assume a lot of different, varied poses
and positions.
And our algorithms should be robust to these different
kinds of transforms.
There can also be problems of occlusion,
where you might only see part of a cat, like, just the face,
or in this extreme example, just a tail peeking out
from under the couch cushion.
But, in these cases, it's pretty easy for you, as a person,
to realize that this is probably a cat,
and you still recognize these images as cats.
And this is something that our algorithms
also must be robust to,
which is quite difficult, I think.
There can also be problems of background clutter,
where maybe the foreground object of the cat,
could actually look quite similar in appearance
to the background.
And this is another thing that we need to handle.
There's also this problem of intraclass variation,
that this one notion of cat-ness, actually spans a lot of
different visual appearances.
And cats can come in different shapes and sizes
and colors and ages.
And our algorithm, again, needs to work
and handle all these different variations.
So, this is actually a really, really challenging problem.
And it's sort of easy to forget how easy this is
because so much of your brain is specifically tuned
for dealing with these things.
But now if we want our computer programs
to deal with all of these problems, all simultaneously,
and not just for cats, by the way,
but for just about any object category you can imagine,
this is a fantastically challenging problem.
And it's, actually, somewhat miraculous
that this works at all, in my opinion.
But, actually, not only does it work,
but these things work very close to human accuracy
in some limited situations.
And take only hundreds of milliseconds to do so.
So, this is some pretty amazing, incredible technology,
in my opinion, and over the course of the rest of the class
we will really see what kinds of advancements
have made this possible.
So now, if you, kind of, think about
what is the API for writing an image classifier,
you might sit down and try to write a method in Python
like this.
Where you want to take in an image
and then do some crazy magic
and then, eventually, spit out this class label
to say cat or dog or whatnot.
And there's really no obvious way to do this, right?
If you're taking an algorithms class
and your task is to sort numbers
or compute a convex hull
or, even, do something like RSA encryption,
you, sort of, can write down an algorithm
and enumerate all the steps that need to happen
in order for this things to work.
But, when we're trying to recognize objects,
or recognize cats or images,
there's no really clear, explicit algorithm
that makes intuitive sense,
for how you might go about recognizing these objects.
So, this is, again, quite challenging,
if you think about,
if it was your first day programming
and you had to sit down and write this function,
I think most people would be in trouble.
That being said,
people have definitely made explicit attempts
to try to write, sort of, high-end coded rules
for recognizing different animals.
So, we touched on this a little bit in the last lecture,
but maybe one idea for cats is that,
we know that cats have ears and eyes and mouths and noses.
And we know that edges, from Hubel and Wiesel,
we know that edges are pretty important
when it comes to visual recognition.
So one thing we might try to do is
compute the edges of this image
and then go in and try to categorize all the different
corners and boundaries, and say that, if we have maybe
three lines meeting this way, then it might be a corner,
and an ear has one corner here and one corner there
and one corner there,
and then, kind of, write down this explicit set of rules
for recognizing cats.
But this turns out not to work very well.
One, it's super brittle.
And, two, say, if you want to start over for another
object category, and maybe not worry about cats,
but talk about trucks or dogs or fishes or something else,
then you need to start all over again.
So, this is really not a very scalable approach.
We want to come up with some algorithm, or some method,
for these recognition tasks
which scales much more naturally to all the variety
of objects in the world.
So, the insight that, sort of, makes this all work
is this idea of the data-driven approach.
Rather than sitting down and writing these hand-specified
rules to try to craft exactly what is a cat or a fish
or what have you,
instead, we'll go out onto the internet
and collect a large dataset of many, many cats
and many, many airplanes and many, many deer
and different things like this.
And we can actually use tools like Google Image Search,
or something like that,
to go out and collect a very large number of examples
of these different categories.
By the way, this actually takes quite a lot of effort
to go out and actually collect these datasets
but, luckily, there's a lot of really good, high quality
datasets out there already for you to use.
Then once we get this dataset,
we train this machine learning classifier
that is going to ingest all of the data,
summarize it in some way,
and then spit out a model
that summarizes the knowledge of how to recognize
these different object categories.
Then finally, we'll use this training model
and apply it on new images
that will then be able to recognize
cats and dogs and whatnot.
So here our API has changed a little bit.
Rather than a single function
that just inputs an image and recognizes a cat,
we have these two functions.
One, called, train, that's going to input images and labels
and then output a model,
and then, separately, another function called, predict,
which will input the model and than make predictions
for images.
And this is, kind of, the key insight
that allowed all these things to start working really well
over the last 10, 20 years or so.
So, this class is primarily about neural networks
and convolutional neural networks
and deep learning and all that,
but this idea of a data-driven approach is much more general
than just deep learning.
And I think it's useful to, sort of,
step through this process
for a very simple classifier first,
before we get to these big, complex ones.
So, probably, the simplest classifier you can imagine
is something we call nearest neighbor.
The algorithm is pretty dumb, honestly.
So, during the training step we won't do anything,
we'll just memorize all of the training data.
So this is very simple.
And now, during the prediction step,
we're going to take some new image
and go and try to find the most similar image
in the training data to that new image,
and now predict the label of that most similar image.
A very simple algorithm.
But it, sort of, has a lot of these nice properties
with respect to data-drivenness and whatnot.
So, to be a little bit more concrete,
you might imagine working on this dataset called CIFAR-10,
which is very commonly used in machine learning,
as kind of a small test case.
And you'll be working with this dataset on your homework.
So, the CIFAR-10 dataset gives you 10 different classes,
airplanes and automobiles and birds and cats and different
things like that.
And for each of those 10 categories
it provides 50,000 training images,
roughly evenly distributed across these 10 categories.
And then 10,000 additional testing images
that you're supposed to test your algorithm on.
So here's an example of applying this simple
nearest neighbor classifier to some of these test images
on CIFAR-10.
So, on this grid on the right,
for the left most column,
gives a test image in the CIFAR-10 dataset.
And now on the right, we've sorted the training images
and show the most similar training images
to each of these test examples.
And you can see that they look kind of visually similar
to the training images,
although they are not always correct, right?
So, maybe on the second row, we see that the testing,
this is kind of hard to see,
because these images are 32 by 32 pixels,
you need to really dive in there
and try to make your best guess.
But, this image is a dog and it's nearest neighbor is also
a dog, but this next one, I think is actually a deer
or a horse or something else.
But, you can see that it looks quite visually similar,
because there's kind of a white blob in the middle
and whatnot.
So, if we're applying the nearest neighbor algorithm
to this image,
we'll find the closest example in the training set.
And now, the closest example, we know it's label,
because it comes from the training set.
And now, we'll simply say that this testing image is also
a dog.
You can see from these examples that is probably not
going to work very well,
but it's still kind of a nice example to work through.
But then, one detail that we need to know is,
given a pair of images,
how can we actually compare them?
Because, if we're going to take our test image and compare it
to all the training images,
we actually have many different choices
for exactly what that comparison function should look like.
So, in the example in the previous slide,
we've used what's called the L1 distance,
also sometimes called the Manhattan distance.
So, this is a really sort of simple, easy idea
for comparing images.
And that's that we're going to just compare individual pixels
in these images.
So, supposing that our test image is maybe just a tiny
four by four image of pixel values,
then we're take this upper-left hand pixel
of the test image,
subtract off the value in the training image,
take the absolute value,
and get the difference in that pixel between the two images.
And then, sum all these up across all the pixels
in the image.
So, this is kind of a stupid way to compare images,
but it does some reasonable things sometimes.
But, this gives us a very concrete way
to measure the difference between two images.
And in this case, we have this difference of 456
between these two images.
So, here's some full Python code
for implementing this nearest neighbor classifier
and you can see it's pretty short and pretty concise
because we've made use of many of these vectorized
operations offered by NumPy.
So, here we can see that this training function,
that we talked about earlier,
is, again, very simple, in the case of nearest neighbor,
you just memorize the training data,
there's not really much to do here.
And now, at test time, we're going to take in our image
and then go in and compare using this L1 distance function,
our test image to each of these training examples
and find the most similar example in the training set.
And you can see that, we're actually able to do this
in just one or two lines of Python code
by utilizing these vectorized operations in NumPy.
So, this is something that you'll get practice with
on the first assignment.
So now, a couple questions about this simple classifier.
First, if we have N examples in our training set,
then how fast can we expect training and testing to be?
Well, training is probably constant
because we don't really need to do anything,
we just need to memorize the data.
And if you're just copying a pointer,
that's going to be constant time
no matter how big your dataset is.
But now, at test time we need to do this comparison stop
and compare our test image
to each of the N training examples in the dataset.
And this is actually quite slow.
So, this is actually somewhat backwards,
if you think about it.
Because, in practice,
we want our classifiers to be slow at training time
and then fast at testing time.
Because, you might imagine, that a classifier might go
and be trained in a data center somewhere
and you can afford to spend a lot of computation
at training time to make the classifier really good.
But then,
when you go and deploy the classifier at test time,
you want it to run on your mobile phone
or in a browser or some other low power device,
and you really want the testing time performance
of your classifier to be quite fast.
So, from this perspective, this nearest neighbor algorithm,
is, actually, a little bit backwards.
And we'll see that once we move to
convolutional neural networks,
and other types of parametric models,
they'll be the reverse of this.
Where you'll spend a lot of compute at training time,
but then they'll be quite fast at testing time.
So then, the question is,
what exactly does this nearest neighbor algorithm
look like when you apply it in practice?
So, here we've drawn, what we call the decision regions
of a nearest neighbor classifier.
So, here our training set consists of these points
in the two dimensional plane,
where the color of the point represents the category,
or the class label, of that point.
So, here we see we have five classes
and some blue ones up in the corner here,
some purple ones in the upper-right hand corner.
And now for each pixel in this entire plane,
we've gone and computed what is the nearest example
in these training data,
and then colored the point of the background
corresponding to what is the class label.
So, you can see that this nearest neighbor classifier
is just sort of carving up the space
and coloring the space according to the nearby points.
But this classifier is maybe not so great.
And by looking at this picture
we can start to see some of the problems that might come out
with a nearest neighbor classifier.
For one, this central region actually contains
mostly green points,
but one little yellow point in the middle.
But because we're just looking at the nearest neighbor,
this causes a little yellow island to appear
in this middle of this green cluster.
And that's, maybe, not so great.
Maybe those points actually should have been green.
And then, similarly we also see these, sort of, fingers,
like the green region pushing into the blue region,
again, due to the presence of one point,
which may have been noisy or spurious.
So, this kind of motivates a slight generalization
of this algorithm called k-nearest neighbors.
So rather than just looking for the single nearest neighbor,
instead we'll do something a little bit fancier
and find K of our nearest neighbors,
according to our distance metric,
and then take a vote among each of our neighbors.
And then predict the majority vote
among our neighbors.
You can imagine slightly more complex ways of doing this.
Maybe you'd vote weighted on the distance,
or something like that,
but the simplest thing that tends to work pretty well
is just taking a majority vote.
So here we've shown the exact same set of points
using this K=1 nearest neighbor classifier,
as well as K=3 and K=5 in the middle and on the right.
And once we move to K=3, you can see that that spurious
yellow point in the middle of the green cluster
is no longer causing the points near that region
to be classified as yellow.
Now this entire green portion in the middle
is all being classified as green.
You can also see that these fingers
of the red and blue regions
are starting to get smoothed out
due to this majority voting.
And then, once we move to the K=5 case,
then these decision boundaries
between the blue and red regions
have become quite smooth and quite nice.
So, generally when you're using nearest neighbors
classifiers,
you almost always want to use some value of K,
which is larger than one
because this tends to smooth out your decision
boundaries and lead to better results.
Question?
[student asking a question]
Yes, so the question is,
what is the deal with these white regions?
The white regions are where there was no majority
among the k-nearest neighbors.
You could imagine maybe doing something slightly fancier
and maybe taking a guess or randomly selecting among
the majority winners,
but for this simple example we're just coloring it white
to indicate there was no nearest neighbor
in those points.
Whenever we're thinking about computer vision
I think it's really useful to kind of flip
back and forth between several different viewpoints.
One, is this idea of high dimensional points in the plane,
and then the other is actually looking at concrete images.
Because the pixels of the image actually
allow us to think of these images as high dimensional
vectors.
And it's sort of useful to ping pong back and forth
between these two different viewpoints.
So then, sort of taking this k-nearest neighbor
and going back to the images
you can see that it's actually not very good.
Here I've colored in red and green
which images would actually be classified correctly
or incorrectly according to their nearest neighbor.
And you can see that it's really not very good.
But maybe if we used a larger value of K
then this would involve actually voting among
maybe the top three or the top five
or maybe even the whole row.
And you could imagine that that would end up being
a lot more robust to some of this noise that we see
when retrieving neighbors in this way.
So another choice we have when we're working
with the k-nearest neighbor algorithm
is determining exactly how we should be comparing
our different points.
For the examples so far we've just shown
we've talked about this L1 distance
which takes the sum of the absolute values
between the pixels.
But another common choice is the L2 or Euclidean distance
where you take the square root of the sum of the squares
and take this as your distance.
Choosing different distance metrics actually
is a pretty interesting topic
because different distance metrics
make different assumptions about the underlying
geometry or topology that you'd expect in the space.
So, this L1 distance, underneath this, this is actually
a circle according to the L1 distance
and it forms this square shape thing
around the origin.
Where each of the points on this, on the square,
is equidistant from the origin according to L1,
whereas with the L2 or Euclidean distance
then this circle is a familiar circle,
it looks like what you'd expect.
So one interesting thing to point out between these two
metrics in particular,
is that the L1 distance depends on your choice
of coordinates system.
So if you were to rotate the coordinate frame
that would actually change the L1 distance
between the points.
Whereas changing the coordinate frame in the L2 distance
doesn't matter, it's the same thing no matter what
your coordinate frame is.
Maybe if your input features, if the individual entries
in your vector have some important meaning
for your task,
then maybe somehow L1 might be a more natural fit.
But if it's just a generic vector in some space
and you don't know which of the different elements,
you don't know what they actually mean,
then maybe L2 is slightly more natural.
And another point here is that
by using different distance metrics
we can actually generalize the k-nearest neighbor
classifier to many, many different types of data,
not just vectors, not just images.
So, for example, imagine you wanted to classify pieces
of text, then the only thing you need to do
to use k-nearest neighbors
is to specify some distance function
that can measure distances between maybe two paragraphs
or two sentences or something like that.
So, simply by specifying different distance metrics
we can actually apply this algorithm very generally
to basically any type of data.
Even though it's a kind of simple algorithm,
in general, it's a very good thing to try first
when you're looking at a new problem.
So then, it's also kind of interesting to think about
what is actually happening geometrically
if we choose different distance metrics.
So here we see the same set of points on the left
using the L1, or Manhattan distance,
and then, on the right, using the familiar L2,
or Euclidean distance.
And you can see that the shapes of these decision
boundaries actually change quite a bit
between the two metrics.
So when you're looking at L1 these decision boundaries
tend to follow the coordinate axes.
And this is again because the L1 depends on our choice
of coordinate system.
Where the L2 sort of doesn't really care about the
coordinate axis, it just puts the boundaries
where they should fall naturally.
My confession is that each of these examples
that I've shown you is actually from this interactive
web demo that I built,
where you can go and play with this k-nearest neighbor
classifier on your own.
And this is really hard to work on a projector screen.
So maybe we'll do that on your own time.
So, let's just go back to here.
Man, this is kind of embarrassing.
Okay, that was way more trouble than it was worth.
So, let's skip this, but I encourage you
to go play with this in your browser.
It's actually pretty fun
and kind of nice to build intuition about
how the decision boundary changes
as you change the K
and change your distance metric
and all those sorts of things.
Okay, so then the question is
once you're actually trying to use this algorithm
in practice, there's several choices
you need to make.
We talked about choosing different values of K.
We talked about choosing different distance metrics.
And the question becomes
how do you actually make these choices for your problem
and for your data?
So, these choices, of things like K and the distance metric,
we call hyperparameters,
because they are not necessarily learned from the training
data,
instead these are choices about your algorithm that you make
ahead of time
and there's no way to learn them directly from the data.
So, the question is how do you set these things
in practice?
And they turn out to be very problem-dependent.
And the simple thing that most people do is simply
try different values of hyperparameters for your data
and for your problem, and figure out which one works best.
There's a question?
[student asking a question]
So, the question is, where L1 distance might be preferable
to using L2 distance?
I think it's mainly problem-dependent,
it's sort of difficult to say
in which cases you think one might be better
than the other.
but I think that because L1 has this sort of coordinate
dependency, it actually depends on the coordinate system
of your data,
if you know that you have a vector,
and maybe the individual elements of the vector
have meaning.
Like maybe you're classifying employees for some reason
and then the different elements of that vector correspond
to different features or aspects of an employee.
Like their salary or the number of years they've been
working at the company or something like that.
So I think when your individual elements actually
have some meaning,
is where I think maybe using L1 might make a little bit
more sense.
But in general, again, this is a hyperparameter
and it really depends on your problem and your data
so the best answer is just to try them both
and see what works better.
Even this idea of trying out different values
of hyperparameters and seeing what works best,
there are many different choices here.
What exactly does it mean to try hyperparameters
and see what works best?
Well, the first idea you might think of
is simply choosing the hyperparameters that give you
the best accuracy or best performance
on your training data.
This is actually a really terrible idea.
You should never do this.
In the concrete case of the nearest neighbor
classifier, for example,
if we set K=1, we will always classify the training data
perfectly.
So if we use this strategy we'll always pick K=1,
but, as we saw from the examples earlier,
in practice it seems that setting K equals to larger values
might cause us to misclassify some of the training data,
but, in fact, lead to better performance
on points that were not in the training data.
And ultimately in machine learning
we don't care about fitting the training data,
we really care about how our classifier,
or how our method,
will perform on unseen data after training.
So, this is a terrible idea, don't do this.
So, another idea that you might think of,
is maybe we'll take our full dataset
and we'll split it into some training data
and some test data.
And now I'll try training my algorithm with different
choices of hyperparameters on the training data
and then I'll go and apply that trained classifier
on the test data and now I will pick
the set of hyperparameters that cause me to perform best
on the test data.
This seems like maybe a more reasonable strategy,
but, in fact, this is also a terrible idea
and you should never do this.
Because, again, the point of machine learning systems
is that we want to know how our algorithm will perform.
So, the point of the test set
is to give us some estimate of how our method will do
on unseen data that's coming out from the wild.
And if we use this strategy of training many different
algorithms with different hyperparameters,
and then, selecting the one which does the best
on the test data,
then, it's possible, that we may have just picked
the right set of hyperparameters
that caused our algorithm to work quite well
on this testing set,
but now our performance on this test set
will no longer be representative
of our performance of new, unseen data.
So, again, you should not do this, this is a bad idea,
you'll get in trouble if you do this.
What is much more common, is to actually split your data
into three different sets.
You'll partition most of your data into a training set
and then you'll create a validation set
and a test set.
And now what we typically do is go and train our algorithm
with many different choices of hyperparameters
on the training set,
evaluate on the validation set,
and now pick the set of hyperparameters
which performs best on the validation set.
And now, after you've done all your development,
you've done all your debugging,
after you've dome everything,
then you'd take that best performing classifier
on the validation set
and run it once on the test set.
And now that's the number that goes into your paper,
that's the number that goes into your report,
that's the number that actually is telling you how
your algorithm is doing on unseen data.
And this is actually really, really important
that you keep a very strict separation between
the validation data and the test data.
So, for example, when we're working on research papers,
we typically only touch the test set
at the very last minute.
So, when I'm writing papers,
I tend to only touch the test set for my problem
in maybe the week before the deadline or so
to really insure that we're not
being dishonest here and we're not reporting a number
which is unfair.
So, this is actually super important
and you want to make sure to keep your test data
quite under control.
So another strategy for setting hyperparameters
is called cross validation.
And this is used a little bit more commonly
for small data sets, not used so much in deep learning.
So here the idea is we're going to take our test data,
or we're going to take our dataset,
as usual, hold out some test set to use at the very end,
and now, for the rest of the data,
rather than splitting it into a single training
and validation partition,
instead, we can split our training data
into many different folds.
And now, in this way, we've cycled through choosing which
fold is going to be the validation set.
So now, in this example,
we're using five fold cross validation,
so you would train your algorithm with one set of
hyperparameters on the first four folds,
evaluate the performance on fold four,
and now go and retrain your algorithm on folds
one, two, three, and five,
evaluate on fold four,
and cycle through all the different folds.
And, when you do it this way,
you get much higher confidence about
which hyperparameters are going to perform
more robustly.
So this is kind of the gold standard to use,
but, in practice in deep learning
when we're training large models
and training is very computationally expensive,
these doesn't get used too much in practice.
Question?
[student asking a question]
Yeah, so the question is,
a little bit more concretely,
what's the difference between the training and the
validation set?
So, if you think about the k-nearest neighbor classifier
then the training set is this set of images with labels
where we memorize the labels.
And now, to classify an image,
we're going to take the image and compare it to each element
in the training data,
and then transfer the label from the nearest training point.
So now our algorithm will memorize everything
in the training set,
and now we'll take each element of the validation set
and compare it to each element in the training data
and then use this to determine what is the accuracy
of our classifier when it's applied on the validation set.
So this is the distinction between training
and validation.
Where your algorithm is able to see the labels
of the training set,
but for the validation set,
your algorithm doesn't have direct access to the labels.
We only use the labels of the validation set
to check how well our algorithm is doing.
A question?
[student asking a question]
The question is, whether the test set,
is it possible that the test set might not be
representative of data out there in the wild?
This definitely can be a problem in practice,
the underlying statistical assumption here is that
your data are all independently and identically distributed,
so that all of your data points should be
drawn from the same underlying probability distribution.
Of course, in practice, this might not always be the case,
and you definitely can run into cases
where the test set might not be super representative
of what you see in the wild.
So this is kind of a problem that dataset creators and
dataset curators need to think about.
But when I'm creating datasets, for example,
one thing I do,
is I'll go and collect a whole bunch of data all at once,
using the exact same methodology for collecting the data,
and then afterwards you go and partition it randomly
between train and test.
One thing that can screw you up here is
maybe if you're collecting data over time
and you make the earlier data, that you collect first,
be the training data,
and the later data that you collect be the test data,
then you actually might run into this shift
that could cause problems.
But as long as this partition is random
among your entire set of data points,
then that's how we try to alleviate this problem
in practice.
So then, once you've gone through this
cross validation procedure,
then you end up with graphs that look something like this.
So here, on the X axis, we are showing the value of K
for a k-nearest neighbor classifier on some problem,
and now on the Y axis, we are showing what is the accuracy
of our classifier on some dataset
for different values of K.
And you can see that, in this case,
we've done five fold cross validation over the data,
so, for each value of K we have five different examples
of how well this algorithm is doing.
And, actually, going back to the question about
having some test sets that are better or worse
for your algorithm,
using K fold cross validation
is maybe one way to help quantify that a little bit.
And, in that, we can see the variance of how this algorithm
performs on different of the validation folds.
And that gives you some sense of,
not just what is the best,
but, also, what is the distribution of that performance.
So, whenever you're training machine learning models
you end up making plots like this,
where they show you what is your accuracy,
or your performance as a function of your hyperparameters,
and then you want to go and pick the model,
or the set of hyperparameters,
at the end of the day,
that performs the best on the validation set.
So, here we see that maybe about K=7 probably works
about best for this problem.
So, k-nearest neighbor classifiers on images
are actually almost never used in practice.
Because, with all of these problems that we've talked about.
So, one problem is that it's very slow at test time,
which is the reverse of what we want,
which we talked about earlier.
Another problem is that
these things like Euclidean distance, or L1 distance,
are really not a very good way
to measure distances between images.
These, sort of, vectorial distance functions
do not correspond very well to perceptual similarity
between images.
How you perceive differences between images.
So, in this example, we've constructed,
there's this image on the left of a girl,
and then three different distorted images on the right
where we've blocked out her mouth,
we've actually shifted down by a couple pixels,
or tinted the entire image blue.
And, actually, if you compute the Euclidean distance
between the original and the boxed,
the original and the shuffled,
and original in the tinted,
they all have the same L2 distance.
Which is, maybe, not so good
because it sort of gives you the sense that
the L2 distance is really not doing a very good job
at capturing these perceptional distances between images.
Another, sort of, problem with the k-nearest neighbor
classifier has to do with something we call the curse
of dimensionality.
So, if you recall back this viewpoint we had of the
k-nearest neighbor classifier,
it's sort of dropping paint around each of the training
data points and using that to sort of partition the space.
So that means that if we expect the k-nearest neighbor
classifier to work well,
we kind of need our training examples to cover the space
quite densely.
Otherwise our nearest neighbors could actually be quite far
away and might not actually be very similar to our testing
points.
And the problem is,
that actually densely covering the space,
means that we need a number of training examples,
which is exponential in the dimension of the problem.
So this is very bad, exponential growth is always bad,
basically, you're never going to get enough images
to densely cover this space of pixels
in this high dimensional space.
So that's maybe another thing to keep in mind
when you're thinking about using k-nearest neighbor.
So, kind of the summary is that we're using
k-nearest neighbor to introduce this idea
of image classification.
We have a training set of images and labels
and then we use that
to predict these labels on the test set.
Question?
[student asking a question]
Oh, sorry, the question is,
what was going on with this picture?
What are the green and the blue dots?
So here, we have some training samples
which are represented by points,
and the color of the dot maybe represents the category
of the point, of this training sample.
So, if we're in one dimension,
then you maybe only need four training samples
to densely cover the space,
but if we move to two dimensions,
then, we now need, four times four is 16 training examples
to densely cover this space.
And if we move to three, four, five, many more dimensions,
the number of training examples that we need
to densely cover the space,
grows exponentially with the dimension.
So, this is kind of giving you the sense,
that maybe in two dimensions
we might have this kind of funny curved shape,
or you might have sort of arbitrary manifolds of labels
in different dimensional spaces.
Because the k-nearest neighbor algorithm
doesn't really make any assumptions about these
underlying manifolds,
the only way it can perform properly
is if it has quite a dense sample of training points
to work with.
So, this is kind of the overview of k-nearest neighbors
and you'll get a chance to actually implement this
and try it out on images in the first assignment.
So, if there's any last minute questions about K and N,
I'm going to move on to the next topic.
Question?
[student is asking a question]
Sorry, say that again.
[student is asking a question]
Yeah, so the question is,
why do these images have the same L2 distance?
And the answer is that, I carefully constructed them
to have the same L2 distance.
[laughing]
But it's just giving you the sense that the L2 distance
is not a very good measure of similarity between images.
And these images are actually all different from
each other in quite disparate ways.
If you're using K and N,
then the only thing you have to measure distance
between images,
is this single distance metric.
And this kind of gives you an example where
that distance metric is actually not capturing
the full description of distance or difference
between images.
So, if this case, I just sort of carefully constructed these
translations and these offsets to match exactly.
Question?
[student asking a question]
So, the question is,
maybe this is actually good,
because all of these things
are actually having the same distance to the image.
That's maybe true for this example,
but I think you could also construct examples where
maybe we have two original images
and then by putting the boxes in the right places
or tinting them,
we could cause it to be nearer to pretty much
anything that you want, right?
Because in this example, we can kind of like do arbitrary
shifting and tinting
to kind of change these distances nearly arbitrarily
without changing the perceptional nature of these images.
So, I think that this can actually screw you up
if you have many different original images.
Question?
[student is asking a question]
The question is,
whether or not it's common in real-world cases
to go back and retrain the entire dataset
once you've found those best hyperparameters?
So, people do sometimes do this in practice,
but it's somewhat a matter of taste.
If you're really rushing for that deadline
and you've really got to get this model out the door
then, if it takes a long time to retrain the model
on the whole dataset,
then maybe you won't do it.
But if you have a little bit more time to spare
and a little bit more compute to spare,
and you want to squeeze out that maybe that extra 1%
of performance, then that is a trick you can use.
So we kind of saw that the k-nearest neighbor
has a lot of the nice properties
of machine learning algorithms,
but in practice it's not so great,
and really not used very much in images.
So the next thing I'd like to talk about is
linear classification.
And linear classification is, again, quite a simple learning
algorithm, but this will become super important
and help us build up to whole neural networks
and whole convolutional networks.
So, one analogy people often talk about
when working with neural networks
is we think of them as being kind of like Lego blocks.
That you can have different kinds of components
of neural networks and you can stick these components
together to build these large different towers of
convolutional networks.
One of the most basic building blocks that we'll see
in different types of deep learning applications
is this linear classifier.
So, I think it's actually really important to
have a good understanding of what's happening
with linear classification.
Because these will end up generalizing quite nicely
to whole neural networks.
So another example of kind of this modular nature
of neural networks
comes from some research in our own lab on image captioning,
just as a little bit of a preview.
So here the setup is that we want to input an image
and then output a descriptive sentence
describing the image.
And the way this kind of works is that
we have one convolutional neural network that's looking
at the image,
and a recurrent neural network that knows
about language.
And we can kind of just stick these two pieces together
like Lego blocks and train the whole thing together
and end up with a pretty cool system
that can do some non-trivial things.
And we'll work through the details of this model as we go
forward in the class,
but this just gives you the sense that,
these deep neural networks are kind of like Legos
and this linear classifier
is kind of like the most basic building blocks
of these giant networks.
But that's a little bit too exciting for lecture two,
so we have to go back to CIFAR-10 for the moment.
[laughing]
So, recall that CIFAR-10 has these 50,000 training examples,
each image is 32 by 32 pixels and three color channels.
In linear classification, we're going to take a bit
of a different approach from k-nearest neighbor.
So, the linear classifier is one of the simplest examples
of what we call a parametric model.
So now, our parametric model actually has two different
components.
It's going to take in this image, maybe, of a cat on the left,
and this,
that we usually write as X for our input data,
and also a set of parameters, or weights,
which is usually called W, also sometimes theta,
depending on the literature.
And now we're going to write down some function
which takes in both the data, X, and the parameters, W,
and this'll spit out now 10 numbers describing
what are the scores corresponding to each of those 10
categories in CIFAR-10.
With the interpretation that, like the larger score for cat,
indicates a larger probability of that input X being cat.
And now, a question?
[student asking a question]
Sorry, can you repeat that?
[student asking a question]
Oh, so the question is what is the three?
The three, in this example, corresponds to the three color
channels, red, green, and blue.
Because we typically work on color images,
that's nice information that you don't want to throw away.
So, in the k-nearest neighbor setup
there was no parameters, instead,
we just kind of keep around the whole training data,
the whole training set,
and use that at test time.
But now, in a parametric approach,
we're going to summarize our knowledge of the training data
and stick all that knowledge into these parameters, W.
And now, at test time, we no longer need the actual
training data, we can throw it away.
We only need these parameters, W, at test time.
So this allows our models to now be more efficient
and actually run on maybe small devices like phones.
So, kind of, the whole story in deep learning
is coming up with the right structure for this
function, F.
You can imagine writing down different functional forms
for how to combine weights and data in different
complex ways, and these could correspond to different
network architectures.
But the simplest possible example
of combining these two things
is just, maybe, to multiply them.
And this is a linear classifier.
So here our F of X, W is just equal to the W times X.
Probably the simplest equation you can imagine.
So here,
if you kind of unpack the dimensions of these things,
we recall that our image was maybe 32 by 32 by 3 values.
So then, we're going to take those values and then stretch
them out into a long column vector
that has 3,072 by one entries.
And now we want to end up with 10 class scores.
We want to end up with 10 numbers for this image
giving us the scores for each of the 10 categories.
Which means that now our matrix, W,
needs to be ten by 3072.
So that once we multiply these two things out
then we'll end up with a single column vector
10 by one, giving us our 10 class scores.
Also sometimes, you'll typically see this,
we'll often add a bias term
which will be a constant vector of 10 elements
that does not interact with the training data,
and instead just gives us some sort of data independent
preferences for some classes over another.
So you might imagine that if you're dataset was
unbalanced and had many more cats than dogs,
for example, then the bias elements corresponding
to cat would be higher than the other ones.
So if you kind of think about pictorially
what this function is doing,
in this figure we have an example on the left
of a simple image with just a two by two image,
so it has four pixels total.
So the way that the linear classifier works
is that we take this two by two image,
we stretch it out into a column vector
with four elements,
and now, in this example, we are just restricting to
three classes, cat, dog, and ship,
because you can't fit 10 on a slide,
and now our weight matrix is going to be four by three,
so we have four pixels and three classes.
And now, again, we have a three element bias vector
that gives us data independent bias terms
for each category.
Now we see that the cat score is going to be the enter
product between the pixels of our image
and this row in the weight matrix
added together with this bias term.
So, when you look at it this way
you can kind of understand linear classification
as almost a template matching approach.
Where each of the rows in this matrix
correspond to some template of the image.
And now the enter product or dot product
between the row of the matrix and the column
giving the pixels of the image,
computing this dot product kind of gives us
a similarity between this template for the class
and the pixels of our image.
And then bias just, again, gives you this data
independence scaling offset to each of the classes.
If we think about linear classification
from this viewpoint of template matching
we can actually take the rows of that weight matrix
and unravel them back into images
and actually visualize those templates as images.
And this gives us some sense of what a linear
classifier might actually be doing
to try to understand our data.
So, in this example, we've gone ahead and trained
a linear classifier on our images.
And now on the bottom we're visualizing
what are those rows in that learned weight matrix
corresponding to each of the 10 categories
in CIFAR-10.
And in this way we kind of get a sense for what's
going on in these images.
So, for example, in the left, on the bottom left,
we see the template for the plane class,
kind of consists of this like blue blob,
this kind of blobby thing in the middle
and maybe blue in the background,
which gives you the sense that this linear classifier
for plane is maybe looking for blue stuff
and blobby stuff, and those features are going to cause
the classifier to like planes more.
Or if we look at this car example,
we kind of see that there's a red blobby thing
through the middle and a blue blobby thing at the top
that maybe is kind of a blurry windshield.
But this is a little bit weird,
this doesn't really look like a car.
No individual car actually looks like this.
So the problem is that the linear classifier
is only learning one template for each class.
So if there's sort of variations in how that class
might appear,
it's trying to average out all those different variations,
all those different appearances,
and use just one single template
to recognize each of those categories.
We can also see this pretty explicitly in the horse
classifier.
So in the horse classifier we see green stuff on the bottom
because horses are usually on grass.
And then, if you look carefully, the horse actually
seems to have maybe two heads, one head on each side.
And I've never seen a horse with two heads.
But the linear classifier is just doing the best
that it can, because it's only allowed to learn
one template per category.
And as we move forward into neural networks
and more complex models,
we'll be able to achieve much better accuracy
because they no longer have this restriction
of just learning a single template per category.
Another viewpoint of the linear classifier
is to go back to this idea of images
as points and high dimensional space.
And you can imagine that each of our images
is something like a point in this high dimensional space.
And now the linear classifier is putting in these
linear decision boundaries to try to draw linear
separation between one category
and the rest of the categories.
So maybe up on the upper-left hand side
we see these training examples of airplanes
and throughout the process of training
the linear classier will go and try to draw this
blue line to separate out with a single line
the airplane class from all the rest of the classes.
And it's actually kind of fun if you watch during
the training process these lines will start out randomly
and then go and snap into place to try to separate
the data properly.
But when you think about linear classification
in this way, from this high dimensional point of view,
you can start to see again what are some of the problems
that might come up with linear classification.
And it's not too hard to construct examples
of datasets where a linear classifier will totally fail.
So, one example, on the left here,
is that, suppose we have a dataset of two categories,
and these are all maybe somewhat artificial,
but maybe our dataset has two categories,
blue and red.
And the blue categories are the number of pixels
in the image, which are greater than zero, is odd.
And anything where the number of pixels greater
than zero is even, we want to classify as the red category.
So if you actually go and draw what these different
decisions regions look like in the plane,
you can see that our blue class with an odd number of pixels
is going to be these two quadrants in the plane,
and even will be the opposite two quadrants.
So now, there's no way that we can draw a single linear line
to separate the blue from the red.
So this would be an example where a linear classifier
would really struggle.
And this is maybe not such an artificial thing after all.
Instead of counting pixels,
maybe we're actually trying to count whether the number
of animals or people in an image is odd or even.
So this kind of a parity problem
of separating odds from evens
is something that linear classification
really struggles with traditionally.
Other situations where a linear classifier really struggles
are multimodal situations.
So here on the right,
maybe our blue category has these three different islands
of where the blue category lives,
and then everything else is some other category.
So, for something like horses,
we saw on the previous example,
is something where this actually might be happening
in practice.
Where there's maybe one island in the pixel space of
horses looking to the left,
and another island of horses looking to the right.
And now there's no good way to draw a single linear
boundary between these two isolated islands of data.
So anytime where you have multimodal data,
like one class
that can appear in different regions of space,
is another place where linear classifiers might struggle.
So there's kind of a lot of problems with
linear classifiers, but it is a super simple algorithm,
super nice and easy to interpret and easy to understand.
So you'll actually be implementing these things
on your first homework assignment.
At this point,
we kind of talked about
what is the functional form corresponding to a
linear classifier.
And we've seen that this functional form
of matrix vector multiply
corresponds this idea of template matching
and learning a single template for each category
in your data.
And then once we have this trained matrix
you can use it to actually go and get your scores
for any new training example.
But what we have not told you is
how do you actually go about choosing the right W
for your dataset.
We've just talked about what is the functional form
and what is going on with this thing.
So that's something we'll really focus on next time.
And next lecture we'll talk about
what are the strategies and algorithms
for choosing the right W.
And this will lead us to questions
of loss functions and optimization
and eventually ConvNets.
So, that's a bit of the preview for next week.
And that's all we have for today.