Let's Write a Pipeline - Machine Learning Recipes #4

[MUSIC PLAYING]
Welcome back.
We've covered a lot of ground already,
so today I want to review and reinforce concepts.
To do that, we'll explore two things.
First, we'll code up a basic pipeline
for supervised learning.
I'll show you how multiple classifiers
can solve the same problem.
Next, we'll build up a little more intuition
for what it means for an algorithm to learn something
from data, because that sounds kind of magical, but it's not.
To kick things off, let's look at a common experiment
you might want to do.
Imagine you're building a spam classifier.
That's just a function that labels an incoming email
as spam or not spam.
Now, say you've already collected a data set
and you're ready to train a model.
But before you put it into production,
there's a question you need to answer first--
how accurate will it be when you use it to classify emails that
weren't in your training data?
As best we can, we want to verify our models work well
before we deploy them.
And we can do an experiment to help us figure that out.
One approach is to partition our data set into two parts.
We'll call these Train and Test.
We'll use Train to train our model
and Test to see how accurate it is on new data.
That's a common pattern, so let's see how it looks in code.
To kick things off, let's import a data set into scikit-learn.
We'll use Iris again, because it's handily included.
Now, we already saw Iris in episode two.
But what we haven't seen before is
that I'm calling the features x and the labels y.
Why is that?
Well, that's because one way to think of a classifier
is as a function.
At a high level, you can think of x as the input
and y as the output.
I'll talk more about that in the second half of this episode.
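The transcript does not include the on-screen code, so here is a minimal sketch of what this import step might look like. It assumes scikit-learn is installed and uses its bundled `load_iris` loader; the names `X` and `y` follow the convention described above.

```python
from sklearn import datasets

# Load the Iris data set that ships with scikit-learn.
iris = datasets.load_iris()

X = iris.data    # features: one row of four measurements per flower
y = iris.target  # labels: 0, 1, or 2, one per flower

print(X.shape, y.shape)
```

Thinking of a classifier as a function f(X) = y is what motivates the naming.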
After we import the data set, the first thing we want to do
is partition it into Train and Test.
And to do that, we can import a handy utility,
and it makes the syntax clear.
We're taking our x's and our y's,
or our features and labels, and partitioning them
into two sets.
X_train and y_train are the features and labels
for the training set.
And X_test and y_test are the features and labels
for the testing set.
Here, I'm just saying that I want half the data to be
used for testing.
So if we have 150 examples in Iris, 75 will be in Train
and 75 will be in Test.
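A sketch of the partitioning step, assuming the utility is scikit-learn's `train_test_split`. In current scikit-learn releases it lives in `sklearn.model_selection`; that import path is an assumption about your installed version, since the transcript does not show it.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# test_size=0.5 asks for half of the examples to be held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

print(len(X_train), len(X_test))
```

The split is shuffled randomly on each run; passing a fixed `random_state` makes it repeatable.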
Now we'll create our classifier.
I'll use two different types here
to show you how they accomplish the same task.
Let's start with the decision tree we've already seen.
Note there are only two lines of code
that are classifier-specific.
Now let's train the classifier using our training data.
At this point, it's ready to be used to classify data.
And next, we'll call the predict method
and use it to classify our testing data.
If you print out the predictions,
you'll see it's a list of numbers.
These correspond to the type of Iris
the classifier predicts for each row in the testing data.
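The create, train, and predict steps might look like the sketch below. The variable name `my_classifier` is a hypothetical choice for illustration; the two classifier-specific lines are marked in the comments.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # classifier-specific line 1

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5
)

my_classifier = DecisionTreeClassifier()  # classifier-specific line 2

# Train on the training half only.
my_classifier.fit(X_train, y_train)

# Predict a label for every row of the testing half.
predictions = my_classifier.predict(X_test)
print(predictions)
```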
Now let's see how accurate our classifier
was on the testing set.
Recall that up top, we have the true labels for the testing
data.
To calculate our accuracy, we can
compare the predicted labels to the true labels,
and tally up the score.
There's a convenience method in scikit-learn
we can import to do that.
Notice here, our accuracy was over 90%.
If you try this on your own, it might be a little bit different
because of some randomness in how the Train/Test
data is partitioned.
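As a sketch of the scoring step, the snippet below tallies the matches by hand and then does the same with scikit-learn's `accuracy_score`, which is assumed to be the convenience method meant here. The exact number printed will vary from run to run because the split is random.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5
)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Manual tally: the fraction of predictions that match the true labels.
manual_accuracy = np.mean(predictions == y_test)

# The same calculation via the library helper.
accuracy = accuracy_score(y_test, predictions)

print(manual_accuracy, accuracy)
```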
Now, here's something interesting.
By replacing these two lines, we can use a different classifier
to accomplish the same task.
Instead of using a decision tree,
we'll use one called k-nearest neighbors.
If we run our experiment, we'll see that the code
works in exactly the same way.
The accuracy may be different when you run it,
because this classifier works a little bit differently
and because of the randomness in the Train/Test split.
Likewise, if we wanted to use a more sophisticated classifier,
we could just import it and change these two lines.
Otherwise, our code is the same.
The takeaway here is that while there are many different types
of classifiers, at a high level, they have a similar interface.
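To illustrate the shared interface, here is a sketch that runs the same experiment with both classifiers in a loop. It assumes scikit-learn's `KNeighborsClassifier` for k-nearest neighbors; the dictionary and loop are an illustrative arrangement, not necessarily how the video's code is laid out.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5
)

# Only the construction differs; fit and predict are called the same way.
classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, scores[name])
```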
Now let's talk a little bit more about what
it means to learn from data.
Earlier, I said we called the features x and the labels y,
because they were the input and output of a function.
Now, of course, a function is something we already
know from programming.
def classify-- there's our function.
As we already know in supervised learning,
we don't want to write this ourselves.
We want an algorithm to learn it from training data.
So what does it mean to learn a function?
Well, a function is just a mapping from input
to output values.
Here's a function you might have seen before-- y
equals mx plus b.
That's the equation for a line, and there
are two parameters-- m, which gives the slope;
and b, which gives the y-intercept.
Given these parameters, of course,
we can plot the function for different values of x.
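Written as code, the line equation is a function of x with the two parameters passed in. The function name `line` and the sample values of m and b are chosen only for illustration.

```python
def line(x, m, b):
    """Return y = m * x + b for slope m and y-intercept b."""
    return m * x + b

# Evaluate the same line (m=2, b=1) at a few values of x.
for x in [0, 1, 2, 3]:
    print(x, line(x, m=2, b=1))
```

Changing m or b gives a different line, which is the sense in which they are parameters.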
Now, in supervised learning, our classify function
might have some parameters as well,
but the input x are the features for an example we
want to classify, and the output y
is a label, like Spam or Not Spam, or a type of flower.
So what could the body of the function look like?
Well, that's the part we want to write algorithmically
or in other words, learn.
The important thing to understand here
is we're not starting from scratch
and pulling the body of the function out of thin air.
Instead, we start with a model.
And you can think of a model as the prototype for
or the rules that define the body of our function.
Typically, a model has parameters
that we can adjust with our training data.
And here's a high-level example of how this process works.
Let's look at a toy data set and think about what kind of model
we could use as a classifier.
Pretend we're interested in distinguishing
between red dots and green dots, some of which
I've drawn here on a graph.
To do that, we'll use just two features--
the x- and y-coordinates of a dot.
Now let's think about how we could classify this data.
We want a function that considers
a new dot it's never seen before,
and classifies it as red or green.
In fact, there might be a lot of data we want to classify.
Here, I've drawn our testing examples
in light green and light red.
These are dots that weren't in our training data.
The classifier has never seen them before, so how can
it predict the right label?
Well, imagine if we could somehow draw a line
across the data like this.
Then we could say the dots to the left
of the line are green and dots to the right of the line are
red.
And this line can serve as our classifier.
So how can we learn this line?
Well, one way is to use the training data to adjust
the parameters of a model.
And let's say the model we use is a simple straight line
like we saw before.
That means we have two parameters to adjust-- m and b.
And by changing them, we can change where the line appears.
So how could we learn the right parameters?
Well, one idea is that we can iteratively adjust
them using our training data.
For example, we might start with a random line
and use it to classify the first training example.
If it gets it right, we don't need to change our line,
so we move on to the next one.
But on the other hand, if it gets it wrong,
we could slightly adjust the parameters of our model
to make it more accurate.
The takeaway here is this.
One way to think of learning is using training data
to adjust the parameters of a model.
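Here is a rough sketch of that idea on a made-up toy data set. Several things are assumptions for illustration: the dots and their coordinates are invented, a dot is called green when it lies above the line y = mx + b (rather than to its left), the line starts at m = 0, b = 0 instead of a random position so the run is repeatable, and the nudging rule is a simple perceptron-style update that is not guaranteed to settle on every data set.

```python
def predict(point, m, b):
    """Label a dot green if it lies above the line y = m*x + b, else red."""
    x, y = point
    return "green" if y > m * x + b else "red"

def train(points, labels, m=0.0, b=0.0, step=0.1, epochs=10):
    """Pass over the training dots, nudging m and b after each mistake."""
    for _ in range(epochs):
        for (x, y), label in zip(points, labels):
            if predict((x, y), m, b) == label:
                continue  # correct: leave the line alone
            # A red dot called green sits above the line, so raise the line
            # at that x; a green dot called red means the line must come down.
            direction = 1.0 if label == "red" else -1.0
            m += direction * step * x
            b += direction * step
    return m, b

# Invented toy data: green dots above y = x, red dots below it.
points = [(0, 1), (1, 2), (2, 3), (1, 0), (2, 1), (3, 2)]
labels = ["green", "green", "green", "red", "red", "red"]

m, b = train(points, labels)
learned = [predict(p, m, b) for p in points]
print(m, b, learned)
```

On this toy data the line stops changing after a few passes, once every training dot is on the correct side.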
Now, here's something really special.
It's called the TensorFlow Playground.
This is a beautiful example of a neural network
you can run and experiment with right in your browser.
Now, this deserves its own episode for sure,
but for now, go ahead and play with it.
It's awesome.
The playground comes with different data
sets you can try out.
Some are very simple.
For example, we could use our line to classify this one.
Some data sets are much more complex.
This data set is especially hard.
And see if you can build a network to classify it.
Now, you can think of a neural network
as a classifier, just like a decision tree or a simple line,
only more sophisticated.
But in principle, the idea is similar.
OK.
Hope that was helpful.
I just created a Twitter that you can follow
to be notified of new episodes.
And the next one should be out in a couple of weeks,
depending on how much work I'm doing for Google I/O. Thanks,
as always, for watching, and I'll see you next time.