[MUSIC PLAYING] Welcome back. We've covered a lot of ground already, so today I want to review and reinforce concepts. To do that, we'll explore two things. First, we'll code up a basic pipeline for supervised learning. I'll show you how multiple classifiers can solve the same problem. Next, we'll build up a little more intuition for what it means for an algorithm to learn something from data, because that sounds kind of magical, but it's not.

To kick things off, let's look at a common experiment you might want to do. Imagine you're building a spam classifier. That's just a function that labels an incoming email as spam or not spam. Now, say you've already collected a data set and you're ready to train a model. But before you put it into production, there's a question you need to answer first -- how accurate will it be when you use it to classify emails that weren't in your training data? As best we can, we want to verify our models work well before we deploy them. And we can do an experiment to help us figure that out. One approach is to partition our data set into two parts. We'll call these Train and Test. We'll use Train to train our model and Test to see how accurate it is on new data. That's a common pattern, so let's see how it looks in code.

To kick things off, let's import a data set into scikit-learn. We'll use Iris again, because it's handily included. Now, we already saw Iris in episode two. But what we haven't seen before is that I'm calling the features x and the labels y. Why is that? Well, that's because one way to think of a classifier is as a function. At a high level, you can think of x as the input and y as the output. I'll talk more about that in the second half of this episode.

After we import the data set, the first thing we want to do is partition it into Train and Test. And to do that, we can import a handy utility, and it makes the syntax clear. We're taking our x's and our y's, or our features and labels, and partitioning them into two sets. X_train and y_train are the features and labels for the training set, and X_test and y_test are the features and labels for the testing set. Here, I'm just saying that I want half the data to be used for testing. So if we have 150 examples in Iris, 75 will be in Train and 75 will be in Test.

Now we'll create our classifier. I'll use two different types here to show you how they accomplish the same task. Let's start with the decision tree we've already seen. Note there are only two lines of code that are classifier-specific. Now let's train the classifier using our training data. At this point, it's ready to be used to classify data. And next, we'll call the predict method and use it to classify our testing data. If you print out the predictions, you'll see they're a list of numbers. These correspond to the type of Iris the classifier predicts for each row in the testing data.

Now let's see how accurate our classifier was on the testing set. Recall that up top, we have the true labels for the testing data. To calculate our accuracy, we can compare the predicted labels to the true labels and tally up the score. There's a convenience method in scikit-learn we can import to do that. Notice here, our accuracy was over 90%. If you try this on your own, it might be a little bit different because of some randomness in how the Train/Test data is partitioned.

Now, here's something interesting. By replacing these two lines, we can use a different classifier to accomplish the same task.
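Here's a minimal sketch of that pipeline against the current scikit-learn API (older releases import train_test_split from sklearn.cross_validation rather than sklearn.model_selection, and the variable names below are just illustrative, not necessarily the exact code shown on screen):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Import the data set; call the features x and the labels y.
iris = load_iris()
X = iris.data
y = iris.target

# Partition into Train and Test, using half the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# The two classifier-specific lines: create and train a decision tree.
my_classifier = DecisionTreeClassifier()
my_classifier.fit(X_train, y_train)

# Classify the testing data, then compare predictions to the true labels.
predictions = my_classifier.predict(X_test)
print(predictions)
print(accuracy_score(y_test, predictions))
```

Because the Train/Test split is random, the accuracy you see will vary a little from run to run.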
Instead of using a decision tree, we'll use one called KNeighborsClassifier. If we run our experiment, we'll see that the code works in exactly the same way. The accuracy may be different when you run it, because this classifier works a little bit differently and because of the randomness in the Train/Test split. Likewise, if we wanted to use a more sophisticated classifier, we could just import it and change these two lines. Otherwise, our code is the same. The takeaway here is that while there are many different types of classifiers, at a high level, they have a similar interface.

Now let's talk a little bit more about what it means to learn from data. Earlier, I said we called the features x and the labels y, because they were the input and output of a function. Now, of course, a function is something we already know from programming. def classify -- there's our function. As we already know, in supervised learning we don't want to write this ourselves. We want an algorithm to learn it from training data. So what does it mean to learn a function? Well, a function is just a mapping from input to output values. Here's a function you might have seen before: y = mx + b. That's the equation for a line, and there are two parameters -- m, which gives the slope, and b, which gives the y-intercept. Given these parameters, of course, we can plot the function for different values of x. Now, in supervised learning, our classify function might have some parameters as well, but the input x is the features for an example we want to classify, and the output y is a label, like Spam or Not Spam, or a type of flower.

So what could the body of the function look like? Well, that's the part we want to write algorithmically, or in other words, learn. The important thing to understand here is that we're not starting from scratch and pulling the body of the function out of thin air. Instead, we start with a model. And you can think of a model as the prototype for, or the rules that define, the body of our function. Typically, a model has parameters that we can adjust with our training data. And here's a high-level example of how this process works.

Let's look at a toy data set and think about what kind of model we could use as a classifier. Pretend we're interested in distinguishing between red dots and green dots, some of which I've drawn here on a graph. To do that, we'll use just two features -- the x- and y-coordinates of a dot. Now let's think about how we could classify this data. We want a function that considers a new dot it's never seen before and classifies it as red or green. In fact, there might be a lot of data we want to classify. Here, I've drawn our testing examples in light green and light red. These are dots that weren't in our training data. The classifier has never seen them before, so how can it predict the right label?

Well, imagine if we could somehow draw a line across the data like this. Then we could say the dots to the left of the line are green and dots to the right of the line are red. And this line can serve as our classifier. So how can we learn this line? Well, one way is to use the training data to adjust the parameters of a model. And let's say the model we use is a simple straight line like we saw before. That means we have two parameters to adjust -- m and b. And by changing them, we can change where the line appears. So how could we learn the right parameters? Well, one idea is that we can iteratively adjust them using our training data.
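Continuing the same sketch (reusing X_train, y_train, X_test, y_test, and accuracy_score from the block above), swapping classifiers really is just a change to the two classifier-specific lines; this is illustrative rather than the exact code on screen:

```python
# Instead of a decision tree, use k-nearest neighbors. Only these two
# lines change -- the rest of the pipeline stays exactly the same.
from sklearn.neighbors import KNeighborsClassifier

my_classifier = KNeighborsClassifier()
my_classifier.fit(X_train, y_train)

predictions = my_classifier.predict(X_test)
print(accuracy_score(y_test, predictions))
```

A more sophisticated classifier would drop in the same way: import it, construct it, and the fit/predict interface is unchanged.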
For example, we might start with a random line and use it to classify the first training example. If it gets it right, we don't need to change our line, so we move on to the next one. But on the other hand, if it gets it wrong, we could slightly adjust the parameters of our model to make it more accurate. The takeaway here is this: one way to think of learning is using training data to adjust the parameters of a model.

Now, here's something really special. It's called TensorFlow Playground (playground.tensorflow.org). This is a beautiful example of a neural network you can run and experiment with right in your browser. Now, this deserves its own episode for sure, but for now, go ahead and play with it. It's awesome. The playground comes with different data sets you can try out. Some are very simple. For example, we could use our line to classify this one. Some data sets are much more complex. This data set is especially hard -- see if you can build a network to classify it. Now, you can think of a neural network as a more sophisticated type of classifier, like a decision tree or a simple line. But in principle, the idea is similar.

OK. Hope that was helpful. I just created a Twitter account that you can follow to be notified of new episodes. And the next one should be out in a couple of weeks, depending on how much work I'm doing for Google I/O. Thanks, as always, for watching, and I'll see you next time.
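As a closing illustration of the episode's takeaway -- using training data to adjust the parameters of a model -- here is a small, self-contained sketch of the line-learning idea described above. It is my own illustration, not code from the video: the dots are made up, the update rule is a perceptron-style nudge, and it classifies by above/below the line, which is the same idea as the left/right split described earlier.

```python
import random

# Made-up toy data: (x, y) coordinates of dots, labeled +1 for green and -1 for red.
training_data = [((1.0, 3.0), +1), ((2.0, 4.5), +1), ((3.0, 6.0), +1),
                 ((1.5, 0.5), -1), ((2.5, 1.0), -1), ((3.5, 2.0), -1)]

# Start with a random line y = m*x + b.
m = random.uniform(-1.0, 1.0)
b = random.uniform(-1.0, 1.0)
learning_rate = 0.1

for _ in range(100):  # several passes over the training data
    for (px, py), label in training_data:
        # Classify: +1 (green) if the dot is above the current line, else -1 (red).
        prediction = 1 if py > m * px + b else -1
        if prediction != label:
            # Got it wrong: slightly adjust the parameters so the line moves
            # toward putting this dot on the correct side.
            m -= learning_rate * label * px
            b -= learning_rate * label

print("learned line: y = %.2f * x + %.2f" % (m, b))

# Classify a new dot the model has never seen before.
new_dot = (2.0, 5.0)
print("green" if new_dot[1] > m * new_dot[0] + b else "red")
```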