JOSH GORDON: Classifiers are only as good as the features you provide. That means coming up with good features is one of your most important jobs in machine learning. But what makes a good feature, and how can you tell?

If you're doing binary classification, then a good feature makes it easy to decide between two different things. For example, imagine we wanted to write a classifier to tell the difference between two types of dogs-- greyhounds and Labradors. Here we'll use two features-- the dog's height in inches and their eye color.

Just for this toy example, let's make a couple of assumptions about dogs to keep things simple. First, we'll say that greyhounds are usually taller than Labradors. Next, we'll pretend that dogs have only two eye colors-- blue and brown. And we'll say the color of their eyes doesn't depend on the breed of dog. This means that one of these features is useful and the other tells us nothing. To understand why, we'll visualize them using a toy dataset I'll create.

Let's begin with height. How useful do you think this feature is? Well, on average, greyhounds tend to be a couple of inches taller than Labradors, but not always. There's a lot of variation in the world. So when we think about a feature, we have to consider how it looks across different values in a population.

Let's head into Python for a programmatic example. I'm creating a population of 1,000 dogs-- 50-50 greyhounds and Labradors. I'll give each of them a height. For this example, we'll say that greyhounds are on average 28 inches tall and Labradors are 24. Now, all dogs are a bit different. Let's say that height is normally distributed, so we'll make both of these plus or minus 4 inches. This will give us two arrays of numbers, and we can visualize them in a histogram. I'll add a parameter so greyhounds are in red and Labradors are in blue. Now we can run our script.
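The script itself only appears on screen in the video, so here's a minimal sketch of what it looks like based on the description above, assuming NumPy and Matplotlib; the variable names are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

# A population of 1,000 dogs: 500 greyhounds and 500 Labradors.
greyhounds = 500
labs = 500

# Heights are normally distributed around each breed's average,
# with a spread of about 4 inches.
grey_height = 28 + 4 * np.random.randn(greyhounds)
lab_height = 24 + 4 * np.random.randn(labs)

# Plot both distributions in one histogram:
# greyhounds in red, Labradors in blue.
plt.hist([grey_height, lab_height], stacked=True, color=['r', 'b'])
plt.show()
```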
This shows how many dogs in our population have a given height. There's a lot of data on the screen, so let's simplify it and look at it piece by piece. We'll start with dogs on the far left of the distribution-- say, those who are about 20 inches tall. Imagine I asked you to predict whether a dog with this height was a Labrador or a greyhound. What would you do? Well, you could figure out the probability of each type of dog given their height. Here, it's more likely the dog is a Labrador. On the other hand, if we go all the way to the right of the histogram and look at a dog who is 35 inches tall, we can be pretty confident they're a greyhound. Now, what about a dog in the middle? You can see the graph gives us less information here, because the probability of each type of dog is close. So height is a useful feature, but it's not perfect. That's why in machine learning, you almost always need multiple features. Otherwise, you could just write an if statement instead of bothering with a classifier.

To figure out what types of features you should use, do a thought experiment. Pretend you're the classifier. If you were trying to figure out whether this dog is a Labrador or a greyhound, what other things would you want to know? You might ask about their hair length, or how fast they can run, or how much they weigh. Exactly how many features you should use is more of an art than a science, but as a rule of thumb, think about how many you'd need to solve the problem.

Now let's look at another feature, like eye color. Just for this toy example, let's imagine dogs have only two eye colors, blue and brown. And let's say the color of their eyes doesn't depend on the breed of dog. Here's what a histogram might look like for this example. For most values, the distribution is about 50/50. So this feature tells us nothing, because it doesn't correlate with the type of dog. Including a useless feature like this in your training data can hurt your classifier's accuracy. That's because there's a chance it might appear useful purely by accident, especially if you have only a small amount of training data.

You also want your features to be independent. Independent features give you different types of information. Imagine we already have a feature-- height in inches-- in our dataset. Ask yourself, would it be helpful if we added another feature, like height in centimeters? No, because it's perfectly correlated with one we already have. It's good practice to remove highly correlated features from your training data. That's because a lot of classifiers aren't smart enough to realize that height in inches and height in centimeters are the same thing, so they might double-count how important this feature is.
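To make that concrete, here's a small sketch (not from the video) showing why a redundant unit-converted feature adds nothing; the sample heights are made up for illustration:

```python
import numpy as np

# A few made-up dog heights in inches.
height_in = np.array([28.1, 24.3, 26.7, 23.9, 29.2])

# The "new" feature is just the same heights converted to centimeters.
height_cm = height_in * 2.54

# The two features are perfectly correlated (coefficient ~1.0),
# so the centimeter column carries no extra information and can be dropped.
print(np.corrcoef(height_in, height_cm)[0, 1])
```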
Last, you want your features to be easy to understand. For a new example, imagine you want to predict how many days it will take to mail a letter between two different cities. The farther apart the cities are, the longer it will take. A great feature to use would be the distance between the cities in miles. A much worse pair of features to use would be the cities' locations, given by their latitude and longitude. And here's why: I can look at the distance and make a good guess of how long it will take the letter to arrive, but learning the relationship between latitude, longitude, and time is much harder and would require many more examples in your training data.

Now, there are techniques you can use to figure out exactly how useful your features are, and even which combinations of them are best, so you never have to leave it to chance. We'll get to those in a future episode. Coming up next time, we'll continue building our intuition for supervised learning. We'll show how different types of classifiers can be used to solve the same problem and dive a little bit deeper into how they work. Thanks very much for watching, and I'll see you then.