# What Makes a Good Feature? - Machine Learning Recipes #3

• JOSH GORDON: Classifiers are only

• as good as the features you provide.

• That means coming up with good features

• is one of your most important jobs in machine learning.

• But what makes a good feature, and how can you tell?

• If you're doing binary classification,

• then a good feature makes it easy to decide

• between two different things.

• For example, imagine we wanted to write a classifier

• to tell the difference between two types of dogs-- greyhounds and Labradors.

• Here we'll use two features-- the dog's height in inches

• and their eye color.

• Just for this toy example, let's make a couple assumptions

• about dogs to keep things simple.

• First, we'll say that greyhounds are usually taller than Labradors.

• Next, we'll pretend that dogs have only two eye

• colors-- blue and brown.

• And we'll say the color of their eyes

• doesn't depend on the breed of dog.

• This means that one of these features is useful

• and the other tells us nothing.

• To understand why, we'll visualize them using a toy

• dataset I'll create.

• Let's begin with height.

• How useful do you think this feature is?

• Well, on average, greyhounds tend

• to be a couple inches taller than Labradors, but not always.

• There's a lot of variation in the world.

• So when we think of a feature, we

• have to consider how it looks for different values

• in a population.

• Let's head into Python for a programmatic example.

• I'm creating a population of 1,000 dogs.

• I'll give each of them a height.

• For this example, we'll say that greyhounds

• are on average 28 inches tall and Labradors are 24.

• Now, all dogs are a bit different.

• Let's say that height is normally distributed,

• so we'll make both of these plus or minus 4 inches.

• This will give us two arrays of numbers,

• and we can visualize them in a histogram.

• I'll add a parameter so greyhounds are in red

• and Labradors are in blue.
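Here's a minimal sketch of the script described above, assuming "plus or minus 4 inches" means a standard deviation of 4 inches and a 50/50 split of the 1,000 dogs (the variable names are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

# 500 of each breed, 1,000 dogs total.
greyhounds = 500
labs = 500

# Heights are normally distributed: means of 28 and 24 inches,
# with a standard deviation of 4 inches for both breeds.
grey_height = 28 + 4 * np.random.randn(greyhounds)
lab_height = 24 + 4 * np.random.randn(labs)

# Histogram of heights: greyhounds in red, Labradors in blue.
plt.hist([grey_height, lab_height], stacked=True, color=['r', 'b'])
plt.show()
```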

• Now we can run our script.

• This shows how many dogs in our population have a given height.

• There's a lot of data on the screen,

• so let's simplify it and look at it piece by piece.

• Let's start with the dogs on the far left of the distribution-- say, who are about 20 inches tall.

• Imagine I asked you to predict whether a dog with this height

• was a lab or a greyhound.

• What would you do?

• Well, you could figure out the probability of each type

• of dog given their height.

• Here, it's more likely the dog is a lab.

• On the other hand, if we go all the way

• to the right of the histogram and look

• at a dog who is 35 inches tall, we

• can be pretty confident they're a greyhound.

• Now, what about a dog in the middle?

• You can see the graph gives us less information

• here, because the probability of each type of dog is close.
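As a rough sketch of that reasoning, here's one way to estimate the empirical probability from the synthetic arrays generated above (the bin width and function name are my own):

```python
import numpy as np

def prob_greyhound(height, grey_height, lab_height, bin_width=1.0):
    """Empirical P(greyhound | height): count dogs of each breed near this height."""
    near = lambda samples: np.sum(np.abs(samples - height) <= bin_width / 2)
    greys, labs = near(grey_height), near(lab_height)
    total = greys + labs
    return greys / total if total else 0.5  # no nearby data: fall back to the 50/50 prior

# prob_greyhound(20, grey_height, lab_height) comes out near 0 (almost certainly a lab),
# prob_greyhound(35, ...) near 1, and prob_greyhound(26, ...) near 0.5.
```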

• So height is a useful feature, but it's not perfect.

• That's why in machine learning, you almost always

• need multiple features.

• Otherwise, you could just write an if statement

• instead of bothering with the classifier.
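To make that concrete, a single clean feature reduces the whole model to a hand-written threshold (26 inches here is just the midpoint of our two assumed means):

```python
def classify_by_height(height):
    # With one decisive feature, a hand-written threshold does the job.
    return 'greyhound' if height > 26 else 'labrador'
```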

• To figure out what types of features you should use,

• do a thought experiment.

• Pretend you're the classifier.

• If you were trying to figure out if this dog is

• a lab or a greyhound, what other things would you want to know?

• You might ask about their hair length, or how fast they can run, or how much they weigh.

• Exactly how many features you should use

• is more of an art than a science,

• but as a rule of thumb, think about how many you'd

• need to solve the problem.

• Now let's look at another feature like eye color.

• Just for this toy example, let's imagine

• dogs have only two eye colors, blue and brown.

• And let's say the color of their eyes

• doesn't depend on the breed of dog.

• Here's what a histogram might look like for this example.

• For most values, the distribution is about 50/50.

• So this feature tells us nothing,

• because it doesn't correlate with the type of dog.

• Including a useless feature like this in your training

• data can hurt your classifier's accuracy.

• That's because there's a chance it might appear useful purely

• by accident, especially if you have only a small amount

• of training data.
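A minimal sketch of both points under our toy assumptions (the names and random seed are my own): the full population splits near 50/50 for both breeds, while a tiny sample can look predictive by accident.

```python
import numpy as np

rng = np.random.default_rng(0)

# Eye color is assigned at random, independent of breed (our toy assumption).
grey_eyes = rng.choice(['blue', 'brown'], size=500)
lab_eyes = rng.choice(['blue', 'brown'], size=500)
print((grey_eyes == 'blue').mean(), (lab_eyes == 'blue').mean())  # both near 0.5

# With only a handful of training examples, the same useless feature
# can correlate with the breed purely by chance.
few_grey = rng.choice(['blue', 'brown'], size=5)
few_lab = rng.choice(['blue', 'brown'], size=5)
print((few_grey == 'blue').mean(), (few_lab == 'blue').mean())  # can differ a lot
```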

• You also want your features to be independent.

• And independent features give you

• different types of information.

• Imagine we already have a feature-- height in inches--

• in our dataset.

• Would it be useful if we added another feature, like height in centimeters?

• No, because it's perfectly correlated with one we already have.

• It's good practice to remove highly correlated features from your training data.

• That's because a lot of classifiers

• aren't smart enough to realize that height in inches

• and height in centimeters are the same thing,

• so they might double count how important this feature is.
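To see how redundant the second feature is, we can check the correlation directly-- a sketch reusing the height arrays from the script above:

```python
import numpy as np

height_in = np.concatenate([grey_height, lab_height])
height_cm = height_in * 2.54  # the "new" feature is just a rescaling

# The Pearson correlation is exactly 1.0: the feature adds no new information.
print(np.corrcoef(height_in, height_cm)[0, 1])
```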

• Last, you want your features to be easy to understand.

• For a new example, imagine you want

• to predict how many days it will take

• to mail a letter between two different cities.

• The farther apart the cities are, the longer it will take.

• A great feature to use would be the distance

• between the cities in miles.

• A much worse pair of features to use

• would be the cities' locations given by their latitude

• and longitude.

• And here's why.

• I can look at the distance and make

• a good guess of how long it will take the letter to arrive.

• But learning the relationship between latitude, longitude,

• and time is much harder and would require many more

• examples in your training data.
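One way to hand the classifier the easier feature is to compute the great-circle distance from the coordinates yourself; here's a sketch using the haversine formula (the function and example cities are my own, not from the video):

```python
from math import radians, sin, cos, asin, sqrt

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3959 * asin(sqrt(a))  # Earth's radius is about 3,959 miles

# e.g. distance_miles(40.71, -74.01, 34.05, -118.24) is roughly 2,450 miles (NYC to LA)
```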

• Now, there are techniques you can

• use to figure out exactly how useful your features are,

• and even what combinations of them are best,

• so you never have to leave it to chance.

• We'll get to those in a future episode.

• Coming up next time, we'll continue building our intuition

• for supervised learning.

• We'll show how different types of classifiers

• can be used to solve the same problem and dive a little bit

• deeper into how they work.

• Thanks very much for watching, and I'll see you then.
