[MUSIC PLAYING]
YUFENG GUO: From detecting skin cancer
to sorting cucumbers to detecting escalators
in need of repair, machine learning
has granted computer systems entirely new abilities.
But how does it really work under the hood?
Let's walk through a basic example
and use it as an excuse to talk about the process of getting
answers from your data using machine learning.
Welcome to Cloud AI Adventures.
My name is Yufeng Guo.
On this show, we'll explore the art, science,
and tools of machine learning.
Let's pretend that we've been asked
to create a system that answers the question of whether a drink
is wine or beer.
This question answering system that we build
is called a model, and this model
is created via a process called training.
In machine learning, the goal of training
is to create an accurate model that answers our questions
correctly most of the time.
But in order to train the model, we
need to collect data to train on.
This is where we will begin.
Our data will be collected from glasses of wine and beer.
There are many aspects of drinks that we could collect data on--
everything from the amount of foam to the shape of the glass.
But for our purposes, we'll just pick two simple ones--
the color as a wavelength of light and the alcohol content
as a percentage.
The hope is that we can split our two types of drinks
along these two factors alone.
We'll call these our features from now on--
color and alcohol.
The first step to our process will
be to run out to the local grocery store,
buy up a bunch of different drinks,
and get some equipment to do our measurements-- a spectrometer
for measuring the color and a hydrometer
to measure the alcohol content.
It appears that our grocery store has an electronics
hardware section as well.
Once our equipment and booze are all set up, it's time for our first real step of machine learning-- gathering that data.
This step is very important because the quality
and quantity of data that you gather
will directly determine how good your predictive model can be.
In this case, the data we collect
will be the color and alcohol content of each drink.
This will yield us a table of color, alcohol content,
and whether it's beer or wine.
This will be our training data.
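To make the shape of that training data concrete, here is a rough sketch of what it might look like as a couple of numpy arrays. The numbers below are made up purely for illustration-- they are not real measurements.

    import numpy as np

    # Each row is one drink: [color as a wavelength in nm, alcohol content in %].
    # The matching label is 1 for beer and 0 for wine.
    # These values are illustrative placeholders, not real measurements.
    X = np.array([
        [510.0,  5.0],   # a pale beer
        [640.0, 13.0],   # a red wine
        [490.0,  4.5],   # another beer
        [620.0, 12.5],   # another wine
    ])
    y = np.array([1, 0, 1, 0])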
So a few hours of measurements later, we've
gathered our training data and had a few drinks, perhaps.
And now it's time for our next step of machine learning--
data preparation--
where we load our data into a suitable place
and prepare it for use in our machine learning training.
We'll first put all our data together then randomize
the ordering.
We wouldn't want the order of our data
to affect how we learn since that's not
part of determining whether a drink is beer or wine.
In other words, we want to make a determination of what
a drink is independent of what drink came before or after it
in the sequence.
This is also a good time to do any pertinent visualizations of your data, helping you see if there are any relevant relationships between different variables, as well as showing you if there are any data imbalances.
For instance, if we collected way more data points about beer
than wine, the model we train will be heavily biased
toward guessing that virtually everything that it sees
is beer since it would be right most of the time.
However, in the real world, the model may see beer and wine in equal amounts, which would mean that it would be guessing beer wrong half the time.
We also need to split the data into two parts.
The first part used in training our model
will be the majority of our dataset.
The second part will be used for evaluating our trained model's performance.
We don't want to use the same data that the model was trained
on for evaluation since then it would just
be able to memorize the questions,
just as you wouldn't want to use the questions from your math
homework on the math exam.
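Continuing the hypothetical numpy sketch from before, the shuffling and an 80/20 split might look something like this (a real dataset would of course have far more than a handful of rows):

    rng = np.random.default_rng(seed=42)

    # Shuffle so the order the drinks were measured in can't influence training.
    order = rng.permutation(len(X))
    X, y = X[order], y[order]

    # Keep 80% for training and hold out 20% for evaluation later.
    split = int(0.8 * len(X))
    X_train, y_train = X[:split], y[:split]
    X_eval,  y_eval  = X[split:], y[split:]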
Sometimes the data we collected needs other forms of adjusting and manipulation-- things like de-duplication, normalization, error correction, and others. These would all happen at the data preparation step. In our case, we don't have any further data preparation needs, so let's move forward.
The next step in our workflow is choosing a model.
There are many models that researchers and data scientists
have created over the years.
Some are very well suited for image data, others
for sequences, such as text or music, some for numerical data,
and others for text-based data.
In our case, we have just two features-- color and alcohol
percentage.
We can use a small linear model, which
is a fairly simple one that will get the job done.
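As a rough sketch, a small linear model over our two features could be written like this-- the sigmoid squashing and the variable names are my own choices for illustration, not anything prescribed above:

    def predict(X, w, b):
        """A small linear model: a weighted sum of the two features plus a bias,
        squashed through a sigmoid so the output reads as P(drink is beer)."""
        z = X @ w + b
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))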
Now we move on to what is often considered
the bulk of machine learning--
the training.
In this step, we'll use our data to incrementally improve
our model's ability to predict whether a given
drink is wine or beer.
In some ways, this is similar to someone
first learning to drive.
At first, they don't know how any of the pedals, knobs,
and switches work or when they should be pressed or used.
However, after lots of practice and correcting
for their mistakes, a licensed driver emerges.
Moreover, after a year of driving,
they've become quite adept at driving.
The act of driving and reacting to real-world data
has adapted their driving abilities, honing their skills.
We will do this on a much smaller scale with our drinks. Our small linear model amounts to drawing a straight line through the data and adjusting it. In particular, the formula for a straight line is y equals mx plus b, where x is the input, m is the slope of the line, b is the y-intercept, and y is the value of the line at that position x.
The values we have available to us to adjust or train
are just m and b, where the m is that slope and b is
the y-intercept.
There is no other way to affect the position of the line
since the only other variables are x, our input, and y,
our output.
In machine learning, there are many m's
since there may be many features.
The collection of these values is usually
formed into a matrix that is denoted
w for the weights matrix.
Similarly, for b, we arrange them together, and those are called the biases.
The training process involves initializing some random values
for w and b and attempting to predict
the outputs with those values.
As you might imagine, it does pretty poorly at first,
but we can compare our model's predictions with the output
that it should have produced and adjust the values in w
and b such that we will have more accurate predictions
on the next time around.
So this process then repeats.
Each iteration or cycle of updating the weights and biases
is called one training step.
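In code, one training step for our sketch of a linear model might look roughly like this-- plain gradient descent, which is a common choice but certainly not the only one (in practice you would usually also scale the features first so the updates stay well behaved):

    # Start from small random weights and a zero bias.
    w = rng.normal(scale=0.01, size=X_train.shape[1])
    b = 0.0

    def training_step(X, y, w, b, learning_rate=0.01):
        """One cycle: predict, compare with the true labels, then nudge w and b."""
        p = predict(X, w, b)
        error = p - y                                   # how far off each prediction is
        w = w - learning_rate * (X.T @ error) / len(y)
        b = b - learning_rate * error.mean()
        return w, b

    w, b = training_step(X_train, y_train, w, b)        # one step; training repeats this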
So let's look at what that means more concretely
for our dataset.
When we first start the training,
it's like we drew a random line through the data.
Then as each step of the training progresses,
the line moves step by step closer
to the ideal separation of the wine and beer.
Once training is complete, it's time to see if the model is any good, using evaluation. This is where the dataset that we set aside earlier comes into play.
Evaluation allows us to test our model
against data that has never been used for training.
This metric allows us to see how the model might
perform against data that it has not yet seen.
This is meant to be representative of how
the model might perform in the real world.
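Continuing the sketch, evaluation can be as simple as measuring accuracy on that held-out data:

    # Accuracy on the evaluation set: data the model never saw during training.
    predictions = predict(X_eval, w, b) >= 0.5
    accuracy = (predictions == y_eval).mean()
    print(f"evaluation accuracy: {accuracy:.2f}")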
A good rule of thumb I use for a training-evaluation split is
somewhere on the order of 80%-20% or 70%-30%.
Much of this depends on the size of the original source dataset.
If you have a lot of data, perhaps you
don't need as big of a fraction for the evaluation dataset.
Once you've done evaluation, it's
possible that you want to see if you can further improve
your training in any way.
We can do this by tuning some of our parameters.
There were a few that we implicitly assumed when we did our training, and now is a good time to go back, test those assumptions, and try other values.
One example of a parameter we can tune
is how many times we run through the training set
during training.
We can actually show the model the data multiple times, and doing so can potentially lead to higher accuracy.
Another parameter is learning rate.
This defines how far we shift the line
during each step based on the information
from the previous training step.
These values all play a role in how accurate our model can
become and how long the training takes.
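To make that concrete, a hypothetical sweep over those two hyperparameters in our sketch could look like this-- each combination gets a fresh model, is trained, and is then scored on the evaluation set:

    for learning_rate in (0.001, 0.01, 0.1):
        for epochs in (10, 100, 1000):
            # Fresh random starting point for each combination.
            w = rng.normal(scale=0.01, size=X_train.shape[1])
            b = 0.0
            for _ in range(epochs):
                w, b = training_step(X_train, y_train, w, b, learning_rate)
            acc = ((predict(X_eval, w, b) >= 0.5) == y_eval).mean()
            print(f"lr={learning_rate}, epochs={epochs}: eval accuracy {acc:.2f}")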
For more complex models, initial conditions
can play a significant role as well in determining
the outcome of training.
Differences can be seen depending
on whether a model starts off training
with values initialized at zeros versus some distribution
of the values and what that distribution is.
As you can see, there are many considerations
at this phase of training, and it's important
that you define what makes a model good enough for you.
Otherwise, we might find ourselves tweaking parameters
for a very long time.
Now, these parameters are typically
referred to as hyperparameters.
The adjustment or tuning of these hyperparameters
still remains a bit more of an art than a science,
and it's an experimental process that
heavily depends on the specifics of your dataset, model,
and training process.
Once you're happy with your training and hyperparameters,
guided by the evaluation step, it's
finally time to use your model to do something useful.
Machine learning is using data to answer questions,
so prediction or inference is that step where we finally
get to answer some questions.
This is the point of all of this work where the value of machine
learning is realized.
We can finally use our model to predict whether a given
drink is wine or beer, given its color and alcohol percentage.
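In the sketch we've been building up, that prediction step is a single call on a brand-new measurement (again, the numbers are made up for illustration):

    # A new, unlabeled drink: [color in nm, alcohol content in %].
    new_drink = np.array([[500.0, 5.5]])
    prob_beer = predict(new_drink, w, b)[0]
    print("beer" if prob_beer >= 0.5 else "wine")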
The power of machine learning is that we
were able to determine how to differentiate between wine
and beer using our model rather than using human judgment
and manual rules.
You can extrapolate the ideas presented today
to other problem domains as well, where
the same principles apply--
gathering data, preparing that data, choosing a model,
training it and evaluating it, doing your hyperparameter tuning, and finally, prediction.
If you're looking for more ways to play
with training and parameters, check out
the TensorFlow Playground.
It's a completely browser-based machine learning sandbox,
where you can try different parameters
and run training against mock datasets.
And don't worry, you can't break the site.
Of course, we will encounter more steps and nuances
in future episodes, but this serves
as a good foundational framework to help
us think through the problem, giving us a common language
to think about each step and go deeper in the future.
Next time on AI Adventures, we'll
build our first real machine learning model, using code--
no more drawing lines and going over algebra.
[MUSIC PLAYING]