[MUSIC PLAYING]

KARMEL ALLISON: Hi, and welcome to Coding TensorFlow. I'm Karmel Allison, and I'm here to guide you through a scenario using TensorFlow's high-level APIs. This video is the second in a three-part series. In it, we'll dig deeper into preparing the data for machine learning, including using feature columns, categorical data, and much more. We'll also explore a machine learning model built using Keras that can be trained with this data.

In the previous video, we spoke about a complex data set and how you can load it and get it ready to use in TensorFlow. We used the Covertype data set from the US Forest Service and Colorado State University, which has about 500,000 rows of geophysical data collected from particular regions in national forest areas. We are going to use the features in this data set to try to predict the soil type that was found in each region. We took the raw data and put it into a TensorFlow dataset that generates dictionaries of feature tensors and labels, but we still have lots of feature types: some are continuous, some are categorical, some are one-hot encoded. We need to represent these in a way that is meaningful to an ML model. You'll learn how to do that in this video, so let's get started.

We are going to use feature columns for that. In TensorFlow, a feature column is a configuration class. It doesn't itself hold any data, but it tells our model how to transform the raw data so that it matches the expectation in many ML models that the data is numeric and continuous. If you're working with data that is already numeric (image data, for example), feature columns may not be necessary, but for many real-world applications, data is structured and represents vocabularies or human concepts that we need to transform before we can use them in machine learning models. Feature columns are a great way to do that.

Let's take, for example, our Covertype category, which is an integer between 1 and 7 that represents the type of tree in the region. You'll note that all we've done here is define the type of the feature; we haven't passed any of our data into it yet. It is just a configuration object that tells our model to expect categorical IDs below the upper bound of 8. Now we have to configure how we want to transform our categorical data for use in a model that expects continuous data. Using feature columns, we can trivially build a set of instructions that allows the model to convert the categories into an embedding column, as shown here. Now, we could have done this processing ourselves in our data-parsing function, converting the categorical IDs to a one-hot vector manually. The advantage of using feature columns is that the transformations they encode become part of the model's graph and can therefore be exported with the saved model. So you should push any transformations that you want applied to the data, both during training and at inference time, into feature columns.

We can define columns for each of our features. Data that is already numeric is straightforward: we just use a numeric column. Sometimes, as in the case of the wilderness area data here, data is spread out over a vector, and numeric feature columns allow us to easily capture that relationship with the shape argument, so that our model understands wilderness area as a single length-4 tensor rather than 4 independent tensors. All right, so we configure all of our features, and then what? Well, these become the first layer of our model, using a feature layer.
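To make those steps concrete, here is a minimal sketch of the column definitions described above, using the tf.feature_column API. The feature keys ('Cover_Type', 'Elevation', 'Wilderness_Area') and the embedding dimension are illustrative assumptions, not taken from the video:

```python
import tensorflow as tf

# Covertype is a categorical ID between 1 and 7, so valid IDs are
# anything below the upper bound of 8. This column holds no data; it
# just tells the model how to interpret the 'Cover_Type' feature.
cover_type = tf.feature_column.categorical_column_with_identity(
    'Cover_Type', num_buckets=8)

# Wrapping the categorical column in an embedding column instructs the
# model to learn a dense vector for each category during training.
# The dimension here is an illustrative choice.
cover_embedding = tf.feature_column.embedding_column(cover_type, dimension=10)

# Features that are already numeric just need a numeric column.
elevation = tf.feature_column.numeric_column('Elevation')

# Wilderness area is spread out over a vector of 4 values; the shape
# argument makes the model treat it as one length-4 tensor rather
# than 4 independent tensors.
wilderness_area = tf.feature_column.numeric_column(
    'Wilderness_Area', shape=(4,))
```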
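A minimal sketch of that feature layer, continuing the names above and assuming tf.keras.layers.DenseFeatures, with hypothetical example values:

```python
# Collect the configured columns; the feature layer will apply each
# column's transformation to incoming feature dictionaries.
feature_columns = [cover_embedding, elevation, wilderness_area]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

# Given a batch of raw features (a dict of tensors, as produced by the
# dataset from part one), the layer emits one dense float tensor. The
# values here are made up for illustration.
example_batch = {
    'Cover_Type': tf.constant([1, 4]),
    'Elevation': tf.constant([2596.0, 2804.0]),
    'Wilderness_Area': tf.constant([[1., 0., 0., 0.],
                                    [0., 1., 0., 0.]]),
}
print(feature_layer(example_batch).shape)  # (2, 15): embedding 10 + 1 + 4
```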
When we train our model, this first layer will act like any other Keras layer, but its primary role will be to take in the raw data, including the categorical indices, and transform it into the representations that our neural net is expecting. This layer will also handle creating and training our Covertype embedding. So if you have data that needs transformation before it fits into a model, maybe it's categorical like ours, or even has string names and vocabularies, you can use feature columns to handle those transformations, batch by batch, in TensorFlow, rather than having a whole separate pipeline to do feature transformations in memory. TensorFlow provides many feature columns, and even ways to combine individual columns into more complex representations of the data that your model can learn.

So, before we wrap up, let me quickly show you how this would be a layer in a Keras model, which we'll go into in more detail in the next video. Note that we are using tf.keras here, which implements the Keras API spec but adds additional TensorFlow-specific features on top of it, like support for TensorFlow's eager execution, optimizers, and so on. Since the first thing I want to try is a simple sequence of deep learning layers, Keras is the easiest way to start. We will start with a simple sequential model, but what I want to focus on right now is just this first layer. Our first layer is a feature layer that will do all the data transformation we just discussed and feed the transformed data into the rest of the model (see the sketch below). We'll train the model in part three of this series, where we'll look at feeding in the data and training the model with it, including choosing loss functions and optimizers. It will be right here on the TensorFlow YouTube channel, so don't forget to hit that Subscribe button, and I'll see you there.

[MUSIC PLAYING]
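For reference, a minimal sketch of the simple sequential model described above, with the feature layer first. The hidden-layer sizes and the 40-way soil type output are assumptions for illustration; compiling and training are covered in part three:

```python
# A simple tf.keras sequential model: the feature layer transforms the
# raw feature dictionaries, and dense layers learn from the result.
model = tf.keras.Sequential([
    feature_layer,
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    # Assumed output: one probability per soil type class
    # (40 soil types in the Covertype data set).
    tf.keras.layers.Dense(40, activation='softmax'),
])
```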