Hi, I'm Robert Crowe, and today I'm going to be talking about TensorFlow Extended, also known as TFX, and how it helps you put your amazing machine learning models into production. This is Episode 4 of our five-part series on real-world machine learning in production. We've covered a lot so far in Episodes 1-3, so if you haven't seen those yet, I'd really recommend watching them. In today's episode, we'll be talking about distributed processing and components. Let's get started. ♪ (music) ♪

Let's talk about the components that come standard with TFX. But before we get to the standard components themselves, let's talk about Apache Beam. To handle distributed processing of large amounts of data, especially compute-intensive workloads like ML, you really need a distributed processing pipeline framework like Apache Spark, Apache Flink, or Google Cloud Dataflow. So several of the TFX components run on top of Apache Beam, which is a unified programming model that can run on nearly any execution engine. Beam allows you to use the distributed processing framework you already have, or choose one that you'd like, rather than forcing you to use the one that we chose. Currently, Beam Python can run on Flink, Spark, and Dataflow runners, and new runners are being added. It also includes a local runner, which enables you to run a TFX pipeline during development on your local system, like your laptop.

In the case of the Transform component, for example, we use Beam to perform feature-engineering transformations like creating a vocabulary or doing PCA. Those could run on your Flink or Spark cluster, or on Google Cloud using Dataflow, and because the component uses Beam, you can migrate between them without changing your code. In the case of the Trainer component, we're really just using TensorFlow. Remember when all we were thinking about was training our amazing model? That's the code we're using here. Note that TFX currently supports only TensorFlow 1.X models and TF Estimators. Some components only need Python to do their job; the Pusher component, for example, is one of those.

When we put all of this together and manage it with an orchestrator, this is what it looks like. On the left, we're ingesting our data, and on the right, we're pushing our saved models to one or more deployment targets. Those targets include model repositories like TensorFlow Hub, JavaScript environments using TensorFlow.js, native mobile applications using TensorFlow Lite, server farms using TensorFlow Serving, or all of the above.

So now let's look at each of these components in a little more detail. First, we ingest our input data using ExampleGen. ExampleGen is one of the components that runs on Beam. It reads in data, splits it into training and eval sets, and formats it as tf.Examples. The configuration for ExampleGen is very simple, just two lines of Python.

Next, we have StatisticsGen. StatisticsGen makes a full pass over the data using Beam, one full epoch, and calculates descriptive statistics for each of our features. To do that, it leverages the TensorFlow Data Validation library, which includes support for visualization tools that you can run in a Jupyter notebook. Those let you explore and understand your data and find any issues you may have. This is typical data-wrangling work, the same thing we all do when we're preparing our data to train a model. Here's a better look at the visualization tools.
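Here's a minimal sketch of what generating and viewing those statistics with TensorFlow Data Validation could look like in a notebook; the CSV path is a placeholder, and inside a TFX pipeline StatisticsGen produces the same statistics artifact for you.

```python
import tensorflow_data_validation as tfdv

# Compute descriptive statistics over a CSV copy of the training data.
# The path is hypothetical; in a pipeline, StatisticsGen does this over
# ExampleGen's output instead.
stats = tfdv.generate_statistics_from_csv(data_location='data/taxi/train.csv')

# Render the interactive statistics browser in a Jupyter notebook.
tfdv.visualize_statistics(stats)
```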
Right away, we can see that we might have a problem with our trip_start_hour feature at 6:00 a.m., where we don't have a lot of data to make predictions. Our model's performance at that time of day might not be so great unless we go get some new data.

Our next component, SchemaGen, also uses the TensorFlow Data Validation library. It looks at the statistics generated by StatisticsGen and tries to infer the type of each of our features, including the range of categories for categorical features. We can adjust the schema as needed, like adding new categories that we expect to see.

Our next component, ExampleValidator, takes the statistics from StatisticsGen and the schema, which may be the output of SchemaGen or the result of user curation, and looks for problems. It looks for anomalies, missing values, or values that don't match our schema, and produces a report of what it finds. Remember that we're taking in new data all the time, so we need to be aware of problems when they pop up.

Transform is one of the more complex components and requires a bit more configuration as well as additional code. Transform uses Beam to do feature engineering, applying transformations to your features to improve the performance of your model. For example, Transform can create vocabularies, bucketize values, or run PCA over your input. The code that you write depends on what feature engineering your model and dataset need. Transform makes a full pass over your data, one full epoch, and creates two different kinds of results. For things like the median or standard deviation of a feature, numbers which are the same for all examples, Transform outputs constants. For things like normalizing a value, which will be different for different examples, Transform outputs TensorFlow ops. Transform then emits a TensorFlow graph with those constants and ops. That graph is hermetic: it contains all of the information needed to apply those transformations, and it forms the input stage for your model. That means the same transformations are applied consistently between training and serving. If, instead, you move your model from a training environment into a serving environment or application and try to apply the same feature engineering in both places, you hope the transformations are the same, but sometimes you find that they're not. We call that training/serving skew, and Transform eliminates it by using exactly the same code anywhere you run your model.

Now we're finally ready to train our model, the part of the process that you often think about first when you think about machine learning. Trainer takes in the transform graph and transformed data from Transform, and the schema from SchemaGen, and trains a model using your modeling code. This is normal model training, but when training is complete, Trainer saves two different SavedModels. One is a normal SavedModel that will be deployed to production, and the other is an EvalSavedModel that will be used for analyzing the performance of your model. The configuration for Trainer is what you'd expect: things like the number of steps, and whether or not to use warm_starting. The code that you create for Trainer is your modeling code, so it can be as simple or complex as you need it to be.
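As a rough sketch of that configuration: the module file name, step counts, and the exact argument and output names below are assumptions and vary between TFX releases, and transform and schema_gen stand for the upstream component instances defined earlier in the pipeline.

```python
from tfx.components import Trainer
from tfx.proto import trainer_pb2

# A minimal sketch: 'taxi_utils.py' is a hypothetical module containing your
# modeling code; 'transform' and 'schema_gen' are the upstream Transform and
# SchemaGen component instances.
trainer = Trainer(
    module_file='taxi_utils.py',
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=10000),
    eval_args=trainer_pb2.EvalArgs(num_steps=5000))
```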
To monitor and analyze the training process, you can use TensorBoard, just like you would normally. You can look at the current model-training run or compare the results from multiple model-training runs. This is only possible because of the ML Metadata store, which we talked about in our last episode. TFX makes it fairly easy to do this kind of comparison, which is often revealing.

Now that we've trained our model, how do the results look? The Evaluator component takes the EvalSavedModel that Trainer created, along with the original input data, and does deep analysis using Beam and the TensorFlow Model Analysis library. It's not just looking at the top-level results across the whole dataset; it looks deeper than that, at individual slices of our dataset. That's important, because the experience that each user of our model has depends on their individual data point. Our model may do well over our entire dataset, but if it does poorly on the data point that a user gives it, that user's experience is poor. We'll talk about this more in our next episode.

So, now that we've looked at our model's performance, should we push it to production? Is it better or worse than what we already have in production? We don't want to push a worse model just because it's new. The ModelValidator component uses Beam to do that comparison, using criteria that we define, to decide whether or not to push the new model to production. If ModelValidator decides that our new model is ready for production, then Pusher does the work of actually pushing it to our deployment targets. Those targets could be TensorFlow Lite if we're building a mobile application, TensorFlow.js if we're deploying to a JavaScript environment, TensorFlow Serving if we're deploying to a server farm, or all of the above.
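As a hedged sketch of how that validate-and-push step can be wired together: argument and output names vary across TFX releases, the serving directory is a placeholder, and example_gen and trainer stand for the upstream component instances.

```python
from tfx.components import ModelValidator, Pusher
from tfx.proto import pusher_pb2

# Compare the newly trained model against the last blessed model.
# 'example_gen' and 'trainer' are the upstream component instances.
model_validator = ModelValidator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'])

# Push to a serving directory only if ModelValidator blessed the new model;
# the base_directory path is hypothetical.
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=model_validator.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory='serving_model/taxi')))
```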
So now, hopefully, we've given you a basic understanding of TFX, of ML pipelines in general, and of some of the issues around putting an ML application into production. And remember, TFX is open source, and we want you to help contribute to making TFX better. In our next episode, we'll look at a real-world example of why analyzing model performance is important to your business. For more information on TFX, visit us at tensorflow.org/tfx. And don't forget to comment and like us below, and thanks for watching. ♪ (music) ♪