[MUSIC PLAYING]
TRIS WARKENTIN: Hi, everyone.
I'm Tris Warkentin and I'm a product manager
on TensorFlow Extended, or TFX.
ZHITAO LI: Hi, my name is Zhitao Li.
I'm a tech lead manager from TFX Open Source.
TRIS WARKENTIN: At Google, putting machine learning models
into production is one of the most important things
our engineers and researchers do.
But to achieve this global reach and production readiness,
a reliable production platform is
critical to Google's success.
And that's the goal of TensorFlow Extended--
to create a stable platform for production ML at Google,
and a stable platform for you to build production-ready ML
systems, too.
So how does that work?
Our philosophy is to take modern software engineering
and combine it with what we've learned about machine learning
development at Google.
So what's the difference between writing code and doing machine
learning engineering?
In coding, you might build something
that one person can create end to end.
You might have untested code, undocumented code,
and code that's hard to reuse.
In modern software engineering, we
have solutions for all of those problems-- test-driven
development, modular designs, scalable performance
optimization, and much more.
So how is that different in machine learning development?
Well, a lot of the problems from coding still apply to ML,
but we also have a variety of new problems.
We might have no clear problem statement.
We might need some continuous optimization.
We might need to understand when changes in data
will result in different shapes of our models.
We've been doing this at Google for a long time.
In 2007, we launched Sibyl, which was our scalable platform for production ML here at Google.
And since 2016, we've been working on TFX,
and last year we open sourced it to make
it even easier for you to build production
ML in your platforms.
What does it look like in practice?
Well, TFX is an end-to-end platform that spans everything from best practices to full end-to-end pipelines.
On the best-practices end, you don't even have to use a single line of Google-developed code to get some of the benefits of TFX, and on the other end, the end-to-end pipelines allow you to deploy scalable, production-grade ML.
This is what a pipeline might look like.
On the left side of the screen, you'll see data intake,
then it runs through the pipeline,
doing things like data validation, schema generation,
and much more, in order to make sure that you're doing things in a repeatable, testable, consistent way and producing production-ready ML results.
So it's hard to believe that our end-to-end pipeline offering has only been open source for one year, but we did a lot of interesting things in 2019, including building the foundations of ML Metadata, adding basic TensorFlow 2.0 support for things like Estimators, as well as launching Fairness Indicators and TFMA.
But we're definitely not done.
In 2020, we have a wide variety of interesting developments
coming, including NativeKeras on TFX,
which you'll hear more about from Zhitao later,
as well as a TensorFlow Lite trainer rewrite and warm starting, which can make your machine learning training a hundred times faster by using caching.
But we have something really exciting that we're announcing today-- you may have heard about it from Megan in the keynote-- which is end-to-end ML pipelines.
These are our Cloud AI Platform Pipelines.
We're really excited about these, because they combine
a lot of the best of Google AI Platform with TFX
to create Cloud AI Platform Pipelines, available today.
Please check out our blog for more information.
You should be able to find it if you just Google "Cloud AI Platform Pipelines."
And now, can we please cut to the demo?
ZHITAO LI: So I'll walk you through this demo.
This is the Cloud AI Platform Pipelines page.
On this page, you can see all your existing Cloud AI Platform Pipelines clusters.
We've already created one, and this page can be found under the AI Platform Pipelines tab on the left side of the Google Cloud Console.
If you don't have any pipelines cluster created yet,
you can use the New Instance button to create a new one.
This gives you a one-click experience for creating clusters, which used to be one of the more difficult parts of the job.
You can use the Configure button to create a Cloud AI Platform Pipelines instance on Google Cloud.
This gives you Cloud AI Platform Pipelines on Kubernetes, running on Google's GKE.
You can choose the cluster it will run on, choose the namespace you want to create it in, and choose a name for the instance.
Once you are done, you can simply click Deploy and Done.
Since I already have a cluster, I
will open up the Pipeline dashboard here.
In this page, you can see a list of demo pipelines
that you can play with.
You can see tutorials about creating pipelines and doing
various techniques, and you can use the Pipelines tab
from the left to view all your existing pipelines here.
Since this cluster is newly created, there are no TFX pipelines in it yet.
We are going to use the newly launched TFX templates to create a TFX pipeline in this cluster.
This is the Cloud AI Notebook.
I'm pretty much using this as a Python shell
to write some simple Python commands.
First, you set up your environment, making sure TFX is properly installed together with some other dependencies, making sure you have environment variables like PATH set up, and that the TFX version is up to date.
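For reference, those setup cells might look roughly like this in the notebook; the package list and version check are a sketch, not the exact commands from the demo.

```python
# Hypothetical notebook setup cells (a sketch, not the demo's exact commands).
# Install TFX and the Kubeflow Pipelines SDK, then confirm the TFX version.
!pip install --upgrade tfx kfp

import tfx
print('TFX version:', tfx.__version__)
```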
Now, make sure you have a Google Cloud project configured.
Then configure the Cloud AI Platform Pipelines cluster endpoint-- simply copy it from the cluster URL into the notebook.
We also make sure we create the Google Container Registry image repo so that we can upload our containers to it.
Once that is done, we configure the pipeline name and the project directory.
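Those configuration values might look roughly like the following; every name and endpoint here is a placeholder, not a value from the demo.

```python
import os

# Placeholder values -- substitute your own project, endpoint, and names.
GOOGLE_CLOUD_PROJECT = 'my-gcp-project'
# Copied from the Pipelines cluster URL in the Cloud Console.
ENDPOINT = 'xxxxxxxx-dot-us-central1.pipelines.googleapis.com'
# Container Registry repo the pipeline image will be uploaded to.
CUSTOM_TFX_IMAGE = 'gcr.io/' + GOOGLE_CLOUD_PROJECT + '/tfx-pipeline'
PIPELINE_NAME = 'my_pipeline'
PROJECT_DIR = os.path.join(os.path.expanduser('~'), PIPELINE_NAME)
```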
Now we can use the template creation command to scaffold a new pipeline from the template.
Once it is created, I'm going to show the content generated by the template.
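The template step itself is a TFX CLI call from the notebook; here is a sketch, with flag spellings that differ slightly between TFX releases.

```python
# Copy the built-in taxi template into the project directory
# (flag names are illustrative and vary across TFX releases).
!tfx template copy \
  --model=taxi \
  --pipeline_name={PIPELINE_NAME} \
  --destination_path={PROJECT_DIR}
```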
As you can see, there is pipeline code in the pipeline.py file.
This includes our classic taxi pipeline from TFX, with all the components necessary to do production machine learning.
There is also configs.py, with some configuration related to Google Cloud as well as some configuration for TFX itself.
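To give a feel for what pipeline.py contains, here is a heavily condensed sketch of that kind of component wiring; the real template has many more components and options, and exact signatures vary by TFX version.

```python
from tfx.components import CsvExampleGen, SchemaGen, StatisticsGen
from tfx.orchestration import pipeline


def create_pipeline(pipeline_name: str, pipeline_root: str, data_path: str):
    # Ingest CSV data, compute statistics, and infer a schema.
    example_gen = CsvExampleGen(input_base=data_path)
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen],
    )
```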
Once that is done, we enter the template directory, making sure all the template files are there.
You can even run some pre-generated unit tests on the features to make sure the configuration looks right.
Once that's done, you can then use the TFX CLI to create a TFX pipeline on the Cloud AI Platform Pipelines page.
This will build a container image with all your code and dependencies, upload it to GCR, and then create a pipeline using this container image on the Pipelines page.
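That CLI invocation looks roughly like the sketch below; again, the flag spellings are release-dependent rather than the exact command from the demo.

```python
# Build the container image, push it to GCR, and register the pipeline
# with the Kubeflow Pipelines endpoint (flags are illustrative).
!tfx pipeline create \
  --engine=kubeflow \
  --pipeline_path=kubeflow_dag_runner.py \
  --endpoint={ENDPOINT} \
  --build_target_image={CUSTOM_TFX_IMAGE}
```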
As we see, the pipeline compiles and the creation is successful,
and we go back to the Pipeline page.
Click on Refresh.
Boom-- we have our new pipeline.
Now, if we click through the pipeline,
you are going to see all the TFX components here
readily available.
We can create a test run of this pipeline and click Run.
We are going to see each of the components gradually show up on the web page as they run.
The first component should be ExampleGen.
So it has to be ExampleGen. Yes, it is there.
This component has started running.
You can click on it.
In the tabs, you can look at the artifacts, the inputs and outputs, which Kubernetes volumes are used for the component, the manifest, and you can even inspect the logs of a component run.
After ExampleGen, the pipeline runs StatsGen and SchemaGen.
And now the pipeline enters feature transform and example validation at the same time.
So now all the data preparation is finished.
The pipeline enters into a training stage,
which is producing a TensorFlow model.
If we click on the trainer component,
we can even inspect these logs.
Now, once trainer is complete, we
do some model validation and evaluation
using TFX components.
And once all the model evaluation is done,
we use Pusher to push the generated model to an external serving system.
So you have a model ready to use in production.
You can also use the tabs on the left
to navigate on existing experiments, artifacts,
and executions.
We are going to take a look at the artifacts generated
from this pipeline using the Artifacts tab.
So here, you can see you have a pipeline.
If you click on the Model Output Artifacts from Trainer,
that represents a TensorFlow model.
This is the artifact's ML Metadata.
We can see it's a model artifact produced by the Trainer.
And this is a lineage view of the model-- it explains which component produced this model from which input artifacts, how this artifact is further used by other components that take it as input, and what outputs are generated by those downstream components.
OK, that's all of the demo.
Now I'm going to talk about another important development
in TFX, which is supporting NativeKeras.
For those of you who are not very familiar with TensorFlow 2, let me recap a little of the history.
TensorFlow 2 was released in Q3 2019 with a focus on providing a more Pythonic experience.
That includes supporting the Keras API, eager execution by default, and a more Pythonic execution model.
So this is a timeline of how TFX Open Source has been working
on supporting all of this.
We released the first version of TFX Open Source at the last Dev Summit, which only supported Estimator-based TensorFlow training code.
Back at the last TensorFlow World, TensorFlow 2.0 was launched and we started working on supporting the Keras API.
Back in Q4 2019, we released basic TensorFlow 2.0 support in TFX 0.20.
In that version, we supported the TensorFlow 2.0 package end to end, with limited Keras support through the Keras Estimator.
And now, in the latest TFX release, I'm happy to say we are releasing experimental support for NativeKeras training end to end.
So what does that mean?
Let's take a deeper look.
For data ingestion and analysis, everything pretty much remains the same, because TFDV, our data analysis library, is model agnostic.
For feature transform, we added a new Keras-compatible layer in the TFT library so that we can transform features in a Keras model.
This layer will also take care of asset management and model exporting.
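As an illustration of how that layer is typically used, here is a minimal sketch assuming a Transform component has already produced its output; the helper name is made up, not from the talk.

```python
import tensorflow as tf
import tensorflow_transform as tft


def make_serving_fn(model: tf.keras.Model,
                    tf_transform_output: tft.TFTransformOutput):
    # Keep a reference on the model so the layer's assets are tracked and
    # exported together with the SavedModel.
    model.tft_layer = tf_transform_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        # Parse raw tf.Examples, run the TF Transform graph, then the model.
        feature_spec = tf_transform_output.raw_feature_spec()
        parsed = tf.io.parse_example(serialized_tf_examples, feature_spec)
        transformed = model.tft_layer(parsed)
        return model(transformed)

    return serve_tf_examples_fn
```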
For training, we created a new generic Trainer executor which can be used to run any TensorFlow training code that exports a SavedModel.
This also covers training that uses the NativeKeras API.
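To illustrate, wiring the Trainer to the generic executor looks roughly like this; the transform and schema_gen handles are assumed to come from earlier in the pipeline, and exact keyword arguments vary by TFX release.

```python
from tfx.components import Trainer
from tfx.components.base import executor_spec
from tfx.components.trainer.executor import GenericExecutor
from tfx.proto import trainer_pb2

# The module file must define run_fn(fn_args) and save a SavedModel.
trainer = Trainer(
    module_file='module.py',
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),
    eval_args=trainer_pb2.EvalArgs(num_steps=100),
)
```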
For model analysis and validation, we created a new Evaluator component, which combines both evaluation and model validation capabilities.
This new component supports NativeKeras out of the box.
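A sketch of the combined Evaluator might look like this; the label key, metric, and threshold are illustrative, and the upstream example_gen and trainer handles are assumed from the pipeline above.

```python
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

# Evaluate the model and gate it on a minimum accuracy threshold.
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='tips')],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='BinaryAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.6}))),
        ])
    ])

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    eval_config=eval_config,
)
```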
And finally, when it gets to model serving validation, we will release a new component called InfraValidator.
This component can be used to verify inference requests against TensorFlow Serving binaries, to make sure any exported TensorFlow model can be used correctly in production, including anything trained with NativeKeras.
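A rough sketch of how such a component is wired is shown below; the proto fields are taken from later TFX releases and should be treated as illustrative rather than the exact API announced here.

```python
from tfx.components import InfraValidator
from tfx.proto import infra_validator_pb2

# Check that the exported model actually loads in a TF Serving binary
# running in a local Docker container before it gets pushed anywhere.
infra_validator = InfraValidator(
    model=trainer.outputs['model'],
    serving_spec=infra_validator_pb2.ServingSpec(
        tensorflow_serving=infra_validator_pb2.TensorFlowServing(
            tags=['latest']),
        local_docker=infra_validator_pb2.LocalDockerConfig(),
    ),
)
```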
Now, let's take a look at a case from one
of our partners, Concur Labs.
Concur Labs is a team in SAP Concur that explores new ideas and builds prototypes.
They help all the developers in SAP Concur use ML effectively in their solutions.
To do this, they need a modern machine learning pipeline that scales securely and interfaces with their data platform.
They find that TFX with the NativeKeras trainer allows them to do more.
One of the success stories here is efficient BERT deployment.
With TF and Keras, they can create simple models by using TF Transform for the data pre-processing, using state-of-the-art models from TensorFlow Hub, and creating simple model deployments with TensorFlow Serving.
The applications here cover sentiment analysis and some question-and-answer type problems.
They also created TFX pipelines that build the pre-processing steps into the exported BERT models.
We have just published a blog post on this, so feel free to check out the TFX blog.
Another success story is TFX pipelines for TFLite models.
This means they can create a TFX pipeline that produces two models-- one in SavedModel format and one in the TensorFlow Lite format.
This simplifies their pipeline building process and reduces manual conversion steps.
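As a rough illustration of that dual-export idea (not Concur Labs' actual code), a training step can save a SavedModel and also convert it to TFLite in one pass; paths here are placeholders.

```python
import tensorflow as tf


def export_both(model: tf.keras.Model, saved_model_dir: str, tflite_path: str):
    # Standard SavedModel export for TensorFlow Serving.
    model.save(saved_model_dir, save_format='tf')

    # Convert the same SavedModel to TensorFlow Lite for on-device serving.
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    tflite_model = converter.convert()
    with tf.io.gfile.GFile(tflite_path, 'wb') as f:
        f.write(tflite_model)
```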
One of the most important things for the future of our ecosystem
is great partners, like all of you on the livestream.
We hope you join us to help make TFX work for your use case.
Some of the areas where we would love your help include portability, on-prem and multi-cloud support, Spark/Flink/HDFS integration, as well as data and model governance.
One of the great things about TFX is the wide diversity of ways it can be used.
We have an exciting guest-- Marcel from Airbus.
He could not be here physically today,
but he recorded a video to talk about one
of the interesting ways to use TFX, which is TFX in space.
For more, here is Marcel.
[MUSIC PLAYING]
MARCEL RUMMENS: This is one of the most important moments
for everyone involved in manned spaceflight.
This is why we at Airbus are working hard
and keep on innovating to ensure everyone on that space
station is safe and can return back to Earth,
to friends and family.
Hello everyone, my name is Marcel Rummens,
and I have the honor to tell you how Airbus uses anomaly
detection with TFX to ensure everyone's safety onboard
the International Space Station.
You might ask yourself, Airbus, space--
how does that fit?
Well, Airbus actually has many different products,
like commercial aircraft, helicopters, satellites,
and we are also involved in manned spaceflight.
For example, the Columbus Module,
which is part of the ISS.
It was built and designed by Airbus
and finally launched in 2008.
It is used for experiments in space--
for example, in the field of biology,
chemistry, material science, or medicine.
As you can imagine, such a module produces a lot of data.
To give you an idea, we have recorded
between 15,000 and 20,000 parameters per second
for the last 10 years.
And every second, we receive another 17,000 parameters.
So we are talking about trillions and trillions
of data points.
But what does this data actually represent?
Well, it could be any kind of sensor data,
and we want to detect anomalies in it to prevent accidents.
Let me give you an example.
If you have a power surge in your electrical system,
this could cause a fire.
If you have a drop in temperature or pressure,
this could mean that you have a hole in your module
that you need to fix.
So it is very important that we detect these anomalies
and fix them before something can happen.
Because just imagine for a second
it is you up there, 400 kilometers above earth,
and you only have these few inches of metal and plastic
to protect you from space.
It's a cold and dark place without any oxygen.
If something would have happened to this little protection
layer of yours, you would be in a life-threatening situation.
This is why we already have countless autonomous systems
on board the spacecraft to prevent
these kinds of accidents.
But with more and more automation,
we can handle more complex data streams,
increasing the precision of our predictions.
But it is not about safety alone.
It is also about plannability and predictive maintenance,
because the sooner we know that a certain part needs
replacement, the sooner we can schedule and plan
a supply mission, decreasing the cost of said launches.
How does this work?
Well, right now this is a manual process.
Our automation works in parallel with the engineers.
Let me run you through it.
We have our database on premise, storing all these trillions
and trillions of data points.
And then we use a Spark cluster to extract the data
and remove secret parts from it, because some things we are not
allowed to upload.
Then we use TFX on KubeFlow to train the model.
First, we use TF Transform to prepare the data,
and then we use TF Estimator to train
a model which tries to represent the normal state
of a subsystem--
so a state without any anomalies.
Once we've done enough hyperparameter tuning and we are happy with the model, we deploy it using TF Serving.
Now, here comes the interesting part.
We have the ISS streaming data to ground stations on Earth, which stream the data to our NiFi cluster running in our data center.
Here we remove the secret part of the data
again and then stream it into the cloud.
In the cloud, we have a custom Python application
running on Kubernetes which does the actual anomaly detection.
It will call the model.
The model will try to predict the current state of a subsystem based on whatever it has seen in the past.
And then, using its prediction and the reality coming from the space station, we can calculate a so-called reconstruction error.
If this error is above a certain threshold, we can use it as an indicator for an anomaly.
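As a toy illustration of that thresholding step (the threshold value and array shapes are made up, not Airbus's actual numbers):

```python
import numpy as np


def is_anomalous(window: np.ndarray, reconstruction: np.ndarray,
                 threshold: float = 0.05) -> bool:
    # Reconstruction error: mean squared difference between the real telemetry
    # window and what the model predicts a "normal" window should look like.
    error = float(np.mean(np.square(window - reconstruction)))
    return error > threshold
```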
Now, if we have an anomaly, we would create a report
and then compare it against a database of previously observed anomalies, because if something like this has happened in the past,
we can reuse the information we have on it.
And then the final step is to hand this over
to an engineer, who will then fix the problem.
And this is very, very important to us,
because we are talking about human life on that space
station, so we want a human to make the final decision
in this process.
But this is TF Dev Summit, so I want
to have at least one slide about our model architecture.
We are using an LSTM-based autoencoder with dropout, and we replaced the inner layers-- so the layers between encoder and decoder-- with LSTMs instead of dense layers, because our tests have shown that sequences just represent the kind of information we have better, producing fewer false positives.
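A rough Keras sketch of that kind of architecture, with made-up layer sizes and not Airbus's actual model, might look like this:

```python
import tensorflow as tf


def build_lstm_autoencoder(timesteps: int, n_features: int) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(timesteps, n_features))
    # Encoder: compress the telemetry window into a latent sequence.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
    x = tf.keras.layers.Dropout(0.2)(x)
    # Inner layers are LSTMs rather than dense layers, as described in the talk.
    x = tf.keras.layers.LSTM(16, return_sequences=True)(x)
    # Decoder: reconstruct the original window from the latent sequence.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(n_features))(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mse')
    return model
```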
What kind of impact did this project have?
Well, we have been able to reduce our cost by about 44%.
Some of this is projected because, as I said earlier,
we are running in parallel right now.
But the cost benefit mainly comes from the fact that our engineers can dedicate more and more time to more important, less repetitive tasks-- tasks for which you really need that human creativity and intuition to fix the problem.
Also, our response time has decreased.
In the past, it could take hours, days, or sometimes even
weeks to find and fix a problem.
Now, we are talking about minutes, maybe hours.
Another benefit is that now we have a central storage
of all anomalies that ever happened,
plus how they have been fixed.
This is not just good for our customer, because we
have better documentation, but also great for us
because, for example, it simplifies the onboarding
process for new colleagues.
Next steps, looking into the future.
We want to extend the solution to more subsystems and more
products, like Bartolomeo, which is the latest addition to the Columbus Module and is scheduled to be launched later this month.
But overall, these kinds of technologies
are very important for future space missions,
because as plans to go to moon and Mars become more and more
concrete, we need new ways and new technologies
to tackle problems like latency and the limited amount
of computational hardware onboard the spacecraft.
Coming back to TFX, we want to integrate more components--
for example, the model validator--
because now we have more labeled data
that allows us to actually do this automatic model
validation.
And finally, the migration to TF 2,
which is ongoing but very important to us, because
of course, we want to keep up and use the latest
version of TensorFlow.
If you have any questions or want
to learn more about the challenges
we faced during that project, have a look at the Google blog.
We will publish a blog post in the coming weeks
that goes into more detail than I could do in 10 minutes.
Before I close, I want to especially thank
[? Philip ?] [INAUDIBLE], Jonas Hansen, as well as everyone else on the ISS analytics team for their incredible work and support,
as well as anyone else who helped to prepare this talk.
If you have any questions, feel free to find me
on the internet.
Write me those questions--
I'm happy to answer them.
And also, I will be available in the Q&A
section of this livestream.
Thank you very much and goodbye.
ZHITAO LI: Thank you, Marcel, for the great video.
And if you want to learn more about TFX, please check out our website, which has all the blog posts, user guides, tutorials, and API docs.
Please also feel free to check out our GitHub repo with all the source code, and engage with us on the Google Group.
[MUSIC PLAYING]