TFX: Production ML with TensorFlow in 2020 (TF Dev Summit '20)

  • [MUSIC PLAYING]

  • TRIS WARKENTIN: Hi, everyone.

  • I'm Tris Warkentin and I'm a product manager

  • on TensorFlow Extended, or TFX.

  • ZHITAO LI: Hi, my name is Zhitao Li.

  • I'm a tech lead manager on TFX Open Source.

  • TRIS WARKENTIN: At Google, putting machine learning models

  • into production is one of the most important things

  • our engineers and researchers do.

  • But to achieve this global reach and production readiness,

  • a reliable production platform is

  • critical to Google's success.

  • And that's the goal of TensorFlow Extended--

  • to create a stable platform for production ML at Google,

  • and a stable platform for you to build production-ready ML

  • systems, too.

  • So how does that work?

  • Our philosophy is to take modern software engineering

  • and combine it with what we've learned about machine learning

  • development at Google.

  • So what's the difference between writing code and doing machine

  • learning engineering?

  • In coding, you might build something

  • that one person can create end to end.

  • You might have untested code, undocumented code,

  • and code that's hard to reuse.

  • In modern software engineering, we

  • have solutions for all of those problems-- test-driven

  • development, modular designs, scalable performance

  • optimization, and much more.

  • So how is that different in machine learning development?

  • Well, a lot of the problems from coding still apply to ML,

  • but we also have a variety of new problems.

  • We might have no clear problem statement.

  • We might need some continuous optimization.

  • We might need to understand when changes in data

  • will result in different shapes of our models.

  • We've been doing this at Google for a long time.

  • In 2007, we launched Sibyl, which was our scalable platform for production ML here at Google.

  • And since 2016, we've been working on TFX,

  • and last year we open sourced it to make

  • it even easier for you to build production

  • ML in your platforms.

  • What does it look like in practice?

  • Well, TFX is an end-to-end platform that spans everything from best practices through to end-to-end pipelines.

  • With the best practices alone, you don't even have to use a single line of Google-developed code to get some of the best of TFX; at the other end, the end-to-end pipelines let you deploy scalable, production-grade ML.

  • This is what a pipeline might look like.

  • On the left side of the screen, you'll see data ingestion; the data then runs through the pipeline, with steps like data validation, schema generation, and much more, to make sure you're doing things in a repeatable, testable, consistent way and producing production-quality ML results.
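
For illustration, here is a minimal sketch of how the data steps of such a pipeline might be declared in TFX's Python DSL. This is not code from the talk: the paths are hypothetical placeholders, and the imports reflect the 0.21-era API, which varies by release.

```python
# Minimal sketch of a TFX pipeline definition (0.21-era API; details vary
# by release). The data and root paths are hypothetical placeholders.
from tfx.components import CsvExampleGen, ExampleValidator, SchemaGen, StatisticsGen
from tfx.orchestration import pipeline
from tfx.utils.dsl_utils import external_input

def create_pipeline() -> pipeline.Pipeline:
    # Ingest raw CSV files and emit them as TF Examples.
    example_gen = CsvExampleGen(input=external_input('/path/to/data'))
    # Compute summary statistics over the ingested data.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    # Infer a schema (types, ranges, valency) from those statistics.
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
    # Validate incoming data against the schema to catch anomalies.
    example_validator = ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])

    return pipeline.Pipeline(
        pipeline_name='my_pipeline',
        pipeline_root='/path/to/pipeline_root',
        components=[example_gen, statistics_gen, schema_gen, example_validator])
```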

  • It's hard to believe that our end-to-end pipeline offering has only been open source for one year, but we did a lot of interesting things in 2019, including building the foundations of ML Metadata, basic TensorFlow 2.0 support for things like Estimators, and the launches of Fairness Indicators and TFMA.

  • But we're definitely not done.

  • In 2020, we have a wide variety of interesting developments coming, including native Keras on TFX, which you'll hear more about from Zhitao later, as well as a TensorFlow Lite trainer rewrite and warm starting, which can make your machine learning training a hundred times faster by using caching.

  • But we have something really exciting that we're announcing today, which you may have heard about from Megan in the keynote: end-to-end ML pipelines.

  • These are our Cloud AI Platform Pipelines.

  • We're really excited about them, because they combine a lot of the best of Google AI Platform with TFX, and they're available today.

  • Please check out our blog for more information.

  • You should be able to find it if you just Google "Cloud AI Platform Pipelines."

  • And now, can we please cut to the demo?

  • ZHITAO LI: So let me walk you through this demo.

  • This is the Cloud AI Platform Pipelines page.

  • On this page, you can see all your existing Cloud AI Platform Pipelines clusters.

  • We've already created one, and this page can be found under the AI Platform Pipelines tab on the left side of the Google Cloud Console.

  • If you don't have any pipelines cluster created yet,

  • you can use the New Instance button to create a new one.

  • This gives you a one-click experience for creating clusters, which used to be one of the difficult jobs.

  • You can use the Configure button to create a Cloud AI Platform Pipelines instance on Google Cloud.

  • This deploys Cloud AI Platform Pipelines on Kubernetes, running on Google's GKE.

  • You can choose the GKE cluster it will run on, the namespace you want to create the instance inside, and a name for the cluster.

  • Once you are done, you can simply click Deploy and Done.

  • Since I already have a cluster, I

  • will open up the Pipeline dashboard here.

  • On this page, you can see a list of demo pipelines that you can play with and tutorials about creating pipelines and applying various techniques, and you can use the Pipelines tab on the left to view all your existing pipelines.

  • Since this cluster is newly created, there are no TFX pipelines in it yet.

  • We are going to use the newly launched TFX templates to create a pipeline in this cluster.

  • This is a Cloud AI Platform Notebook.

  • I'm pretty much using it as a Python shell to run some simple Python commands.

  • First, you set up your environment, making sure TFX is properly installed together with some other dependencies, that environment variables like PATH are set up, and that the TFX version is up to date, as sketched below.
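
In notebook form, that setup might look roughly like this (a sketch; the `!` prefix runs shell commands from the notebook, and package versions are illustrative, not prescriptive):

```python
# Sketch of notebook environment setup; exact packages/versions may differ.
!pip install --upgrade tfx kfp

import os

# Make sure user-installed CLI binaries (like the tfx command) are on PATH.
os.environ['PATH'] = os.path.expanduser('~/.local/bin') + ':' + os.environ['PATH']

# Confirm the installed TFX version is up to date.
import tfx
print('TFX version:', tfx.__version__)
```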

  • Next, make sure you have a Google Cloud project configured, and configure the Cloud AI Platform Pipelines cluster endpoint, simply copying it from the dashboard URL into the notebook.

  • We also make sure we create the Google Container Registry image repo so that we can upload our container images to it.

  • Once that is done, we configure the pipeline name and the project directory.

  • Now we can use the template tooling to create a new pipeline from a template, as sketched below.
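
A sketch of the template step with the TFX CLI. The pipeline name and directory are hypothetical, and flag spellings (dashes vs. underscores) have varied across TFX releases:

```python
import os

# Hypothetical names for this walkthrough.
PIPELINE_NAME = 'my_pipeline'
PROJECT_DIR = os.path.join(os.path.expanduser('~'), PIPELINE_NAME)

# Copy the predefined 'taxi' template that ships with TFX into a new
# project directory.
!tfx template copy --model=taxi --pipeline_name={PIPELINE_NAME} --destination_path={PROJECT_DIR}
```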

  • Once it's created, I'm going to show the content generated by the template.

  • As you can see, the pipeline code is in the pipeline.py file.

  • This includes our classic taxi pipeline from TFX, with all the components necessary to do production machine learning.

  • There is also configs.py, with some configuration related to Google Cloud as well as some configuration for TFX itself.

  • Once that is done, we enter the template directory, making sure all the template files are there.

  • You can even run some pre-generated unit tests on the features to make sure the configuration looks right.

  • Once that's done, you can use the TFX CLI to create a TFX pipeline on the Cloud AI Platform Pipelines cluster.

  • This will build a container image with all your code and dependencies, upload it to GCR, then create a pipeline using this container image on the Pipelines page; a sketch of the commands follows below.
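
A sketch of those CLI steps, continuing from the template step above. The endpoint and image path are placeholders you would copy from your own dashboard URL and GCR project; the runner file name is the one the 0.21-era taxi template generates, and flag spellings have varied across TFX releases:

```python
PIPELINE_NAME = 'my_pipeline'  # same hypothetical name as above

# Placeholders: copy the endpoint from your pipelines dashboard URL, and
# point the image at a repo in your own GCR project.
ENDPOINT = 'XXXXXXXX-dot-us-central1.pipelines.googleusercontent.com'
CUSTOM_TFX_IMAGE = 'gcr.io/my-project/tfx-pipeline'

# Build the container image, upload it to GCR, and create the pipeline.
!tfx pipeline create --engine=kubeflow --pipeline_path=kubeflow_dag_runner.py --endpoint={ENDPOINT} --build_target_image={CUSTOM_TFX_IMAGE}

# Kick off a run of the newly created pipeline.
!tfx run create --engine=kubeflow --pipeline_name={PIPELINE_NAME} --endpoint={ENDPOINT}
```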

  • As we see, the pipeline compiles and the creation is successful,

  • and we go back to the Pipeline page.

  • Click on Refresh.

  • Boom-- we have our new pipeline.

  • Now, if we click through the pipeline,

  • you are going to see all the TFX components here

  • readily available.

  • We can create a test run of this pipeline and click on the run.

  • We are going to see each of the components gradually show up on the web page as they run.

  • The first component should be ExampleGen, and yes, there it is.

  • This component has started running.

  • You can click on it.

  • In the tabs, you can look at the artifacts, the inputs and outputs, which Kubernetes volumes were used for the component, its manifest, and you can even inspect the logs of a component run.

  • After ExampleGen come StatisticsGen and SchemaGen.

  • And now the pipeline enters the feature transform and example validation steps at the same time.

  • So now all the data preparation is finished, and the pipeline enters the training stage, which produces a TensorFlow model.

  • If we click on the Trainer component, we can even inspect its logs.

  • Now, once the Trainer is complete, we do some model validation and evaluation using TFX components.

  • And once all the model evaluation is done, we use Pusher to push the generated model onto an external serving system.

  • So you now have a model ready to use in production.

  • You can also use the tabs on the left to navigate your existing experiments, artifacts, and executions.

  • We are going to take a look at the artifacts generated

  • from this pipeline using the Artifacts tab.

  • So here, you can see your pipeline.

  • If you click on the model output artifact from Trainer, which represents a TensorFlow model, you see the artifact's entry in ML Metadata.

  • We can see it's a model artifact produced by Trainer.

  • And this is a lineage view of the model: it explains which components produced this model from which input artifacts, how the artifact is further used by other components that take it as input, and what outputs are generated by those downstream components.

  • OK, that is all of the demo.

  • Now I'm going to talk about another important development in TFX, which is supporting native Keras.

  • For those of you who are not very familiar with TensorFlow 2, let me recap a little of the history.

  • TensorFlow 2 was released in Q3 2019 with a focus on providing a more Pythonic experience.

  • That includes supporting the Keras API, eager execution by default, and a more Pythonic style of execution.

  • So this is a timeline of how TFX Open Source has been working

  • on supporting all of this.

  • We released the first version of TFX Open Source at the last Dev Summit, which only supported Estimator-based TensorFlow training code.

  • Then, at last TensorFlow World, TensorFlow 2.0 was launched and we started working on supporting the Keras API.

  • Back in Q4 2019, we released basic TensorFlow 2.0 support in TFX 0.20.

  • In that version, we supported the TensorFlow 2.0 package end to end, with limited Keras support through the Keras-to-Estimator conversion.

  • And now, in the latest TFX release, we are releasing experimental support for native Keras training, end to end.

  • So what does that mean?

  • Let's take a deeper look.

  • For data ingestion and analysis, everything pretty much remains the same, because TFDV, our data analysis library, is model agnostic.

  • For feature transformation, we added a new Keras-compatible layer to the TFT library so that we can transform features in a Keras model.

  • This layer also takes care of asset management and model exporting, as sketched below.
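
A sketch of how that layer is typically used, assuming a Transform step has already written its output. The path is hypothetical, and the serving-signature pattern is illustrative of the common usage, not code from the talk:

```python
import tensorflow as tf
import tensorflow_transform as tft

# Load the output of a previously run Transform step (path is hypothetical).
tf_transform_output = tft.TFTransformOutput('/path/to/transform_output')

# Keras-compatible layer that applies the learned transformations and keeps
# the transform graph's assets (e.g. vocabularies) attached to the model.
transform_layer = tf_transform_output.transform_features_layer()

def make_serving_fn(model):
    """Build a serving function that goes from raw tf.Examples to predictions."""
    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        # Parse raw examples, apply the TFT transformations, then predict.
        feature_spec = tf_transform_output.raw_feature_spec()
        parsed = tf.io.parse_example(serialized_tf_examples, feature_spec)
        transformed = transform_layer(parsed)
        return model(transformed)
    return serve_tf_examples_fn
```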

  • For training, we created a new generic Trainer executor, which can be used to run any TensorFlow training code that exports a SavedModel.

  • This also covers training with the native Keras API; a sketch of the wiring follows below.
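
A sketch of pointing the Trainer at the generic executor (0.21-era API, which later moved modules around; the module file, upstream component variables, and step counts are assumptions for illustration):

```python
from tfx.components import Trainer
from tfx.components.base import executor_spec
from tfx.components.trainer.executor import GenericExecutor
from tfx.proto import trainer_pb2

# The generic executor invokes a user-supplied run_fn() in module_file; that
# function is free to train a native Keras model and export a SavedModel.
trainer = Trainer(
    module_file='models/keras_model.py',  # hypothetical; must define run_fn()
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=10000),
    eval_args=trainer_pb2.EvalArgs(num_steps=5000))
```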

  • For model analysis and validation, we created a new Evaluator component, which combines both evaluation and model validation capabilities.

  • This new component supports native Keras out of the box, as sketched below.
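
A sketch of the combined component, with validation expressed as a TFMA metric threshold (the label key and the 0.8 accuracy bound are illustrative assumptions):

```python
import tensorflow_model_analysis as tfma
from tfx.components import Evaluator

# Evaluate overall metrics and gate the model: candidates whose binary
# accuracy is below 0.8 fail validation (threshold is illustrative).
eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key='label')],
    slicing_specs=[tfma.SlicingSpec()],  # overall, unsliced metrics
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='BinaryAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.8})))
        ])
    ])

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    eval_config=eval_config)
```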

  • And finally, when it gets to model serving validation, we are releasing a new component called InfraValidator.

  • This component can be used to verify, with real inference requests against actual TensorFlow Serving binaries, that any exported TensorFlow model can be used correctly in production, including anything trained with native Keras.
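
A sketch of the component's wiring, following the shape described in the TFX guide (the serving binary tag and the choice of local Docker are illustrative):

```python
from tfx.components import InfraValidator
from tfx.proto import infra_validator_pb2

# Launch a real TensorFlow Serving binary in a local Docker container and
# verify that the candidate model actually loads and answers requests.
infra_validator = InfraValidator(
    model=trainer.outputs['model'],
    examples=example_gen.outputs['examples'],
    serving_spec=infra_validator_pb2.ServingSpec(
        tensorflow_serving=infra_validator_pb2.TensorFlowServing(
            tags=['latest']),
        local_docker=infra_validator_pb2.LocalDockerConfig()),
    request_spec=infra_validator_pb2.RequestSpec(
        tensorflow_serving=infra_validator_pb2.TensorFlowServingRequestSpec()))
```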

  • Now, let's take a look at a case from one

  • of our partners, Concur Labs.

  • Concur Labs is a team in SAP Concur that explores new ideas and builds prototypes.

  • They help all the developers in SAP Concur use ML effectively in their solutions.

  • To do this, they need a modern machine learning pipeline that scales securely alongside their data platform.

  • They found that TFX with the native Keras trainer allows them to do more.

  • One of the success stories here is efficient BERT deployment.

  • With TF 2 and Keras, they can create simple models, using TF Transform for data pre-processing, state-of-the-art models from TensorFlow Hub, and simple model deployments with TensorFlow Serving.

  • The applications here cover sentiment analysis and question-and-answer type problems.

  • They also created TFX pipelines that build the pre-processing steps into the exported BERT models, along the lines sketched below.
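
Conceptually, the model side of such a setup looks something like this. This is a sketch, not SAP Concur's actual code: the TF Hub handle is a placeholder (output signatures depend on the encoder you pick), and the sequence length is illustrative:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Placeholder: substitute a real BERT encoder handle from tfhub.dev.
BERT_HANDLE = 'https://tfhub.dev/<bert-encoder-of-your-choice>'
SEQ_LEN = 128  # illustrative maximum sequence length

def build_classifier() -> tf.keras.Model:
    # BERT consumes token ids, an attention mask, and segment ids; producing
    # these inside TF Transform keeps preprocessing attached to the model.
    word_ids = tf.keras.layers.Input(shape=(SEQ_LEN,), dtype=tf.int32, name='input_word_ids')
    mask = tf.keras.layers.Input(shape=(SEQ_LEN,), dtype=tf.int32, name='input_mask')
    segment_ids = tf.keras.layers.Input(shape=(SEQ_LEN,), dtype=tf.int32, name='segment_ids')

    bert = hub.KerasLayer(BERT_HANDLE, trainable=True)
    # Classic TF2 BERT SavedModels return (pooled_output, sequence_output).
    pooled_output, _ = bert([word_ids, mask, segment_ids])
    prediction = tf.keras.layers.Dense(1, activation='sigmoid')(pooled_output)
    return tf.keras.Model(inputs=[word_ids, mask, segment_ids], outputs=prediction)
```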

  • We have just published a blog post on this, so feel free to check it out on the TFX blog.

  • Another success story is TFX pipelines for TFLite models.

  • They can create a TFX pipeline that produces two models: one in SavedModel format and one in TensorFlow Lite format.

  • This simplifies the pipeline building process and reduces manual conversion steps; conceptually, the conversion is sketched below.
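
TFX performs this through a rewriter inside the trainer; under the hood, the conversion is conceptually the same as this plain-TensorFlow sketch (paths are hypothetical):

```python
import tensorflow as tf

# Convert an exported SavedModel into a TensorFlow Lite flatbuffer.
converter = tf.lite.TFLiteConverter.from_saved_model('/path/to/saved_model')
tflite_model = converter.convert()

# Write the second, mobile-friendly artifact alongside the SavedModel.
with open('/path/to/model.tflite', 'wb') as f:
    f.write(tflite_model)
```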

  • One of the most important things for the future of our ecosystem

  • is great partners, like all of you on the livestream.

  • We hope you join us to help make TFX work for your use case.

  • Some of the areas where we would love your help include portability (on-prem and multi-cloud), Spark/Flink/HDFS integration, and data and model governance.

  • One of the great things about TFX is the wide diversity of ways it can be used.

  • We have an exciting guest, Marcel from Airbus.

  • He could not be here physically today, but he recorded a video to talk about one of the most interesting ways to use TFX, which is TFX in space.

  • For more, here is Marcel.

  • [MUSIC PLAYING]

  • MARCEL RUMMENS: This is one of the most important moments

  • for everyone involved in manned spaceflight.

  • This is why we at Airbus are working hard

  • and keep on innovating to ensure everyone on that space

  • station is safe and can return back to Earth,

  • to friends and family.

  • Hello everyone, my name is Marcel Rummens,

  • and I have the honor to tell you how Airbus uses anomaly

  • detection with TFX to ensure everyone's safety onboard

  • the International Space Station.

  • You might ask yourself, Airbus, space--

  • how does that fit?

  • Well, Airbus actually has many different products, like commercial aircraft, helicopters, and satellites, and we are also involved in manned spaceflight.

  • For example, the Columbus Module,

  • which is part of the ISS.

  • It was built and designed by Airbus

  • and finally launched in 2008.

  • It is used for experiments in space--

  • for example, in the field of biology,

  • chemistry, material science, or medicine.

  • As you can imagine, such a module produces a lot of data.

  • To give you an idea, we have recorded

  • between 15,000 and 20,000 parameters per second

  • for the last 10 years.

  • And every second, we receive another 17,000 parameters.

  • So we are talking about trillions and trillions

  • of data points.

  • But what does this data actually represent?

  • Well, it could be any kind of sensor data,

  • and we want to detect anomalies in it to prevent accidents.

  • Let me give you an example.

  • If you have a power surge in your electrical system,

  • this could cause a fire.

  • If you have a drop in temperature or pressure,

  • this could mean that you have a hole in your module

  • that you need to fix.

  • So it is very important that we detect these anomalies

  • and fix them before something can happen.

  • Because just imagine for a second

  • it is you up there, 400 kilometers above earth,

  • and you only have these few inches of metal and plastic

  • to protect you from space.

  • It's a cold and dark place without any oxygen.

  • If something happened to this little protective layer of yours, you would be in a life-threatening situation.

  • This is why we already have countless autonomous systems

  • on board the spacecraft to prevent

  • these kinds of accidents.

  • But with more and more automation, we can handle more complex data streams, increasing the precision of our predictions.

  • But it is not about safety alone.

  • It is also about plannability and predictive maintenance,

  • because the sooner we know that a certain part needs

  • replacement, the sooner we can schedule and plan

  • a supply mission, decreasing the cost of said launches.

  • How does this work?

  • Well, right now this is a manual process.

  • Our automation works in parallel with the engineers.

  • Let me run you through it.

  • We have our database on premise, storing all these trillions

  • and trillions of data points.

  • And then we use a Spark cluster to extract the data

  • and remove secret parts from it, because some things we are not

  • allowed to upload.

  • Then we use TFX on Kubeflow to train the model.

  • First, we use TF Transform to prepare the data,

  • and then we use TF Estimator to train

  • a model which tries to represent the normal state

  • of a subsystem--

  • so a state without any anomalies.

  • Once we've done enough hyperparameter tuning and are happy with the model, we deploy it using TF Serving.

  • Now, here comes the interesting part.

  • We have the ISS streaming data to ground stations on Earth, which stream the data to our NiFi cluster running in our data center.

  • Here we remove the secret part of the data

  • again and then stream it into the cloud.

  • In the cloud, we have a custom Python application

  • running on Kubernetes which does the actual anomaly detection.

  • It queries the model, which tries to predict the current state of a subsystem based on whatever it has seen in the past.

  • And then, using its prediction and the reality coming from the space station, we can calculate a so-called reconstruction error.

  • If this error is above a certain threshold, we can use it as an indicator of an anomaly, as sketched below.
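
In outline, the check might look like this (a sketch under assumed array shapes; the threshold would be tuned per subsystem, and none of this is Airbus's actual code):

```python
import numpy as np

def detect_anomaly(model, window: np.ndarray, threshold: float):
    """Flag a telemetry window as anomalous when the autoencoder
    reconstructs it poorly. `window` has shape (timesteps, num_params)."""
    # Add a batch dimension, reconstruct, then drop the batch dimension.
    reconstruction = model.predict(window[np.newaxis, ...])[0]
    # Mean squared reconstruction error across the whole window.
    error = float(np.mean(np.square(window - reconstruction)))
    return error > threshold, error
```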

  • Now, if we have an anomaly, we create a report and then compare it against a database of previous anomalies, because if something like this has happened in the past, we can reuse the information we have on it.

  • And then the final step is to hand this over

  • to an engineer, who will then fix the problem.

  • And this is very, very important to us,

  • because we are talking about human life on that space

  • station, so we want a human to make the final decision

  • in this process.

  • But this is the TF Dev Summit, so I want to have at least one slide about our model architecture.

  • We are using an LSTM-based autoencoder with dropout, and we replaced the inner layers (the layers between the encoder and decoder) with LSTMs instead of dense layers, because our tests have shown that sequences simply represent our kind of information better, producing fewer false positives.
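
A minimal Keras sketch of that kind of model. Layer sizes, dropout rate, and the window shape are illustrative assumptions, not Airbus's actual configuration:

```python
import tensorflow as tf

TIMESTEPS, NUM_PARAMS = 60, 128  # illustrative telemetry window shape

def build_lstm_autoencoder() -> tf.keras.Model:
    inputs = tf.keras.layers.Input(shape=(TIMESTEPS, NUM_PARAMS))
    # Encoder: compress each telemetry window into a narrower sequence.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
    x = tf.keras.layers.Dropout(0.2)(x)
    # Inner layers are LSTMs rather than dense layers, as described above.
    x = tf.keras.layers.LSTM(16, return_sequences=True)(x)
    # Decoder: expand back out and reconstruct the original parameters.
    x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_PARAMS))(x)

    model = tf.keras.Model(inputs, outputs)
    # Trained to reproduce its input (fit(x, x)), so the reconstruction
    # error is the anomaly signal.
    model.compile(optimizer='adam', loss='mse')
    return model
```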

  • What kind of impact did this project have?

  • Well, we have been able to reduce our cost by about 44%.

  • Some of this is projected because, as I said earlier,

  • we are running in parallel right now.

  • But the cost benefit mainly comes from the fact that our engineers can dedicate more of their time to more important, less repetitive tasks, the kind of tasks for which you really need human creativity and intuition.

  • Also, our response time has decreased.

  • In the past, it could take hours, days, or sometimes even

  • weeks to find and fix a problem.

  • Now, we are talking about minutes, maybe hours.

  • Another benefit is that now we have a central storage

  • of all anomalies that ever happened,

  • plus how they have been fixed.

  • This is not just good for our customers, because we have better documentation, but also great for us because, for example, it simplifies the onboarding process for new colleagues.

  • Next steps, looking into the future.

  • We want to extend the solution to more subsystems and more

  • products, like Bartolomeo, which is the latest

  • addition to the Columbus Module and scheduled to be launched

  • later this month.

  • But overall, these kinds of technologies

  • are very important for future space missions,

  • because as plans to go to moon and Mars become more and more

  • concrete, we need new ways and new technologies

  • to tackle problems like latency and the limited amount

  • of computational hardware onboard the spacecraft.

  • Coming back to TFX, we want to integrate more components--

  • for example, the model validator--

  • because now we have more labeled data

  • that allows us to actually do this automatic model

  • validation.

  • And finally, the migration to TF 2,

  • which is ongoing but very important to us, because

  • of course, we want to keep up and use the latest

  • version of TensorFlow.

  • If you have any questions or want

  • to learn more about the challenges

  • we faced during that project, have a look at the Google blog.

  • We will publish a blog post in the coming weeks

  • that goes into more detail than I could do in 10 minutes.

  • Before I close, I want to especially thank

  • [? Philip ?] [INAUDIBLE], Jonas Hansen,

  • as well as everyone else of the ISS analytics

  • team for their incredible work and support,

  • as well as anyone else who helped to prepare this talk.

  • If you have any questions, feel free to find me

  • on the internet.

  • Write me those questions--

  • I'm happy to answer them.

  • And also, I will be available in the Q&A

  • section of this livestream.

  • Thank you very much and goodbye.

  • ZHITAO LI: Thank you, Marcel, for the great video.

  • And if you want to learn more about TFX, please check out our website, which has all the blog posts, user guides, tutorials, and API docs.

  • Please also feel free to check out our GitHub repo with all the source code, and engage with us on the Google Group.

  • [MUSIC PLAYING]

