[MUSIC PLAYING]

MAKOTO UCHIDA: Hello, everyone. My name is Makoto, a software engineer on TensorFlow Enterprise as part of Google Cloud. Now that we have seen great stories about TensorFlow in production at work, and its cool use cases even in space, I'm going to talk about enterprise-grade applications with TensorFlow Enterprise.

So what does it mean to be TensorFlow Enterprise? What is so different? What is so difficult? Well, after talking to many customers, we have identified a few key challenges when it comes to enterprise-grade ML.

First is scale and performance. When it comes to production-grade enterprise applications, oftentimes the size of the data and the scale of the model are beyond what fits on a laptop or workstation. As a result, we need to think about the problem differently.

Second is manageability. When developing business applications, it is better not to have to worry about the nitty-gritty details of infrastructure complexity, including managing software environments, managing multiple machines in clusters, and so on. Instead, it is desirable to concentrate only on the core business logic of your machine learning application, so that it brings the most benefit to your business.

Third is support. If your application is business critical and mission critical, timely resolution of bugs and issues, and a commitment to stable support, are essential to keep your application running.

TensorFlow Enterprise brings a solution to those challenges. Let's take a look at the cloud-scale performance. In a nutshell, with TensorFlow Enterprise we compile and ship a special build of TensorFlow, specifically optimized for Google Cloud. It is purely based on open source TensorFlow, but it also contains specialized optimizations for Google Cloud machines and services, in the form of patches and add-ons.

Let's take a look at what this looks like in practice. This code trains a model with potentially very large training data, perhaps a terabyte, maintained in Google Cloud Storage. As you can see, it is no different from typical TensorFlow code written with the Dataset API, except that the path to the training files points to Google Cloud Storage. Under the hood, an optimized I/O reader made specifically for Google Cloud Storage keeps this performant even with a terabyte of training data.

This is another example that reads training data from a BigQuery table, a data warehouse that may hold hundreds of millions of rows of business data. This example is a little more involved, but still similar to the standard Dataset API that all of you are familiar with, so your model can still train in the familiar way. Under the hood, the optimized BigQuery I/O reads many millions of rows in parallel in an efficient way and turns them into tensors, so that your training can proceed at full performance.

This is a little comparison of the throughput when large data is read from Google Cloud Storage, with and without the optimizations that TensorFlow Enterprise brings in. As you can see, there is a nice throughput gain. The better I/O throughput actually translates into better utilization of processors such as CPUs and GPUs, because I/O is no longer the bottleneck of the training. What this means is that your training finishes faster and your training wall time is shorter.
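To make the two input paths described above concrete, here is a minimal sketch. The bucket and file paths are hypothetical placeholders, not the code shown in the talk:

```python
import tensorflow as tf

# Reading TFRecord training data directly from Google Cloud Storage.
# The gs:// path below is a placeholder; any bucket you can access works.
files = tf.data.Dataset.list_files("gs://my-bucket/training-data/*.tfrecord")
dataset = (
    files.interleave(tf.data.TFRecordDataset,
                     num_parallel_calls=tf.data.experimental.AUTOTUNE)
         .batch(128)
         .prefetch(tf.data.experimental.AUTOTUNE)
)
# model.fit(dataset)  # train exactly as you would with local files
```

And a sketch of the BigQuery path, using the BigQuery reader from the tensorflow-io package. The project, dataset, table, and field names are placeholders, and the argument list follows the 2020-era API, which may have changed in later tensorflow-io releases:

```python
from tensorflow_io.bigquery import BigQueryClient

# All project/dataset/table/field names below are placeholders.
client = BigQueryClient()
session = client.read_session(
    "projects/my-project",            # parent resource
    "my-project",                     # source project
    "my_table", "my_dataset",         # table and dataset
    selected_fields=["feature_a", "feature_b", "label"],
    output_types=[tf.float64, tf.float64, tf.int64],
    requested_streams=8,              # read up to 8 streams in parallel
)
# Each element is a dict mapping field name to a tensor.
bq_dataset = session.parallel_read_rows()
```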
Shorter wall time also makes your training cheaper, because compute cost is proportional to the time you hold the compute resources.

Now that you have some idea of the kind of optimizations we were able to make to TensorFlow specifically for Google Cloud, let's see how you actually get it and take advantage of it. We do this through managed services. We deliver TensorFlow Enterprise through a managed environment, which we call Deep Learning VM Images and Deep Learning Containers, where the whole environment is pre-managed and pre-configured on top of standard Linux distributions. Most importantly, it has the TensorFlow Enterprise build pre-installed, together with all the dependencies, including device drivers and Python packages in the correct version combinations, as well as configuration for the other services in Google Cloud. Because these are just normal virtual machine images and container images, you can deploy them in many different ways in the cloud. Regardless of where or how you deploy them, the TensorFlow Enterprise optimizations are just there, so you get all that good performance.

To get started, you only have to pick a TensorFlow Enterprise image and the desired resources, such as CPUs and RAM, or optionally GPUs, and start the virtual machine with just one command. A moment later, you can access a machine that has the TensorFlow Enterprise build pre-installed and pre-configured, ready to use, so you can immediately start writing your code on the machine. If you prefer a notebook environment, JupyterLab is already hosted on the VM. The only thing you have to do is point your browser at the VM, open up JupyterLab and a notebook, and start writing your TensorFlow code, taking advantage of TensorFlow Enterprise.

Once you have a satisfactory model after many iterations of experimentation, it is time to train your model at full scale. It may not fit on one machine, and you may want to take advantage of the distributed training facilities that TensorFlow offers, so that you can handle the large scale of the data and the model. For this, AI Platform Training is a managed service that takes care of the distributed training cluster and all the other infrastructure complexity on your behalf. More importantly, it runs the same TensorFlow Enterprise container image, which is exactly the same environment you used to build your model, so you can be confident that your model just trains with the full-scale data under the managed training service. You simply overlay your application code on top of the TensorFlow Enterprise container image, then issue one command to start a distributed training cluster. This example brings up a cluster of 10 workers, each a larger machine with 8 GPUs attached, with all the TensorFlow Enterprise optimizations included, to train on a potentially large data set for your [INAUDIBLE] applications, as sketched below.
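Here is a minimal sketch of what such a multi-worker training script might look like, using TensorFlow's MultiWorkerMirroredStrategy (in TF 2.1 it lives under tf.distribute.experimental). The model and the synthetic input data are hypothetical stand-ins; AI Platform Training sets the TF_CONFIG environment variable on each worker, so the strategy can discover the cluster without extra configuration:

```python
import tensorflow as tf

# AI Platform Training populates TF_CONFIG on every worker; the strategy
# reads it to discover its peers. Locally it falls back to one worker.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

def make_dataset():
    # Synthetic stand-in for the GCS or BigQuery pipelines shown earlier.
    features = tf.random.uniform([1024, 8])
    labels = tf.random.uniform([1024, 1])
    return (tf.data.Dataset.from_tensor_slices((features, labels))
            .batch(64)
            .repeat())

with strategy.scope():
    # Variables created here are mirrored and kept in sync across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

model.fit(make_dataset(), epochs=10, steps_per_epoch=100)
```

In practice, a script like this would be packaged on top of the TensorFlow Enterprise container image and submitted to the managed service, typically with the `gcloud ai-platform jobs submit training` command.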
Now that you can train your model at full enterprise scale, it is time to make it an end-to-end pipeline that keeps running in production, taking advantage of AI Platform Pipelines and TensorFlow Extended. AI Platform Pipelines is hosted on Google Kubernetes Engine, which means it can also run exactly the same TensorFlow Enterprise container image, so all the optimizations are still there, and you can still be confident that your application in the pipeline just runs, because it is all the same environment.

Once an end-to-end application runs in production, enterprise-grade support becomes essential to mitigate any risk of interruption to the operation and to keep your application running in a business-critical manner. Our way to mitigate this risk is to provide long-term support. With open source TensorFlow, we typically offer a one-year maintenance window. For TensorFlow Enterprise, we provide three years of support, including critical bug fixes and security patches. Additionally and optionally, we may backport certain functionality and features from future releases of TensorFlow as we see demand. As of today, we have TensorFlow Enterprise versions 1.15 and 2.1 as our long-term supported versions.

If your business is pushing the boundary of AI and sitting at the cutting edge, where novel applications and use cases are critical to your business model, and your business relies heavily on being able to keep innovating in this space, we want to work with you through the white-glove service program. We, the engineers and creators of both TensorFlow and Google Cloud, are willing to work with your engineers and data scientists to mitigate any bugs and issues that we may not have seen yet, to support your cutting-edge applications, to unblock you, and together advance your applications, as well as TensorFlow and TensorFlow Enterprise as a whole. Please check out the website for the details of this white-glove service program.

Looking ahead, we are really excited to keep working tightly together between the TensorFlow and Google Cloud teams. As the creators, experts, and owners of both products, we will continue to make optimizations to TensorFlow for Google Cloud. That includes better monitoring and debugging capabilities for your TensorFlow code that runs in the cloud, as well as integration of these capabilities into Google Cloud tooling, for better productivity in your applications. We are also looking at smoother integration between TensorFlow, popular high-level APIs such as Keras and Keras Tuner, and the managed training services, as well as even more managed services for TensorFlow development, for the purpose of a coherent and joyful developer experience. Please stay tuned.

This concludes my talk about TensorFlow Enterprise. For more information and details, please do check out the website. Thank you very much.

[MUSIC PLAYING]