[MUSIC PLAYING]

MAKOTO UCHIDA: Hello, everyone. My name is Makoto, a software engineer on TensorFlow Enterprise as part of Google Cloud. Now that we have seen great stories about TensorFlow in production at work, and its cool use cases even in space, I'm going to talk about enterprise-grade applications with TensorFlow Enterprise.

So what does it mean to be TensorFlow Enterprise? What is so different? What is so difficult? Well, after talking to many customers, we have identified a few key challenges when it comes to enterprise-grade ML.

First is scale and performance. When it comes to production-grade enterprise applications, oftentimes the size of the data and the scale of the model are beyond what fits on a laptop or workstation. As a result, we need to think about the problem differently.

Second is manageability. When developing business applications, it is better not to have to worry about the nitty-gritty details of infrastructure complexity, including managing software environments, managing multiple machines in clusters, and so on. Instead, it is desirable to concentrate only on the core business logic of your machine learning application, so that it brings the most benefit to your business.

Third is support. If your application is business critical and mission critical, timely resolution of bugs and issues, and a commitment to stable support, are essential to keep your application running.

TensorFlow Enterprise brings a solution to those challenges. Let's take a look at the cloud-scale performance. In a nutshell, with TensorFlow Enterprise we compile and ship a special build of TensorFlow, specifically optimized for Google Cloud. It is purely based on open source TensorFlow, but it also contains specialized optimizations for Google Cloud machines and services, in the form of patches and add-ons.

Let's take a look at what this looks like in practice. This code trains a model with potentially very large training data, perhaps a terabyte, maintained in Google Cloud Storage. As you can see, it is no different from typical TensorFlow code written with the Dataset API, except that the path to the training files points to Google Cloud Storage. Under the hood, an optimized I/O reader made specifically for Google Cloud Storage keeps this performant even with a terabyte of training data.

This is another example that reads training data from a BigQuery table, a data warehouse that may hold hundreds of millions of rows of business data. This example is a little more involved, but still similar to the standard Dataset API that all of you are familiar with, so your model can still train in the familiar way. Under the hood, the optimized BigQuery I/O reads many millions of rows in parallel in an efficient way and turns them into tensors, so that your training can proceed at full performance.

This is a little comparison of the throughput when large data is read from Google Cloud Storage, with and without the optimizations that TensorFlow Enterprise brings in. As you can see, there is a nice throughput gain. The better I/O throughput actually translates into better utilization of processors such as CPUs and GPUs, because I/O is no longer the bottleneck of the training. What this means is that your training finishes faster and your training wall time is shorter.
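To make the two input paths described above concrete, here is a minimal sketch. The bucket and file paths are hypothetical placeholders, not the code shown in the talk:

```python
import tensorflow as tf

# Reading TFRecord training data directly from Google Cloud Storage.
# The gs:// path below is a placeholder; any bucket you can access works.
files = tf.data.Dataset.list_files("gs://my-bucket/training-data/*.tfrecord")
dataset = (
    files.interleave(tf.data.TFRecordDataset,
                     num_parallel_calls=tf.data.experimental.AUTOTUNE)
         .batch(128)
         .prefetch(tf.data.experimental.AUTOTUNE)
)
# model.fit(dataset)  # train exactly as you would with local files
```

And a sketch of the BigQuery path, using the BigQuery reader from the tensorflow-io package. The project, dataset, table, and field names are placeholders, and the argument list follows the 2020-era API, which may have changed in later tensorflow-io releases:

```python
from tensorflow_io.bigquery import BigQueryClient

# All project/dataset/table/field names below are placeholders.
client = BigQueryClient()
session = client.read_session(
    "projects/my-project",            # parent resource
    "my-project",                     # source project
    "my_table", "my_dataset",         # table and dataset
    selected_fields=["feature_a", "feature_b", "label"],
    output_types=[tf.float64, tf.float64, tf.int64],
    requested_streams=8,              # read up to 8 streams in parallel
)
# Each element is a dict mapping field name to a tensor.
bq_dataset = session.parallel_read_rows()
```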
Shorter wall time also makes your training cheaper, because compute cost is proportional to the time you hold the compute resources.

Now that you have some idea of the kind of optimizations we were able to make to TensorFlow specifically for Google Cloud, let's see how you actually get it and take advantage of it. We do this through managed services. We deliver TensorFlow Enterprise through a managed environment, which we call Deep Learning VM Images and Deep Learning Containers, where the whole environment is pre-managed and pre-configured on top of standard Linux distributions. Most importantly, it has the TensorFlow Enterprise build pre-installed, together with all the dependencies, including device drivers and Python packages in the correct version combinations, as well as configuration for the other services in Google Cloud. Because these are just normal virtual machine images and container images, you can deploy them in many different ways in the cloud. Regardless of where or how you deploy them, the TensorFlow Enterprise optimizations are just there, so you get all that good performance.

To get started, you only have to pick a TensorFlow Enterprise image and the desired resources, such as CPUs and RAM, or optionally GPUs, and start the virtual machine with just one command. A moment later, you can access a machine that has the TensorFlow Enterprise build pre-installed and pre-configured, ready to use, so you can immediately start writing your code on the machine. If you prefer a notebook environment, JupyterLab is already hosted on the VM. The only thing you have to do is point your browser at the VM, open up JupyterLab and a notebook, and start writing your TensorFlow code, taking advantage of TensorFlow Enterprise.

Once you have a satisfactory model after many iterations of experimentation, it is time to train your model at full scale. It may not fit on one machine, and you may want to take advantage of the distributed training facilities that TensorFlow offers, so that you can handle the large scale of the data and the model. For this, AI Platform Training is a managed service that takes care of the distributed training cluster and all the other infrastructure complexity on your behalf. More importantly, it runs the same TensorFlow Enterprise container image, which is exactly the same environment you used to build your model, so you can be confident that your model just trains with the full-scale data under the managed training service. You simply overlay your application code on top of the TensorFlow Enterprise container image, then issue one command to start a distributed training cluster. This example brings up a cluster of 10 workers, each a larger machine with 8 GPUs attached, with all the TensorFlow Enterprise optimizations included, to train on a potentially large data set for your [INAUDIBLE] applications, as sketched below.
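Here is a minimal sketch of what such a multi-worker training script might look like, using TensorFlow's MultiWorkerMirroredStrategy (in TF 2.1 it lives under tf.distribute.experimental). The model and the synthetic input data are hypothetical stand-ins; AI Platform Training sets the TF_CONFIG environment variable on each worker, so the strategy can discover the cluster without extra configuration:

```python
import tensorflow as tf

# AI Platform Training populates TF_CONFIG on every worker; the strategy
# reads it to discover its peers. Locally it falls back to one worker.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

def make_dataset():
    # Synthetic stand-in for the GCS or BigQuery pipelines shown earlier.
    features = tf.random.uniform([1024, 8])
    labels = tf.random.uniform([1024, 1])
    return (tf.data.Dataset.from_tensor_slices((features, labels))
            .batch(64)
            .repeat())

with strategy.scope():
    # Variables created here are mirrored and kept in sync across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

model.fit(make_dataset(), epochs=10, steps_per_epoch=100)
```

In practice, a script like this would be packaged on top of the TensorFlow Enterprise container image and submitted to the managed service, typically with the `gcloud ai-platform jobs submit training` command.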
Now that you can train your model at full enterprise scale, it is time to make it an end-to-end pipeline that keeps running in production, taking advantage of AI Platform Pipelines and TensorFlow Extended. AI Platform Pipelines is hosted on Google Kubernetes Engine, which means it can also run exactly the same TensorFlow Enterprise container image, so all the optimizations are still there, and you can still be confident that your application in the pipeline just runs, because it is all the same environment.

Once an end-to-end application runs in production, enterprise-grade support becomes essential to mitigate any risk of interruption to the operation and to keep your application running in a business-critical manner. Our way to mitigate this risk is to provide long-term support. With open source TensorFlow, we typically offer a one-year maintenance window. For TensorFlow Enterprise, we provide three years of support, including critical bug fixes and security patches. Additionally and optionally, we may backport certain functionality and features from future releases of TensorFlow as we see demand. As of today, we have TensorFlow Enterprise versions 1.15 and 2.1 as our long-term supported versions.

If your business is pushing the boundary of AI and sitting at the cutting edge, where novel applications and use cases are critical to your business model, and your business relies heavily on being able to keep innovating in this space, we want to work with you through the white-glove service program. We, the engineers and creators of both TensorFlow and Google Cloud, are willing to work with your engineers and data scientists to mitigate any bugs and issues that we may not have seen yet, to support your cutting-edge applications, to unblock you, and together advance your applications, as well as TensorFlow and TensorFlow Enterprise as a whole. Please check out the website for the details of this white-glove service program.

Looking ahead, we are really excited to keep working tightly together between the TensorFlow and Google Cloud teams. As the creators, experts, and owners of both products, we will continue to make optimizations to TensorFlow for Google Cloud. That includes better monitoring and debugging capabilities for your TensorFlow code that runs in the cloud, as well as integration of these capabilities into Google Cloud tooling, for better productivity in your applications. We are also looking at smoother integration between TensorFlow, popular high-level APIs such as Keras and Keras Tuner, and the managed training services, as well as even more managed services for TensorFlow development, for the purpose of a coherent and joyful developer experience. Please stay tuned.

This concludes my talk about TensorFlow Enterprise. For more information and details, please do check out the website. Thank you very much.

[MUSIC PLAYING]