[MUSIC PLAYING]

WEI LIN: I'm a senior director of PAI, the Platform of Artificial Intelligence at Alibaba. Today I will give you a brief introduction to our work.

This is an overview of our computation platform. We have a global storage system with heterogeneous resources. On top of that we have uniform resource management to support all the different types of computation frameworks, including PAI. Here's a snapshot of the PAI user interface. People can drag and drop components and build a workflow very easily. The system runs on top of millions of CPU cores and thousands of GPUs. A single training job can scale up to a thousand workers, with billions of features and parameters. We also have a public service in our cloud for external users as well.

Here are the key points of our system design. We want to provide easy-to-use building blocks for AI application creators. We also cover the full development cycle for those developers, to give them a one-stop programming experience. At our core is our engine, which provides high performance, low cost, good flexibility, and extensibility.

Since this is a TensorFlow Dev Summit, I will talk more about our work on the deep learning engine. Our ultimate goal is to let developers focus on modeling their neural network, and let the system, PAI, help them run their model easily, efficiently, and at scale. How do we achieve that? We deeply leverage TensorFlow, because TensorFlow has a very flexible and extensible system design, and we did a lot of in-depth optimization, which is listed on the right.

Inside Alibaba, the recommendation system has billions of features, which requires a thousand workers in training. We had to enhance TensorFlow, especially the runtime and the distributed training mechanism, to better leverage the sparsity of the data. We also improved the communication layer, introducing a multi-layer allreduce ring built on top of the network hierarchy topology.
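The multi-layer allreduce ring mentioned above builds on the classic ring allreduce pattern for gradient aggregation. Below is a minimal pure-Python simulation of one ring (illustrative only, not PAI's implementation); each worker's gradient vector is split into chunks, partial sums circulate in a reduce-scatter phase, and the reduced chunks circulate again in an allgather phase. It assumes the vector length is divisible by the worker count.

```python
def ring_allreduce(worker_data):
    """Simulate ring allreduce: every worker ends up with the elementwise sum.

    worker_data: list of equal-length gradient vectors, one per worker.
    Returns one summed vector per worker (all identical after the allgather).
    """
    n = len(worker_data)
    size = len(worker_data[0]) // n  # chunk size; assumes divisibility
    # Split each worker's vector into n chunks.
    chunks = [[list(vec[i * size:(i + 1) * size]) for i in range(n)]
              for vec in worker_data]

    # Reduce-scatter: in n-1 steps, worker w sends chunk (w - step) mod n to
    # its right neighbor, which accumulates it. Afterward, worker w holds the
    # fully reduced chunk (w + 1) mod n.
    for step in range(n - 1):
        sends = [(w, (w - step) % n) for w in range(n)]
        payloads = [list(chunks[w][c]) for w, c in sends]  # snapshot first
        for (w, c), payload in zip(sends, payloads):
            dst = (w + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Allgather: circulate the reduced chunks so every worker gets all of them.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n) for w in range(n)]
        payloads = [list(chunks[w][c]) for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            chunks[(w + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]
```

In a real system each send crosses the network, so per-step traffic stays constant as workers are added; a hierarchical (multi-layer) variant runs one ring inside each machine or rack and another across them, matching the bandwidth hierarchy the talk describes.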
We also support different communication protocols, like RDMA and NCCL. In order to fully utilize new GPU architectures like [INAUDIBLE], we enhanced TensorFlow so that it can automatically run models with mixed precision. From our initial results, we achieved roughly a threefold speedup in real scenarios.

For easier programming, we worked together with the community to improve auto-parallelism. We introduced TAO, which is based on the XLA framework and performs a lot of optimization, including cost-based graph splitting, graph optimization, kernel fusion, and full-stage code generation. On the inference side, we introduced the PAI-Blade tools, which have three levels of optimization. Here are some results: on these public models we are on par with TensorRT, or slightly better.

Recently, graph neural networks have gained a lot of attention inside Alibaba. In our real scenarios we faced a more challenging graph with four properties: it is large-scale, heterogeneous, attributed, and dynamic. We had to enhance TensorFlow to solve those challenges. We also developed a general CNN inference engine on FPGA and integrated that engine with TensorFlow. We have deployed this solution in our CityBrain project in China.

PAI also provides many SDKs for developers to accelerate their research, including Reinforcement Learning, Transfer Learning, Computer Vision, and Natural Language Processing packages. They are all built on top of TensorFlow. We have already contributed a lot back, probably more than 50 PRs. I came here to make more connections and to share more of our work with the community. Thank you.

[MUSIC PLAYING]
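The graph neural network workloads described in the talk center on a neighbor-aggregation step: each node combines its own attribute vector with a pooled summary of its neighbors' vectors. A minimal pure-Python sketch of mean-pooling aggregation follows; the function name and dict-based graph representation are illustrative, not PAI's actual engine API.

```python
def aggregate_neighbors(features, adjacency):
    """One mean-pool aggregation step over an attributed graph.

    features:  dict node_id -> list of floats (the node's attribute vector)
    adjacency: dict node_id -> list of neighbor node ids
    Returns dict node_id -> self vector concatenated with the neighbor mean.
    """
    out = {}
    for node, feat in features.items():
        neighbors = adjacency.get(node, [])
        if neighbors:
            dim = len(feat)
            # Elementwise mean over neighbor feature vectors.
            mean = [sum(features[nb][d] for nb in neighbors) / len(neighbors)
                    for d in range(dim)]
        else:
            mean = [0.0] * len(feat)  # isolated node: zero neighbor summary
        out[node] = feat + mean  # concatenate self and aggregated features
    return out
```

The properties listed above are exactly what make this step hard at scale: "large-scale" means the adjacency lists do not fit one machine, "heterogeneous" means different node and edge types need different aggregators, "attributed" means the feature vectors are real payloads rather than ids, and "dynamic" means the adjacency dict changes during serving.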
PAI: Platform of A.I. in Alibaba (TF Dev Summit '19)
Subtitles published by 林宜悉 on January 14, 2021