[MUSIC PLAYING]

WEI LIN: I'm a senior director of PAI, the Platform of Artificial Intelligence at Alibaba. Today I will give you a brief introduction to our work.

This is an overview of our computation platform. We have a global storage system with heterogeneous resources. On top of that we have uniform resource management to support all the different types of computation frameworks, including PAI. Here's a snapshot of the PAI user interface. People can drag and drop components and build a workflow very easily. The system runs on top of millions of CPU cores and thousands of GPUs. A single training job can scale up to a thousand workers, with billions of features and parameters. We also have a public service in our cloud for external users as well.

Here are the key points of our system design. We want to provide easy-to-use building blocks for AI application creators. We also cover the full development cycle for those developers, to give them a one-stop programming experience. At our core is our engine, which provides high performance, low cost, good flexibility, and extensibility.

Since this is a TensorFlow Dev Summit, I will talk more about our work on the deep learning engine. Our ultimate goal is to let developers focus on modeling their neural network, and let the system, PAI, help them run their model easily, efficiently, and at scale. How do we achieve that? We deeply leverage TensorFlow, because TensorFlow has a very flexible and extensible system design, and we did a lot of in-depth optimization, which is listed on the right.

Inside Alibaba, the recommendation system has billions of features, which requires a thousand workers in training. We had to enhance TensorFlow, especially the runtime and the distributed training mechanism, to better leverage the sparsity of the data. We also improved the communication layer, introducing a multi-layer allreduce ring built on top of the network hierarchy topology.
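The multi-layer allreduce ring mentioned above builds on the classic ring allreduce pattern for gradient aggregation. Below is a minimal pure-Python simulation of one ring (illustrative only, not PAI's implementation); each worker's gradient vector is split into chunks, partial sums circulate in a reduce-scatter phase, and the reduced chunks circulate again in an allgather phase. It assumes the vector length is divisible by the worker count.

```python
def ring_allreduce(worker_data):
    """Simulate ring allreduce: every worker ends up with the elementwise sum.

    worker_data: list of equal-length gradient vectors, one per worker.
    Returns one summed vector per worker (all identical after the allgather).
    """
    n = len(worker_data)
    size = len(worker_data[0]) // n  # chunk size; assumes divisibility
    # Split each worker's vector into n chunks.
    chunks = [[list(vec[i * size:(i + 1) * size]) for i in range(n)]
              for vec in worker_data]

    # Reduce-scatter: in n-1 steps, worker w sends chunk (w - step) mod n to
    # its right neighbor, which accumulates it. Afterward, worker w holds the
    # fully reduced chunk (w + 1) mod n.
    for step in range(n - 1):
        sends = [(w, (w - step) % n) for w in range(n)]
        payloads = [list(chunks[w][c]) for w, c in sends]  # snapshot first
        for (w, c), payload in zip(sends, payloads):
            dst = (w + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Allgather: circulate the reduced chunks so every worker gets all of them.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n) for w in range(n)]
        payloads = [list(chunks[w][c]) for w, c in sends]
        for (w, c), payload in zip(sends, payloads):
            chunks[(w + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]
```

In a real system each send crosses the network, so per-step traffic stays constant as workers are added; a hierarchical (multi-layer) variant runs one ring inside each machine or rack and another across them, matching the bandwidth hierarchy the talk describes.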
We also support different communication protocols, like RDMA and NCCL. In order to fully utilize new GPU architectures like [INAUDIBLE], we enhanced TensorFlow so that it can automatically run models with mixed precision. From our initial results, we achieved roughly a threefold speedup in real scenarios.

For easier programming, we worked together with the community to improve auto-parallelism. We introduced TAO, which is based on the XLA framework and performs a lot of optimization, including cost-based graph splitting, graph optimization, kernel fusion, and full-stage code generation. On the inference side, we introduced the PAI-Blade tools, which have three levels of optimization. Here are some results: on these public models we are on par with TensorRT, or slightly better.

Recently, graph neural networks have gained a lot of attention inside Alibaba. In our real scenarios we faced a more challenging graph with four properties: it is large-scale, heterogeneous, attributed, and dynamic. We had to enhance TensorFlow to solve those challenges. We also developed a general CNN inference engine on FPGA and integrated that engine with TensorFlow. We have deployed this solution in our CityBrain project in China.

PAI also provides many SDKs for developers to accelerate their research, including Reinforcement Learning, Transfer Learning, Computer Vision, and Natural Language Processing packages. They are all built on top of TensorFlow. We have already contributed a lot back, probably more than 50 PRs. I came here to make more connections and to share more of our work with the community. Thank you.

[MUSIC PLAYING]
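The graph neural network workloads described in the talk center on a neighbor-aggregation step: each node combines its own attribute vector with a pooled summary of its neighbors' vectors. A minimal pure-Python sketch of mean-pooling aggregation follows; the function name and dict-based graph representation are illustrative, not PAI's actual engine API.

```python
def aggregate_neighbors(features, adjacency):
    """One mean-pool aggregation step over an attributed graph.

    features:  dict node_id -> list of floats (the node's attribute vector)
    adjacency: dict node_id -> list of neighbor node ids
    Returns dict node_id -> self vector concatenated with the neighbor mean.
    """
    out = {}
    for node, feat in features.items():
        neighbors = adjacency.get(node, [])
        if neighbors:
            dim = len(feat)
            # Elementwise mean over neighbor feature vectors.
            mean = [sum(features[nb][d] for nb in neighbors) / len(neighbors)
                    for d in range(dim)]
        else:
            mean = [0.0] * len(feat)  # isolated node: zero neighbor summary
        out[node] = feat + mean  # concatenate self and aggregated features
    return out
```

The properties listed above are exactly what make this step hard at scale: "large-scale" means the adjacency lists do not fit one machine, "heterogeneous" means different node and edge types need different aggregators, "attributed" means the feature vectors are real payloads rather than ids, and "dynamic" means the adjacency dict changes during serving.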
PAI: Platform of A.I. in Alibaba (TF Dev Summit '19)
Subtitles published by 林宜悉 on January 14, 2021