Distributed TensorFlow model training on Cloud AI Platform (TF Dev Summit '20)

[MUSIC PLAYING]

YANG FAN: I'm Yang Fan. I'm a senior engineer on the machine learning platform team at Cruise. Today, I'd like to tell you about some of the work we have done over the last year: how the platform team built machine learning training infrastructure for Cruise.

So Cruise is a San Francisco-based company. We are building the world's most advanced self-driving ridesharing technology, operating on the streets of San Francisco. If you visit the city, you may have a chance to see our orange test cars running on the streets.

In order to operate a driverless ridesharing service in San Francisco, our cars have to handle many complex situations. Every day, our test vehicles run through many busy streets. They interact with double-parked cars, cyclists, delivery trucks, emergency vehicles, pedestrians, and even their pets. So we see many interesting things on streets all over the city. Our cars are designed to handle most of those situations on their own. On our cars, multiple cameras and [INAUDIBLE] sensors detect the surrounding environment and make decisions at the same time. Many of these tasks are mission critical and driven by machine learning models.

Those machine learning models are trained on a lot of data. Here you can see some of the raw data. Our training data format is very complex and high-dimensional. We have two work streams in model development. One is continuously retraining models with new parameters and data. The other is experimenting and developing new models. On the right side, the chart shows the number of model training jobs per week. As you can see, the number of training jobs varies from week to week. In the meantime, we want to train models fast, but not at too high a cost.

As a platform team, we want to fulfill requirements for both our machine learning engineers and the company. On the engineering side, engineers want to train models at any time, with tools and supported frameworks that are flexible and don't impose too many constraints. As soon as they submit training jobs, they want to see the jobs start and see results as early as possible. More importantly, they want an extremely natural and easy experience using our platform. On the company side, we need to make sure that we spend every penny wisely, which means we need to run our model training jobs efficiently. More importantly, we need to allow our engineers to focus on building the mission-critical software that impacts car performance.

In order to fulfill those requirements, we decided to build our model training on top of Google Cloud AI Platform. AI Platform offers a fully managed training service through command-line tools and web APIs. We can launch our jobs on a single machine or on as many machines as our quota allows, so we can let our engineers scale up their training jobs if they want to. AI Platform also provides very customizable hardware. We can launch training jobs on combinations of different GPU, CPU, and memory configurations, as long as they are compatible with each other. The AI Platform training service also provides good connectivity to other Google services, for example Cloud Storage and BigQuery.

In order to train models on multiple machines efficiently, we also need a good distributed training strategy. We decided to use Horovod, a distributed training framework open-sourced by Uber. (A minimal example of the kind of change it requires follows below.)
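To give a concrete sense of what adopting Horovod looks like, here is a minimal TF 1.x-style sketch. The toy model, the step count, and the checkpoint path are placeholders made up for illustration; this is not Cruise's actual training code.

import numpy as np
import tensorflow as tf            # TF 1.x-style APIs, matching the era of this talk
import horovod.tensorflow as hvd

hvd.init()                                           # one training process per GPU

# Pin each process to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy regression problem, just to keep the sketch self-contained.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.losses.mean_squared_error(y, tf.layers.dense(x, 1))

# Scale the learning rate by the worker count and wrap the optimizer so that
# gradients are averaged across workers with ring all-reduce.
opt = hvd.DistributedOptimizer(tf.train.AdamOptimizer(1e-3 * hvd.size()))
train_op = opt.minimize(loss, global_step=tf.train.get_or_create_global_step())

# Broadcast initial variables from rank 0 so every worker starts identically;
# only rank 0 writes checkpoints.
hooks = [hvd.BroadcastGlobalVariablesHook(0), tf.train.StopAtStepHook(last_step=1000)]
ckpt_dir = '/tmp/horovod_ckpts' if hvd.rank() == 0 else None

with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, hooks=hooks,
                                       config=config) as sess:
    while not sess.should_stop():
        xs = np.random.rand(32, 10).astype(np.float32)
        sess.run(train_op, feed_dict={x: xs, y: xs.mean(axis=1, keepdims=True)})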
With a few lines changed in our model training code, we can scale training jobs beyond the single-machine limit. So far, we have tested training models on anywhere from 16 GPUs to 256 GPUs. As we scale up the training cluster, we notice that efficiency decreases. That is mostly because of communication overhead: the more GPUs there are, the more communication there is between them.

To find the most efficient setup for a training job, at least two factors need to be considered. One is the unit cost. The other is the total training time. The chart on the right side demonstrates one example training job: training on one million images at one image per second per GPU, using NVIDIA [INAUDIBLE] GPUs on high-memory machines with 64 cores. As you would imagine, using more GPUs reduces the total training time. However, too many GPUs also reduces the scaling efficiency, so the blue line on the chart flattens out from left to right as the number of GPUs increases. And while more GPUs save training time, the average cost increases, so the red line showing total cost goes up from left to right. For this particular job, we found that using between 40 and 50 GPUs was the most cost effective. We decided to build an automated system to provide the best setup to our users. (A back-of-the-envelope version of this cost sweep appears after this section.)

This is the high-level architecture diagram of our distributed training system. Users interact with a command-line tool. The command-line tool packages the code and dependencies, together with the parameters, into a training job, then submits the job to our governance service. The service examines the job parameters to prevent any accidental misconfiguration, for example using too many GPUs or too much memory. At the same time, the service translates the compute requirements into an actual machine type on AI Platform, so users don't need to memorize all the different machine types or the cost model themselves. (A sketch of this translation also follows below.) All of our training jobs are registered in a [INAUDIBLE] tracking system, where we keep references to the job source code and its parameters. When we want to analyze model performance, we can trace back to the job information if needed.

The training input and output are stored in Google Cloud Storage, or GCS for short. GCS provides cost-efficient storage and high throughput for our model training jobs. Once a job starts to produce metrics and results, users can also view them through the [INAUDIBLE] service. While the job is running, our monitoring service constantly pulls the job metrics through the AI Platform APIs and feeds them into our monitoring dashboards. So far, we care about GPU and CPU utilization and job duration. When utilization is too low, it could mean the job is stuck, or that it doesn't need that much resource. Then we can notify the users to inspect the job or adjust the machine setup to save some cost.

Our internal framework is a bridge between the training system and our users. It provides the [INAUDIBLE] and the [INAUDIBLE] interactions with the training backend, so our engineers can focus on writing model code. Users can also switch the distributed training strategy by changing one line in the configuration. As I mentioned on the previous slide, the command-line tool automatically packages the code and submits the training job, and the framework allows users to specify compute simply as a number of GPUs.
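To make the cost-versus-time tradeoff concrete, here is a back-of-the-envelope sweep in the spirit of that chart. The prices and the scaling-efficiency model are made-up placeholders; you would plug in your own measured throughput and billing rates.

# Sweep cluster sizes for a job that processes one million images at
# one image per second per GPU, as in the example above.
IMAGES = 1_000_000
IMG_PER_SEC_PER_GPU = 1.0
GPU_HOURLY = 2.50       # hypothetical $/GPU-hour
HOST_HOURLY = 3.80      # hypothetical $/hour for a 64-core high-memory host
GPUS_PER_HOST = 8

def scaling_efficiency(n_gpus, overhead_per_gpu=0.005):
    """Toy model: each additional GPU adds a little communication overhead."""
    return 1.0 / (1.0 + overhead_per_gpu * (n_gpus - 1))

for n in (8, 16, 32, 48, 64, 128, 256):
    eff = scaling_efficiency(n)
    images_per_sec = n * IMG_PER_SEC_PER_GPU * eff
    hours = IMAGES / images_per_sec / 3600.0
    hosts = (n + GPUS_PER_HOST - 1) // GPUS_PER_HOST
    cost = hours * (n * GPU_HOURLY + hosts * HOST_HOURLY)
    print(f"{n:4d} GPUs  efficiency={eff:.2f}  time={hours:6.2f} h  cost=${cost:8.2f}")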
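And here is a hypothetical sketch of the governance-style translation step: validate the request, then turn a plain GPU count into an AI Platform training configuration. The limits, machine choices, and accelerator type are invented for illustration, and the field names follow the public AI Platform TrainingInput schema as best I recall it, so treat them as assumptions rather than the team's actual service.

MAX_GPUS = 256          # hypothetical per-job limit enforced by the governance service
GPUS_PER_NODE = 8       # assume up to 8 accelerators per node

def build_training_input(num_gpus, region="us-central1"):
    if not 1 <= num_gpus <= MAX_GPUS:
        raise ValueError(f"GPU count {num_gpus} outside allowed range 1..{MAX_GPUS}")
    if num_gpus > GPUS_PER_NODE and num_gpus % GPUS_PER_NODE:
        raise ValueError("multi-node jobs must request a multiple of 8 GPUs")

    accel = {"type": "NVIDIA_TESLA_V100", "count": min(num_gpus, GPUS_PER_NODE)}
    config = {
        "scaleTier": "CUSTOM",
        "region": region,
        "masterType": "n1-highmem-64",
        "masterConfig": {"acceleratorConfig": accel},
    }
    workers = num_gpus // GPUS_PER_NODE - 1        # the master counts as one node
    if workers > 0:
        config.update({"workerType": "n1-highmem-64",
                       "workerCount": workers,
                       "workerConfig": {"acceleratorConfig": accel}})
    return config

print(build_training_input(64))   # e.g. 1 master + 7 workers, 8 GPUs each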
Then our training framework asks the backend to figure out the actual machine setup. The framework also provides the interfaces for the governance and monitoring services.

This slide demonstrates how users turn on Horovod training in the config. Our framework automatically wraps the optimizer in Horovod's DistributedOptimizer. The framework also changes a few other parts of the model's behavior. One important change is how the process is started. Horovod uses mpirun or horovodrun as the main process, which sends commands to the other workers in the cluster. So in order to use Horovod on AI Platform, we wrote a bootstrap script that runs on the master. From the master, the bootstrap script sends a command to each worker to fetch its GPU configuration, and then uses this information to set up the MPI parameters. The command we use to fetch the GPU information on the workers is nvidia-smi, since we have NVIDIA GPUs on the workers. However, different NVIDIA driver versions place the binary in different locations, so we had to find the binary before we could execute it. Once we had the GPU list, the script parses the command output and uses it to build the MPI command. (A rough sketch of such a bootstrap follows at the end of this section.)

Different from a regular TensorFlow training job, we don't want to pause the training process for periodic evaluation. That's because we have to keep all the workers running at the same pace. If one worker paused for evaluation, the other workers would have to wait for it, and then the entire job could fail because of a communication timeout. We decided to move the evaluation onto the master, so all workers can train at full speed. The evaluation process on the master monitors the checkpoints. Each new checkpoint triggers a new evaluation run, until the training process completes and the last checkpoint is evaluated.

The last feature we provide in the framework is error handling for system errors. Here is one example of a very long-running job. This job ran for more than 24 hours, and in the middle there was a network event that disturbed the training. One of the workers shut down due to too many errors. Because all workers have to be synchronized, shutting down one worker would have failed the entire job. Our framework detected the error, waited for the worker to come back, and then automatically restarted the job, restoring parameters from the last written checkpoint. So the job could continue and eventually complete without any human intervention.

This year, we started exploring TensorFlow 2. One of the attractive features in TensorFlow 2 is the natively supported distribution strategies. The MultiWorkerMirroredStrategy provides a better experience on top of collective ops, using either ring-based all-reduce over gRPC or NCCL, NVIDIA's collective communications library. NCCL has many cool features. For example, it has improved efficiency at an even larger scale than before; the cluster could include tens of thousands of GPUs. It also has improved topology detection: it can create either a ring or a tree, depending on the GPU interconnects. Google AI Platform is going to support the TensorFlow 2 distribution strategies out of the box, so we will no longer need to write the bootstrap script ourselves. (A minimal sketch of that strategy also appears below.)
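Here is a rough sketch of what a bootstrap like the one described above could do: locate nvidia-smi on each worker, count its GPUs, and launch mpirun with one slot per GPU. The hostnames, the training entry point, and the ssh-based transport are placeholders; the mpirun flags follow Horovod's documented MPI invocation.

import subprocess

WORKERS = ["worker-0", "worker-1"]             # placeholder hostnames for the cluster
TRAIN_CMD = ["python", "-m", "trainer.task"]   # placeholder training entry point

def gpu_count(host):
    # nvidia-smi can live at a driver-specific path, so locate it first,
    # then count the GPUs it reports on that worker.
    find_smi = "find / -name nvidia-smi -type f 2>/dev/null | head -n 1"
    smi = subprocess.check_output(["ssh", host, find_smi], text=True).strip()
    gpus = subprocess.check_output(["ssh", host, f"{smi} --list-gpus"], text=True)
    return len(gpus.strip().splitlines())

counts = {host: gpu_count(host) for host in WORKERS}
hosts_arg = ",".join(f"{h}:{c}" for h, c in counts.items())   # e.g. worker-0:8,worker-1:8
total_slots = sum(counts.values())

# Standard Horovod-over-MPI launch: one process per GPU across all workers.
subprocess.run(["mpirun", "-np", str(total_slots), "-H", hosts_arg,
                "-bind-to", "none", "-map-by", "slot",
                "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH"]
               + TRAIN_CMD, check=True)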
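A sketch of the "evaluate on the master, off the training path" idea, assuming the trainer writes checkpoints to a shared GCS directory. The path, the evaluation body, and the done-marker convention are mine, not the framework's.

import tensorflow as tf

CKPT_DIR = "gs://my-bucket/job-123/checkpoints"   # placeholder checkpoint location

def evaluate(ckpt_path):
    # Placeholder: restore the model from ckpt_path and compute eval metrics here.
    print("evaluating", ckpt_path)

# Blocks until a new checkpoint appears, yields it, and repeats; gives up if no
# new checkpoint shows up within the timeout.
for ckpt_path in tf.train.checkpoints_iterator(CKPT_DIR, timeout=3600):
    evaluate(ckpt_path)
    # Stop once the trainer has finished and dropped a sentinel file
    # (a convention invented for this sketch).
    if tf.io.gfile.exists(CKPT_DIR + "/TRAINING_DONE"):
        break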
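The error-handling behavior could be approximated by a retry wrapper like the one below: if the job dies on a transient system error, wait for the workers to recover and restart training from the latest checkpoint. The function names and the exception type are placeholders, not the framework's actual API.

import time
import tensorflow as tf

def run_with_retries(train_fn, ckpt_dir, max_retries=3, backoff_secs=300):
    """train_fn(resume_from=...) is assumed to restore weights when given a checkpoint."""
    for attempt in range(max_retries + 1):
        try:
            latest = tf.train.latest_checkpoint(ckpt_dir)   # None on a fresh start
            return train_fn(resume_from=latest)
        except RuntimeError as err:        # stand-in for a communication/worker failure
            if attempt == max_retries:
                raise
            print(f"attempt {attempt} failed ({err}); retrying in {backoff_secs}s")
            time.sleep(backoff_secs)       # give the failed worker time to come back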
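And finally, a minimal sketch of the TensorFlow 2 path mentioned above: MultiWorkerMirroredStrategy with NCCL collectives, relying on AI Platform to set TF_CONFIG for each worker. The tiny model and dataset are placeholders.

import tensorflow as tf

# TF 2.x (circa 2.1) API: collective communication can be the gRPC ring or NCCL.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

# Toy in-memory dataset just to make the sketch runnable.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([1024, 10]), tf.random.uniform([1024, 1]))).batch(64)
model.fit(dataset, epochs=3)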
So if you are building model training infrastructure for your team, I have four takeaways for you.

Firstly, try not to reinvent the wheel. There are many solutions available on the market; reach out to them. I believe they are willing to solve the problem for you. We have received a significant amount of support from the AI Platform team. The collaboration between us allowed Cruise to focus on building the best self-driving technology, while the AI Platform team could build the best machine learning service by learning about customer needs from us.

Secondly, be aware of cost. One of our responsibilities is to help the company operate sustainably. Although model training can be very expensive, there are many ways to save cost. The infra team needs to provide not only the hardware and software solution, but also the best practices for improving efficiency.

Thirdly, rely on the community. I'd like to thank Uber for open-sourcing the Horovod framework, and the TensorFlow team for answering questions, fixing bugs, and delivering new features. I also want to thank the many excellent open-source projects in the community. We learned many good things from them, and we are looking for opportunities to contribute back.

Last but not least is the developer experience. As the infra team at Cruise, we serve our internal customers, the machine learning engineers. They already face many challenges day to day; using our platform shouldn't be another one. If our tools and services can help our customers spin the wheel a little smoother and a little faster each time, eventually keeping the flywheel running will be effortless.

So once again, we are Cruise. We are building the world's most advanced self-driving technology in San Francisco. If you are interested in us, please check out our website, getcruise.com, and feel free to reach out. Thank you for listening.

[MUSIC PLAYING]