♪ (music) ♪
Alright, hi, everybody.
I'm Alex, from the Brain Robotics team,
and in this presentation, I'll be talking about how we use simulation
and domain adaptation in some of our real-world robot learning problems.
So, first let me start by introducing robot learning.
The goal of robot learning is to use machine learning
to learn robotic skills
that work in general environments.
What we've seen so far is that
if you control your environment a lot,
you can get robots to do pretty impressive things,
and where techniques start to break down
is when you try to apply these same techniques
to more general environments.
And the thinking is that if you use machine learning,
then you can learn from your environment,
and this can help you address these generalization issues.
So, as a step in this direction,
we've been looking at the problem of robotic grasping.
This is a project that we've been working on
in collaboration with some people at X.
And to explain our problem setup a bit,
we're going to have a real robot arm
which is learning to pick up objects out of a bin.
There is going to be a camera looking down
over the shoulder of the arm into the bin,
and from this RGB image we're going to train a neural network
to learn what commands it should send to the robot
to successfully pick up objects.
Now, we want to try to solve this task using as few assumptions as possible.
So, importantly, we're not going to give any information
about the geometry of what kinds of objects we are trying to pick up,
and we're also not going to give
any information about the depth of the scene.
So in order to solve the task,
the model needs to learn hand-eye coordination:
it needs to see where the arm is within the camera image,
figure out where it is in the scene,
and then combine these two to figure out how it should move.
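To make that setup concrete, here is a minimal sketch of this kind of model in PyTorch: a CNN that looks at the over-the-shoulder RGB image, takes a candidate motor command, and predicts the probability that the grasp succeeds. The architecture, layer sizes, and command dimensionality are illustrative assumptions, not the network actually used.

import torch
import torch.nn as nn

class GraspSuccessNet(nn.Module):
    def __init__(self, command_dim=5):  # command_dim is an assumption
        super().__init__()
        # Convolutional trunk over the RGB image (no depth input).
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse image features with the candidate motor command
        # and predict grasp success.
        self.head = nn.Sequential(
            nn.Linear(64 + command_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, image, command):
        feats = self.vision(image)
        logit = self.head(torch.cat([feats, command], dim=1))
        return torch.sigmoid(logit)  # probability the grasp succeeds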
Now, in order to train this model, we're going to need a lot of data
because it's a pretty large scale image model.
And our solution at the time for this was to simply use more robots.
So this is what we call the "Arm Farm."
These are six robots collecting data in parallel.
And if you have six robots, you can collect data a lot faster
than if you only have one robot.
So using these robots, we were able to collect
over a million attempted grasps,
across thousands of robot hours in total,
and then using this data we were able
to successfully train models to learn how to pick up objects.
Now, this works, but it still took a lot of time
to collect this dataset.
So this motivated looking into ways
to reduce the amount of real-world data needed
to learn these behaviors.
One approach for doing this is simulation.
So in the left video here,
you can see the images that are going into our model
in our real world setup,
and on the right here you can see
our simulated recreation of that setup.
Now, the advantage of moving things into simulation
is that simulated robots are a lot easier to scale.
We've been able to spin up thousands of simulated robots
grasping various objects,
and using this setup we were able to collect millions of grasps
in just over eight hours,
instead of the weeks that were required for our original dataset.
Now, this is good for getting a lot of data,
but unfortunately models trained in simulation
tend not to transfer to the actual real world robot.
There are a lot of systematic differences between the two.
One big one is the visual appearances of different things.
And another big one is just physical differences
between our real-world physics
and our simulated physics.
So what we found was that we could very quickly
train our model in simulation to around 90% grasp success.
We then deployed it to the real robot,
and it succeeded just over 20% of the time,
which is a very big performance drop.
So in order to actually get good performance,
we need to do something a bit more clever.
So this motivated looking into Sim-to-Real transfer,
which is a set of transfer-learning techniques
for trying to use simulated data
to improve your real-world sample efficiency.
Now, there are a few different ways you can do this.
One approach for doing this is
adding more randomization into your simulator.
You can do this by changing around the textures
that you apply to different objects,
changing around their colors,
changing how lighting is interacting with your scene,
and you can also play around with changing the geometry of what kinds of objects
you're trying to pick up.
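As a rough sketch of what this randomization looks like in code: each episode, you resample the visual and geometric parameters of the scene. The simulator API here (sim.objects, sim.light, and so on) is hypothetical, made up for illustration; every simulator exposes its own handles for these parameters.

import random

def randomize_scene(sim):
    for obj in sim.objects:
        obj.texture = random.choice(sim.texture_library)  # random textures
        obj.color = [random.random() for _ in range(3)]   # random RGB color
        obj.scale = random.uniform(0.7, 1.3)              # perturb geometry
    # Vary how lighting interacts with the scene.
    sim.light.direction = [random.uniform(-1, 1) for _ in range(3)]
    sim.light.intensity = random.uniform(0.5, 1.5)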
Another way of doing this is domain adaptation,
which is a set of techniques for learning
when you have two domains of data that have some common structure,
but are still somewhat different.
In our case the two domains are going to be our simulated robot data
and our real robot data.
And there are feature-level ways of doing this
and there are pixel-level ways of doing this.
Now, in this work, we tried all of these approaches,
and in this presentation, I'm going to focus primarily
on the domain adaptation side of things.
So, in feature-level domain adaptation
what we're going to do is we're going to take our simulated data,
take our real data,
train the same model on both datasets,
but then at an intermediate feature layer of the network,
we're going to attach a similarity loss.
And the similarity loss is going to encourage the distribution of features
to be the same across both domains.
Now, one approach for doing this which has worked well recently
is called Domain-Adversarial Neural Networks.
And the way these work is that the similarity loss is implemented
as a small neural net that tries to predict the domain
based on the input features it's receiving,
and then the rest of the model is trying
to confuse this domain classifier as much as possible.
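Here is a minimal sketch of that idea in PyTorch, using the standard gradient-reversal trick from the DANN literature: a small classifier learns to tell sim from real at an intermediate feature layer, while the reversed gradient pushes the feature extractor to make the two domains indistinguishable. The feature dimensionality and classifier size are assumptions.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # flip the gradient

# Small net that tries to predict the domain from the features.
domain_classifier = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

def domain_loss(features, is_real, lam=1.0):
    # features: intermediate activations shared by sim and real batches;
    # is_real: float labels, 0.0 for simulated, 1.0 for real.
    reversed_feats = GradReverse.apply(features, lam)
    logits = domain_classifier(reversed_feats).squeeze(1)
    return nn.functional.binary_cross_entropy_with_logits(logits, is_real)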
Now, pixel-level methods approach the problem
from a different point of view.
Instead of trying to learn domain invariant features,
we're going to try to transform our images at the pixel level
to look more realistic.
So what we do here is we take a generative-adversarial network;
we feed it an image from our simulator,
and then it's going to output an image that looks more realistic.
And then we're going to use the output of this generator
to train whatever task model that we want to train.
Now we're going to train both
the generator and the task model at the same time.
We found that in practice, this was useful
because it helps ground the generator output
to be useful for actually training your downstream task.
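Here is a hedged sketch of that joint setup in PyTorch. The generator and discriminator are toy placeholders, the task model is assumed to have the grasp-prediction interface sketched earlier, and the discriminator's own training step is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Sequential(            # simulated image -> "realistic" image
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())
discriminator = nn.Sequential(        # image -> real-vs-generated score
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

def generator_and_task_loss(sim_img, command, grasp_label, task_model):
    fake = generator(sim_img)
    # GAN term: fool the discriminator into scoring the output as real.
    gan = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(sim_img.size(0), 1))
    # Task term: the translated image must still train the grasp model,
    # which grounds the generator output in downstream usefulness.
    # grasp_label: 0/1 grasp-success floats of shape (batch, 1).
    task = F.binary_cross_entropy(task_model(fake, command), grasp_label)
    return gan + task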
Alright. So taking a step back,
feature-level methods can learn domain-invariant features
when you have data from related domains
that aren't quite identical.
Meanwhile, pixel-level methods can transform your data
to look more like your real-world data,
but in practice they don't work perfectly,
and there are still some small artifacts
and inaccuracies from the generator output.
So our thinking went, "Why don't we simply combine both of these approaches?"
We can apply a pixel-level method
to try to transform the data as much as possible,
and this isn't going to get us all the way there,
but then we can attach a feature-level method on top of this
to try to close the reality gap even further,
and combined, these form what we call GraspGAN,
which is a combination of both
pixel-level and feature-level domain adaptation.
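Putting the pieces together, a GraspGAN-style objective looks roughly like the sketch below, reusing the hypothetical generator, task_model, and domain_loss pieces from the earlier snippets; the exact weighting and training schedule are not shown.

def graspgan_loss(sim_img, sim_cmd, sim_label, real_img, real_cmd, real_label):
    fake = generator(sim_img)  # pixel-level: translate sim toward real
    # Feature-level: domain-adversarial loss on the task model's
    # intermediate features (0 = simulated, 1 = real).
    d_loss = (domain_loss(task_model.vision(fake),
                          torch.zeros(len(sim_img))) +
              domain_loss(task_model.vision(real_img),
                          torch.ones(len(real_img))))
    # Task loss on both domains; labels are 0/1 grasp-success floats.
    t_loss = (F.binary_cross_entropy(task_model(fake, sim_cmd), sim_label) +
              F.binary_cross_entropy(task_model(real_img, real_cmd),
                                     real_label))
    return t_loss + d_loss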
In the left half of the video here
you can see a simulated grasp.
In the right half you can see the output of our generator.
And you can see that it's learning some pretty cool things
in terms of drawing what the tray should look like,
drawing more realistic textures on the arm,
drawing shadows that the objects are casting.
It's also learned how to even draw shadows
as the arm is moving around in the scene.
And it certainly isn't perfect.
There are still these little odd splotches of color around,
but it's definitely learning something
about what it means for an image to look more realistic.
Now, this is good for getting a lot of pretty images,
but what matters for our problem is whether these images
are actually useful for reducing the amount of real-world data required.
And we find that they are.
So, to explain this chart a bit:
On the x-axis is the number of real-world samples used,
and we compared the performance of different methods
as we vary the amount of real-world data given to the model.
The blue bar is our performance when we use only simulated data.
The red bar is our performance when we use only real data,
and the orange bar is our performance when we use both simulated and real data
and the domain adaptation methods that I've been talking about.
And what we found is that when we use just 2%
of our original real-world dataset
and we apply domain adaptation to it,
we're able to get the same level of performance
as training on the entire real-world dataset.
So this reduces the number of real-world samples needed
by up to 50 times, which is really exciting
in terms of not needing to run robots for a large amount of time
to learn these grasping behaviors.
Additionally, we found that even when we give
all of the real-world data to the model,
when we give simulated data as well,
we're still able to see improved performance
so that implies that we haven't yet hit the limits
of what more data can do for this grasping problem.
And finally, there's a way to train this setup
without having real-world labels,
and when we trained the model in this setting,
we found that we were still able to get pretty good performance
on the real-world robot.
Now, this was the work of a large team
across both Brain as well as X.
I'd like to thank all of my collaborators.
Here's a link to the original paper.
And I believe there is also a blog post,
if people are interested in hearing more details.
Thanks.
(applause)
♪ (music) ♪