OSCAR RAMIREZ: All right.
Well, thank you, everyone.
So I'm Oscar Ramirez.
This is Sergio.
And we're from the TF-Agents team.
And we'll talk to you guys about our project.
So for those of you that don't know,
TF-Agents is our reinforcement learning library
built in TensorFlow.
And it's hopefully a reliable, scalable,
and easy-to-use library.
We packaged it with a lot of Colabs, examples,
and documentation to try and make it easy for people
to jump into reinforcement learning.
And we use it internally to actually solve
a lot of difficult tasks with reinforcement learning.
In our experience, it's been pretty easy to develop new RL
algorithms.
And we have a whole bunch of tests,
making it easy to configure and reproduce results.
A lot of this wouldn't be possible without everyone's
contribution, so I just want to make it clear,
this has been a team effort.
There have been a lot of 20 percenters
and external contributors.
People have come and gone within the team, as well.
And so this is right now the biggest chunk
of the current team that is working on TF-Agents.
With that, I'll let Sergio talk a bit more about RL in general.
SERGIO GUADARRAMA: Thank you, Oscar.
Hi, everyone.
So we're going to focus a little more about reinforcement
learning and how this is different from other kinds
of machine learning-- unsupervised learning,
supervised learning, and other flavors.
Here are three examples.
One is robotics, another is games,
and the other one is a recommendation system.
Those are clear examples where you can
apply reinforcement learning.
So the basic idea is--
so if you were to try to teach someone how to walk,
it's very difficult, because it's really difficult for me
to explain to you what you need to do to be able to walk--
coordinate your legs, in this case, of the robot-- or even
for a kid.
How you teach someone how to walk is really difficult.
They need to figure it out themselves.
How?
Trial and error.
You try a bunch of times.
You fall down.
You get up, and then you learn as you're falling.
And that's basically-- you can think of it like the reward
function.
You get a positive reward or a negative reward
every time you try.
So here, you can see also, even with the initial algorithms,
this thing is still just hopping, no?
After a few trials of learning, this robot
is able to move around, wobble a little bit, and then fall.
But now he can control the legs a little more.
Not quite walk, but doing better than before.
After it's fully trained, the robot
is able to walk from one place to another,
basically go to a specific location, and all those things.
So how this happens is basically summarized in this code.
Well, there's a lot of code, but the presentation
will go over the details.
Basically, this is summarizing all the pieces
you will need to be able to train a model like this,
and we will go into the details later.
So what is reinforcement learning,
and how is that different, basically, from other cases?
The idea is we have an agent that
is trying to play, in this case, or interact
with an environment.
In this case, it's like Breakout.
So basically, the idea is you need
to move the paddle to the left or to the right to hit the ball
and break the bricks on the top.
So the environment generates some observations
that the agent can observe.
The agent basically processes those observations
and generates a new action-- like whether to move the paddle
to the left or to the right.
And then based on that, they will get some reward.
In this case, it will be the score.
And then, using that information,
it will learn from this environment how to play.
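To make that loop concrete, here is a minimal sketch of the observe/act/reward cycle just described. It uses plain OpenAI Gym with a random action picker standing in for a learning agent; the environment id and the older Gym step API returning (observation, reward, done, info) are assumptions, and a real agent would update itself from the reward instead of acting randomly.

```python
import gym  # assumes OpenAI Gym with the Atari extras installed

# Minimal agent/environment loop: observe the screen, pick an action,
# receive a reward (the score change), and repeat.
env = gym.make('Breakout-v4')
observation = env.reset()
total_score = 0.0
for _ in range(1000):
    action = env.action_space.sample()                   # stand-in for the agent's decision
    observation, reward, done, info = env.step(action)   # next frame plus score change
    total_score += reward                                 # the signal a learning agent would use
    if done:
        observation = env.reset()
```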
So one thing that I think is
critical for people who have done
a lot of supervised learning is the main difference
between supervised learning
and reinforcement learning. In supervised learning,
you can think of it as, for every action
that you take, they give you a label.
An expert will have labeled that case.
That is simple.
It'll give you the right answer.
For this specific image, this is an image of a dog.
This is an image of a cat.
So you know what is the right answer,
so every time you make a mistake,
I will tell you what is the right answer to that question.
In reinforcement learning, that doesn't happen.
Basically, you are playing this game.
You are interacting with the game.
You take a bunch of actions,
and you don't know which one was the right action, which
was the correct action, and which was the wrong one.
You only know this reward function tells you, OK, you
are doing kind of OK.
You are not doing that well.
And based on that, you need to infer, basically,
what other possible actions you could have taken to improve
your reward, or maybe you're doing well now, but maybe later
you do worse.
So it's also a dynamic process going on over here.
AUDIENCE: How is the reward function
different from the label?
SERGIO GUADARRAMA: So I think the main difference is this.
The reward function is only an indicator
you are doing well or wrong, but it
doesn't tell you what is the precise action you
need to take.
The label is more like the precise outcome of the model.
You can think, in supervised learning,
I tell you what is the right action.
I tell you the right answer.
If I give you a mathematical problem, I'm going to say,
x is equal to 2.
That is the right answer.
If I tell you, you are doing well,
you don't know what was the actual answer.
You don't know if it was x equal 2 or x equal 3.
If I tell you it's the wrong answer,
you still don't know what the right answer was.
So basically that's the main difference between having
a reward function that only indicates--
it gives you some indication about whether you are doing
well or not, but doesn't give you the proper answer--
or the optimal answer, let's say.
AUDIENCE: Is the reward better to be very general instead
of very specific?
SERGIO GUADARRAMA: Mhm.
AUDIENCE: Like you are doing well,
instead of what you are moving is the right direction to go.
OSCAR RAMIREZ: It depends quite a bit on the environment.
And there is this whole problem of credit assignment.
So trying to figure out what part of your actions
were the ones that actually led to you receiving this reward.
So if you think about the robot hopping,
you could give it a reward, that may be
its current forward velocity.
And you're trying to maximize that,
and so the robot should learn to run as fast as possible.
But maybe bending the legs down so
you can push yourself forward will help you move forward
a lot faster, but maybe that action will actually move you
backwards a little bit.
And you might even get punished instantaneously
for that action, but it's part of the whole set of actions
during an episode that will lead you to moving forward.
And so the credit assignment problem is like,
all right, there are a set of actions
that we might have even gotten negative reward,
but we need to figure out that those actions led
to positive reward down the line.
And the objective is to maximize the discounted return.
So a sum of rewards over a length of time steps.
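For reference, the discounted return being maximized is usually written as

    G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{T} \gamma^k r_{t+k+1},

where \gamma \in (0, 1] is the discount factor that weights how much future rewards count relative to immediate ones.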
SERGIO GUADARRAMA: Yeah, that's a critical part.
We care about long-term value.
It's not only the immediate reward.
It's not only you telling me plus 1.
That's not so important, because what I want
to know is not whether I'm playing the game well right now.
It's whether I'm going to win the game at the end--
that's what I really care about.
Am I going to be able to move the robot to that position?
What happens in the middle-- sometimes those things are OK.
Some things are not bad.
But sometimes, I make an action.
Maybe I move one leg and I fall.
And then I could not recover.
But maybe it was a movement I did 10 steps ago
that made my leg wobble.
And now, how do I connect which action made me fall?
Sometimes it's not very clear.
Because it's multiple actions-- in some cases,
even thousands of actions-- before you
get to the end of the game, basically.
You can think also of board games.
Is this one stone really going to make you lose?
Probably there's no single stone
that's going to make you lose, but 200 positions down
the line, that stone was actually very critical.
Because that has a ripple effect on other actions
that happen later.
And then you need to be able to estimate this credit
assignment for which actions I need to change to improve
my reward, basically, overall.
So I think this illustrates, a little further,
different modes of learning.
What we said before, supervised learning
is more about the classical classroom.
There's a teacher telling you the right
answer, memorize the answer, memorize that.
And that's what we do in supervised learning.
We almost memorize the answers with some generalization.
Mostly that's what we do.
And then in reinforcement learning,
it's not so much about memorize the answer.
Because even if I do the same actions,
in a different setting, if I say,
OK, go to the kitchen in my house, and I say,
oh, go to the left, second door to the right.
And then I say, OK, now go to [? Kate's ?] house
and go to the kitchen.
If you apply the same algorithm, you
will get maybe into the bathroom.
Like, you'll go two doors to the right,
and then you go to the wrong place.
So even memorizing the answer is not good enough.
You know what I mean?
You need to adapt to the environment.
So that's what makes reinforcement learning
a little more challenging, but also more
applicable to many other problems in reality.
You need to play around.
You need to interact with the environment.
There's no such thing as, I can
think about what's going to be the best plan ahead of time
and never interact with the environment.
We tried to write down some of these things
that we just mentioned-- that you
need to interact with the environment
to be able to learn.
This is very critical.
If you don't interact, if you don't try to walk,
if the robot doesn't try to move,
it cannot learn how to walk.
So you need to interact with the environment.
Also, it will put you in weird positions and weird places,
because you may end up at the end of a corridor,
or in a [INAUDIBLE] position, or maybe even in unsafe cases.
There's another research also going on about safe RL.
How do I explore the world such that I don't break my robot?
Like, if you apply a really strong force, you may break the robot.
And you probably don't want to do that.
Because you need to keep interacting with the world.
And also, we collect data while we're training.
So as we're learning, we're collecting new data, fresh data
all the time.
So the data set is not fixed like in supervised learning.
We typically assume in supervised learning
that we have an initial data set at the beginning,
and then you just iterate over it over and over.
And here, as you learn, you get fresh data,
and then the data changes.
The distribution of the data changes as you get more data.
And you can see that also for example in a labyrinth.
You don't know where you're going.
At the beginning, you're probably lost all the time.
And you maybe end up always in the same places.
And maybe there are parts of the labyrinth you never
explore.
So you don't even know about that.
So you cannot learn about it, because you have never explored
it.
So the exploration is very critical in RL.
It's not only you want to optimize and exploit
the model that you have.
You also need to explore.
Sometimes, you actually need to do
what you think is the wrong thing, which is basically
go to the left here, because you've never been there,
just to basically explore new areas.
Another thing is like what we said
before, nobody's going to tell you what is the right answer.
And actually, in many cases there's not a single right answer.
There are multiple ways to solve the problem.
The reward only gives you an indication
of whether you are on the right path or not.
But it doesn't tell you what the right answer is.
To train these models, we use a lot
of different surrogate losses, which
also means they are not actually correlated with performance.
Usually, it's very common, and you will see in a moment--
when the model is learning, the loss goes up.
When the model is not learning, the loss goes down.
So basically, loss going down is usually a bad sign.
If your losses stay at zero, you are learning nothing.
So you will see in a second how
debugging these models becomes much trickier
than in supervised learning.
In supervised learning, we look at the loss, and our losses go down.
Beautiful.
And you take [INAUDIBLE] and the loss always goes down.
If you do something wrong, the loss goes up.
Otherwise, the loss keeps going down.
In RL, that's not the case.
First, we have multiple losses to train.
And many of them actually don't correlate with performance.
They will go up and down--
it looks like random, almost.
So it's very hard to debug or tune these algorithms
because of that.
You actually need to evaluate the model.
The losses are not enough to give you
a good sense of whether you are doing well or not.
In addition to that, that means we require multiple optimizers,
multiple networks, multiple ways to update the variables,
and all those things.
Which means the typical supervised learning training
loop, or model.fit, doesn't work for RL, basically.
There are many ways we need to update the variables.
Some of them don't use optimizers, some of them
have multiple optimizers running at different frequencies,
in different ways.
Sometimes we optimize one model against a different model
and things like that.
So basically, how we update the models
is very different from the typical way
of supervised learning, even though we use some supervised
learning losses, basically.
Some of the losses are basically supervised--
basically regression losses, something like that,
or cross-entropy.
So we use some of those losses, but in different ways, basically.
So probably, this graph is not very
surprising to most people who have used supervised learning
in the last years.
It used to be different in the past.
But now with neural networks, it usually always looks like this.
You start training your model, your classification loss
goes down.
Usually your regularization loss goes up,
because your [INAUDIBLE] is actually learning something,
so they are moving.
But your total loss, the overall total loss,
still goes down.
Regularization loss tends to stabilize, usually,
or go down after learning.
But basically, you can usually guide yourself
by your cross-entropy loss or total loss--
it's a really good guide that your model is learning.
And if the loss doesn't go down, then your model
is not learning, basically.
You know that.
I still remember when I was outside Google and trying
to train my first neural net.
And I couldn't get the loss down.
The loss was stuck.
No initialization could get it to go down.
And then I had to ask Christian Szegedy, like,
what do you do?
How did you do it?
He's like, oh, you need to initialize
the variables this way.
You have to do all these extra tricks.
And when I did all the tricks he told me,
all of a sudden the losses start going down.
But once the losses start going down,
the model starts learning very quickly.
This is what it looks like in many cases in RL.
We have the actor loss that's going up.
In this case, it's actually good,
because it's learning something.
We have this alpha loss, which is
almost like noise around zero, fluctuates quite a bit.
And the critic loss in this case just collapsed, basically.
At the beginning, it was very high, and all of a sudden
it got very small, and then it doesn't move from there.
But this model is actually good.
This model is learning well.
[CHUCKLES] So you see all these--
and there's not like a total loss.
You cannot aggregate these losses,
because each one of these losses is optimized in different part
of the model.
So we optimize each one of them individually.
But in other cases, you will see the losses--
and usually, the loss will go up, especially
when you're learning something.
You can think about it this way.
You are trying to go through the environment,
and you see a new room you've never seen.
It's going to be very surprising for the model.
So the model is going to try to fit this new data,
and it's basically going to be out of the distribution.
So the model is going to say, I don't know,
this looks really different to everything I've seen before.
So the loss goes up.
When it basically learns about this new data,
then the loss will go down again.
So it's very common that we have many patterns
that the loss goes up and down as the model starts learning
and discover more rooms and more spaces in the environment.
AUDIENCE: But how do we know the model is doing well if we
don't--
SERGIO GUADARRAMA: So we need to look basically at the reward.
So the other function that we said that we actually
compute the [INAUDIBLE] reward.
And then basically we take a model,
run it through the environment, and compute
how well it's performing.
So the loss itself doesn't tell us that.
AUDIENCE: You're talking about not
the rewards during the training, but a separate reward
where you--
SERGIO GUADARRAMA: You can do both.
You can do both.
You can compute reward during training.
And that already give you a very good signal.
AUDIENCE: But during the training,
it would be misleading.
Because you haven't explored something,
then you won't see that it wasn't really good.
SERGIO GUADARRAMA: It's still misleading, exactly.
Yeah.
So we do usually both.
OSCAR RAMIREZ: And it's even more deceiving,
because when you have a policy that you're
using to collect data to train on, most of the time it
will have some form of exploration within it.
Every 10 steps you'll do a random action,
and that will lead to wildly different rewards over time.
AUDIENCE: But why is it not misleading even if you do it
separately from training?
Because ultimately, if your policy
is such that it doesn't really explore much, it will always--
when you throw that policy into a test environment,
and you no longer modify it, whatever,
but it might still-- if the policy is just very naive
and doesn't want to explore much,
it would look great, because it does everything fine.
But how would you know that it actually hasn't left--
OSCAR RAMIREZ: So when we're doing evaluations,
we want to exploit what we've learned.
So at that point, we're trying to use
this to complete the task that we're trying to accomplish
by training these models.
And so there, we don't need to explore.
Now we're just trying to exploit what we've learned.
AUDIENCE: But if it's not ready to react
to certain things that-- like, if it hasn't explored the space
so that in common situations it would still do well,
but it hasn't explored it enough that if it encounters
some issues it doesn't know what to do,
then that would not be really reflected by the reward.
OSCAR RAMIREZ: Yeah.
So you need to evaluate over a certain number of episodes.
And Sergio has a slide on like--
SERGIO GUADARRAMA: Like, maybe--
probably what you say.
Like, actually, evaluating once is not enough.
We usually evaluate it maybe 100 times,
from different initial conditions, all of that.
And then we average.
Because it's true.
It could be, you evaluate once, maybe it looks fine.
You never went to the wrong place.
You never fell off the cliff.
You're totally fine.
But you evaluate 100 times, and one of them will go off the cliff.
Because it was going to be one situation [INAUDIBLE] as well.
AUDIENCE: Also, do you mind clarifying
the three types of losses, what they correspond to?
SERGIO GUADARRAMA: So basically here,
the actor loss corresponds to this policy
that is acting in the environment.
Like, I need to make a decision about which action to take.
So we have a model which is basically saying,
which action am I going to take right now.
I'm going to move the paddle to the left or to the right?
So that will be your actor.
And we have a loss to train that model.
Then the critic loss is slightly different.
It's going to say, OK, if I'm in this situation
and I were to perform this action, how good will that be?
So I can decide should I take right, or should I take left?
So it's trying to give me a sense,
is this action good in this state?
And then basically, that's what we call the critic.
And then usually, the critic is used to train the actor.
So the actor will say, oh, I'm going to go to the right.
And the critic will say, oh, if you go to the right, that's
really bad, because I gave that a negative score.
So you should go to the left.
And the critic itself learns, basically,
by seeing the rewards that we observe during training.
That gives us basically the
reward signal that the critic can learn from.
So the critic is basically regressing to those values.
So that's the loss for the critic.
And in this case, this alpha loss
is basically how much exploration, exploitation I
should do.
It's like, how much entropy do I need
to add to my model in the actor?
And usually, you want to have quite a bit
at the beginning of learning.
And then when you have a really good model,
you don't want to explore that much.
So this alpha loss is basically trying
to modulate how much entropy do I want to add in my model.
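To make those three losses less abstract, here is a schematic, soft actor-critic style sketch of how an actor loss, a critic loss, and an alpha loss can be computed. This is illustrative only and not the exact TF-Agents implementation: the tiny random batch, the toy Keras networks, and the Gaussian policy head are all placeholder assumptions.

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Toy batch of transitions and toy networks standing in for real data and models.
batch, obs_dim, act_dim = 32, 4, 2
states = tf.random.normal([batch, obs_dim])
actions = tf.random.normal([batch, act_dim])
rewards = tf.random.normal([batch, 1])
next_states = tf.random.normal([batch, obs_dim])
discount = 0.99
target_entropy = -float(act_dim)   # a common heuristic target

critic = tf.keras.Sequential([tf.keras.layers.Dense(32, 'relu'), tf.keras.layers.Dense(1)])
target_critic = tf.keras.Sequential([tf.keras.layers.Dense(32, 'relu'), tf.keras.layers.Dense(1)])
actor = tf.keras.Sequential([tf.keras.layers.Dense(32, 'relu'), tf.keras.layers.Dense(2 * act_dim)])
log_alpha = tf.Variable(0.0)

def sample_and_log_prob(obs):
    # Toy Gaussian policy head: sample an action and return its log-probability.
    mean, log_std = tf.split(actor(obs), 2, axis=-1)
    dist = tfp.distributions.Normal(mean, tf.exp(log_std))
    sample = dist.sample()
    return sample, tf.reduce_sum(dist.log_prob(sample), axis=-1, keepdims=True)

# Critic loss: regress Q(s, a) toward reward + discounted "soft" value of the next state.
next_actions, next_log_probs = sample_and_log_prob(next_states)
soft_next_q = (target_critic(tf.concat([next_states, next_actions], -1))
               - tf.exp(log_alpha) * next_log_probs)
td_target = rewards + discount * soft_next_q
critic_loss = tf.reduce_mean(
    tf.square(critic(tf.concat([states, actions], -1)) - tf.stop_gradient(td_target)))

# Actor loss: prefer actions the critic scores highly, while keeping some entropy.
new_actions, log_probs = sample_and_log_prob(states)
actor_loss = tf.reduce_mean(
    tf.exp(log_alpha) * log_probs - critic(tf.concat([states, new_actions], -1)))

# Alpha loss: tunes how much entropy (exploration) the actor should keep.
alpha_loss = -tf.reduce_mean(log_alpha * tf.stop_gradient(log_probs + target_entropy))
```

Each of these losses would be minimized by its own optimizer against its own subset of variables, which is exactly why there is no single "total loss" to watch, as discussed above.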
AUDIENCE: So I have often seen the entropy
going up during training.
But why is the actor loss in your example
also constantly going up during training?
SERGIO GUADARRAMA: In the actor loss?
AUDIENCE: The actor loss.
OSCAR RAMIREZ: Yeah.
So basically, what happens is, as I mentioned,
the actor loss is trained based on the critic also.
So basically, the actor is trying
to predict which actions it should take.
And the critic is trying to criticize: this is good,
this is bad.
So the critic is also moving.
And as the critic learns and gets better at scoring
whether this is a good action or not, then the actor
needs to adapt to that.
So this also, you can think also of this
is like a game going on a little bit.
You know, it's not exactly a game,
because they don't compete against each other.
But it's like a moving target.
And sometimes, the better the critic,
the less the actor needs to move around.
Usually it's stabilized.
The actor loss tends to stabilize way more
than the critic loss.
The critic loss I have seen in other cases--
this one is very stable.
But in many other cases, the critic loss
goes up and down much more substantially.
And going back to the question that you asked before about
how do we know we're doing well:
what I told you so far is that
there are all these losses that don't correlate.
Unless we evaluate, we actually don't
know how well we are doing.
And even more profound: if you
look at the graph on the left, there are actually
two curves, the same algorithm trying to solve the same task--
the orange and the blue.
Higher is better, so a higher return like this
means you are getting better performance.
And the orange one is actually statistically much better
than the blue one.
But the only difference between these two runs
is the random seed.
Everything else is the same.
It's the same code, the same task.
Everything is the same.
The only thing that changed is the random seed.
It's basically how the model was initialized.
AUDIENCE: The random seed for the training,
or the random seed for the evaluation?
OSCAR RAMIREZ: The random seed for the training.
Yeah.
And then for the evaluation, we will usually run probably--
I don't remember-- probably 100 random seeds
different for every time that you're evaluating here,
you would run 100.
So to tackle this, what we did-- this is work with Stephanie--
was to ask, can we actually measure
how reliable an algorithm is?
Because RL algorithms are not very reliable,
and it's really hard to compare one algorithm
to another, one task to another, and all those things.
So we basically did a lot of work.
And we have a paper and the code available to basically measure
these things.
Like, can I statistically measure, is
this algorithm better than this one?
And not only is it better, is it reliable?
Because if I train 10 times and I get 10 different answers,
maybe one of them is good.
But it's not very reliable.
I cannot apply to a real problem,
because every time I train, I get a very different answer.
So basically, the broader these curves are, the less reliable
it is, because every time I train I will get a different result--
I think for this one we trained 30 different times--
and you see some algorithms will have broader bands,
and some others will have narrow bands.
So the algorithm that have narrow bands are more reliable.
So we have ways to measure those, different metrics.
AUDIENCE: But don't you only care about the final point?
Why would you care about the intermediate points?
SERGIO GUADARRAMA: You care about both.
Think about it like this: for example,
if you cannot reliably get the final point, it's not good.
Say an algorithm--
we have some algorithms that do that;
they're not shown here, because they are so bad--
only one run in 100 will get a really high number.
You train 100 times, one of them will be really good,
99 will be really bad.
So the question is, which algorithm
do you want to use for your model?
One that gives you a really good answer
only 1 time out of 100 runs?
Or one which is maybe not as good,
but consistently gives you maybe 90% of the other one?
So basically, we provide different metrics
so you can measure all those different things.
But be mindful of what you choose.
The final score is not the only thing
you care about, usually, when comparing algorithms.
If you just want a policy, like you just
want to solve this one problem, then yeah,
the final score is the only thing you care about.
But if we want to compare algorithms, I want to compare,
can I apply this algorithm to a new task?
If I need to run it 100 times every time I change the task,
it's not going to be a very good, very reliable algorithm.
OK.
I think we're back to Oscar.
OSCAR RAMIREZ: Cool.
So now that we saw all the problems,
let's see what we actually do in TF-Agents
to try and address them and make it possible to play
with these things.
So to look at a bigger picture of the components
that we have within TF-Agents, we
have a very strict separation of how
we do our data collection versus how we do our training.
And this has been mostly out of necessity,
where we need to be able to do data
collection in a whole bunch of different types
of environments, be it in some production system,
or on actual real robots, or in simulations.
And so we need to be able to somehow deploy
these policies that were being trained by these agents,
interact with this environment, and then store all this data so
that we can then sample it for training.
And so we started looking first at what
do the environments actually look like.
And if you look at RL and a lot of the research,
there is OpenAI Gym and a lot of other environments
available through that.
And so for TF-Agents, we make all these available and easy
to use within the library.
This is just a sample of the environments
that are available.
And so for defining environments,
we have this API where we can define the environment.
Let's for example think about what happens
if we want to define Breakout.
The first thing that you need to do
is define what your observations and actions
are going to look like.
This comes back a little bit from when
we started, when we were still in TF 1,
and we really needed this information
for building the computation graph.
But it's still very useful today.
And so these specs are basically
nested structures of TensorFlow specs
that fully define the shapes and types of what
the observations will look like and what
the actions will look like.
And so we think, specifically for Breakout,
maybe the observation will be the image of the game screen.
And the actions will probably be moving the paddle left,
moving it right, and maybe firing, so
that you can actually launch the ball.
So once you've defined what your data
is going to look like, there are two main methods
in the environment that you have to define as a user--
how the environment gets reset, and how the environment gets
stepped.
And so a reset will basically initialize
the state of the environment and give you
the initial observation.
And when you're stepping, you'll receive some action.
If the state of the environment is
that we reach the final state, it will automatically
reset the environment.
Otherwise, it will use that action to transition
from your current state to a next state.
And this will give you the next state's observation
and some reward.
And we encapsulate this into a time step that includes
that kind of information.
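Putting those pieces together, a skeleton of such an environment with the TF-Agents PyEnvironment API looks roughly like this. The Breakout specifics (the screen shape and the three-action set) are illustrative assumptions, and the game logic itself is elided.

```python
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

class BreakoutLikeEnv(py_environment.PyEnvironment):
  """Skeleton of a Breakout-style environment (illustrative, not real Breakout)."""

  def __init__(self):
    super(BreakoutLikeEnv, self).__init__()
    # Actions: 0 = move left, 1 = move right, 2 = fire (assumed action set).
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
    # Observations: the game screen as an RGB image (assumed shape).
    self._observation_spec = array_spec.BoundedArraySpec(
        shape=(210, 160, 3), dtype=np.uint8, minimum=0, maximum=255, name='observation')

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    # Initialize the game state and return the first observation.
    self._screen = np.zeros((210, 160, 3), dtype=np.uint8)
    return ts.restart(self._screen)

  def _step(self, action):
    # ... apply the action to the game state and compute the score change ...
    reward, game_over = 0.0, False
    if game_over:
      return ts.termination(self._screen, reward)
    return ts.transition(self._screen, reward=reward, discount=1.0)
```

The base class handles the automatic reset mentioned above: if you step an environment whose last time step was terminal, it resets before applying the next action.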
And so if we're wanting to play Breakout,
we would create an instance of this environment.
We'll get some policy, either scripted or from some agent
that we're training.
And then we would simply iterate to try and figure out,
all right, how well are we doing over an episode?
This is basically a simplification
of what the code would look like if we were trying
to evaluate how good a specific policy is on some environment.
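That simplified loop looks roughly like the sketch below. The Gym environment id is an assumption, and a random policy stands in for the scripted or trained policy mentioned above.

```python
from tf_agents.environments import suite_gym
from tf_agents.policies import random_py_policy

# Run one episode with some policy and add up the rewards.
environment = suite_gym.load('Breakout-v4')   # assumed Gym environment id
policy = random_py_policy.RandomPyPolicy(
    environment.time_step_spec(), environment.action_spec())

time_step = environment.reset()
episode_return = 0.0
while not time_step.is_last():
    action_step = policy.action(time_step)
    time_step = environment.step(action_step.action)
    episode_return += time_step.reward
```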
In order to actually scale and train this,
it means that we actually have to be collecting a lot of data
to be able to train on these environments
and with these methods.
And so we provide the tooling to be able to parallelize this.
And so you can create multiple instances of this environment
and collect data in a batch setting, where
we have this TensorFlow wrapper around the environment that
will internally use NumPy functions to interact
with the Python environment, and will then
batch all of these instances and give us batched time
steps whenever we do the reset.
And then we can use the policy to evaluate and generate
actions for every single instance of this environment
at the same time.
And so normally when training, we'll
deploy several jobs that are doing collection
in a bunch of environments at the same time.
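A sketch of that batched setup, assuming the same Breakout environment as before and an arbitrary choice of 4 parallel copies, with a random policy as a placeholder:

```python
from tf_agents.environments import parallel_py_environment, suite_gym, tf_py_environment
from tf_agents.policies import random_tf_policy

# Several Python environments run in parallel, wrapped so TensorFlow
# sees a single batched environment.
parallel_env = parallel_py_environment.ParallelPyEnvironment(
    [lambda: suite_gym.load('Breakout-v4')] * 4)
tf_env = tf_py_environment.TFPyEnvironment(parallel_env)

policy = random_tf_policy.RandomTFPolicy(tf_env.time_step_spec(), tf_env.action_spec())
time_step = tf_env.reset()                 # batched time step, one row per environment
action_step = policy.action(time_step)     # batched actions for all instances at once
time_step = tf_env.step(action_step.action)
```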
And so once we know how to interact with the environment,
you can think of the driver and the observer.
These are basically like a For loop.
There's an example down the line.
But all of that data will be collected somewhere.
And in order to do training, what we do
is we rely on the data set APIs to be
able to sample experience out of the data sets
that we're collecting.
And the agent will be consuming this experience
and will be training the model that it has.
In most situations, it's a neural network.
In some of the algorithms, it's not even a neural network,
in examples like bandits.
And so we're trying to train this learnable policy based
purely on the experience, that is, mostly the observations
that we've done in the past.
And what this policy needs to do is,
it's a function that maps from some form of an observation
to an action.
And that's what we're trying to train in order
to maximize our long-term rewards over some episode.
And so how are these policies built?
Well, first we'll have to define some form of network to back it
or to generate the model.
In this case, we inherit from the Keras networks
and add a couple of utility things,
especially to be able to generate
copies of these networks.
And here we'll basically define, all right, we'll
have a sequential model with some conv layers, some fully
connected layers.
And then if this was, for example, for DQN,
we would have a last layer that would give us a predicted Q
value, which is basically predicting how good
this action is at a given state, and would tell us
with what probabilities we should be sampling the different kinds
of actions that we have.
And then within the call method, we'll
be taking some observation.
We'll iterate over our layers and generate some predictions
that we want to use to generate actions.
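A sketch of such a network, subclassing the TF-Agents network base class; the layer sizes and the class name are arbitrary choices for illustration.

```python
import tensorflow as tf
from tf_agents.networks import network

class MyQNetwork(network.Network):
  """Sketch of a Q-network: conv layers, dense layers, one Q value per action."""

  def __init__(self, observation_spec, num_actions, name='MyQNetwork'):
    super(MyQNetwork, self).__init__(
        input_tensor_spec=observation_spec, state_spec=(), name=name)
    self._layers_list = [
        tf.keras.layers.Conv2D(32, 8, 4, activation='relu'),
        tf.keras.layers.Conv2D(64, 4, 2, activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(num_actions),   # predicted Q value per action
    ]

  def call(self, observations, step_type=None, network_state=()):
    output = tf.cast(observations, tf.float32)
    for layer in self._layers_list:           # iterate over layers, as described above
      output = layer(output)
    return output, network_state
```

In practice, tf_agents.networks.q_network.QNetwork already provides this conv-plus-dense pattern, so you rarely need to write it by hand.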
And then we have this concept of a policy.
Depending on whatever algorithm we're trying
to train, the type of network that you're training
might be different.
And so in order to be able to generalize
across the different algorithms or agents
that we're implementing, the concept
of the policy is that it knows, given some set of networks,
how to actually use them to take observations and generate
actions.
And normally, the way we do this is
that we have a distribution method that
will take this time step and maybe some policy state--
if you're training some recurrent models, for example--
and we'll be able to apply this network
and then know how to use the output of the network
in order to generate either some form of distribution--
in some agents, this might be a deterministic distribution--
that we can then sample from.
And then when doing data collection,
we might be sampling from this distribution.
We might add some randomness to it.
When we're doing evaluations, we'd
be doing a greedy version of this policy, where we'll
take the mode of this distribution
in order to try to exploit the knowledge that we've gathered,
and try to maximize our return over the episodes
when evaluating.
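Schematically, using the two kinds of policies looks like this; it assumes an already-built agent and the batched tf_env from the earlier sketch, and the epsilon-greedy detail is just one common choice of exploration.

```python
# Sketch of policy usage (assumes `agent` and `tf_env` exist, e.g. from the
# surrounding examples).
time_step = tf_env.reset()

# During collection: the collect policy typically keeps some exploration,
# e.g. epsilon-greedy or sampling from the predicted distribution.
collect_step = agent.collect_policy.action(time_step)

# During evaluation: the (greedy) policy exploits what has been learned.
eval_step = agent.policy.action(time_step)

# Policies can also expose the underlying distribution over actions.
dist_step = agent.policy.distribution(time_step)
```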
And so one of the big things with 2.0
is that we can now rely on saved models
to export all these policies.
And this made it a lot easier to generalize
and be able to say, oh, hey, now it doesn't matter
what agent you use to train.
It doesn't matter how you generated your network.
You just have the saved model that you can call action on.
And you can deploy it on to your robots, production, wherever,
and collect data for training, for example, or for serving
the trained model.
And so within the saved model, we
generate all these concrete functions,
and save and expose an action method,
getting an initial state--
again, for the case where we have recurring models.
And we also get the training step,
which can be used for annotating the data that we're collecting.
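A minimal sketch of exporting and reloading a policy as a SavedModel; the directory path and batch size are placeholders, and `agent` and `time_step` are assumed from the surrounding examples.

```python
import tensorflow as tf
from tf_agents.policies import policy_saver

# Export the collection policy as a SavedModel.
saver = policy_saver.PolicySaver(agent.collect_policy)
saver.save('/tmp/collect_policy')            # example path

# Load it back anywhere (robot, production job, ...) without knowing the agent.
loaded_policy = tf.saved_model.load('/tmp/collect_policy')
policy_state = loaded_policy.get_initial_state(batch_size=1)   # for recurrent models
action_step = loaded_policy.action(time_step, policy_state)
```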
And right now, the one thing that we're still working on,
or that we need to work on, is that we
rely on TensorFlow Probability for a lot of the distribution
stuff that we use.
But this is not part of core TensorFlow.
And so saved models can't return distributions easily.
And so we need to work on that a little bit.
The other thing that we do is that we
generate different versions of the saved model.
Depending on whether this policy will
be used for data collection versus for evaluation,
it'll have baked in whatever exploration strategy
we have within the saved model.
And right now, I'm working on making it so that we can easily
load checkpoints into the saved model and update the variables.
Because for a lot of these methods,
when we're generating the saved models,
we have to do this very frequently.
But the computation graph that the saved model
needs to generate is the same every step.
And so right now we're saving a lot of extra stuff
that we don't need to, and so just being
able to update it on the fly--
but overall, this is much easier than what we had to do in TF 1,
where we were stashing placeholders in collections,
and then being able to rewire how we were feeding data
into the saved models.
AUDIENCE: So one question about--
you talk about distribution part in saved model.
So if your function fit into saved model,
the save is already a distributed function, then
it should be able to support--
like, you can dump--
OSCAR RAMIREZ: So we can have the distributions within it.
But we can't easily look at those distributions
and modify them when we deploy it.
Like, the return of a saved model function cannot be
a distribution object.
It can only be the output of it.
SERGIO GUADARRAMA: It can only be a tensor, basically.
The only things that the concrete functions
take in and out are tensors.
It cannot be an actual distribution, not yet.
Because the other thing is, sometimes we
need to do sampling logic.
We need to call functions that belong to the distribution
object.
AUDIENCE: I see.
SERGIO GUADARRAMA: So we do some tricks in the replay buffer
and everything, basically, where we store the information
that we need to reconstruct the distribution.
I know this object is going to be a categorical distribution,
and because I know that then I can basically
get the parameters of the categorical distribution,
rebuild the object again with these parameters.
And now I can sample, I can do all these other things
from the distribution.
Through the saved model, it's still tricky.
I mean, we can still save that information.
But it's not very clear how much information
should be part of the saved models,
or it's part of us basically monkey patching the thing
to basically get what we need.
OSCAR RAMIREZ: And the other problem with it
is that, as we export all these different saved models to do
data collection or evaluation, we
want to be able to be general to what agent trained this,
what kind of policy it really is, and what kind of network
is backing it.
And so then trying to stash all that information in there
can be tricky as well to generalize over.
And so if we come back full circle now, we have all these saved models,
and all these are basically being used for data collection.
And so collecting experience, basically, we'll
have, again, some environment.
Now we have an instance of this replay buffer,
where we'll be putting all this data that we're collecting on.
And we have this concept of a driver that will basically
utilize some policy.
This could be either directly from the agent,
or it could be a saved model that's
been loaded when we're doing it on a distributed fashion.
And we define this concept of an observer, which will--
as the driver is evaluating this policy with the environment,
every observer that's passed to the driver
will be able to take a look at the trajectory that
was generated at that time step and use it to do whatever.
And so in this case, we're adding it to the replay buffer.
If we're doing evaluation, we would be computing some metrics
based on the trajectories that we're observing, for example.
And so once you have that, you can actually
just run the driver and do the data collection.
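A sketch of that collection setup; the buffer size, metric, and step count are arbitrary examples, and `agent` and `tf_env` are assumed from the earlier sketches.

```python
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# A replay buffer to store collected trajectories.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=100000)

# The driver runs the collect policy in the environment and hands every
# trajectory to the observers.
avg_return = tf_metrics.AverageReturnMetric()
collect_driver = dynamic_step_driver.DynamicStepDriver(
    tf_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch, avg_return],
    num_steps=100)

collect_driver.run()   # interact with the environment and fill the replay buffer
```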
And so if we look at the agents, we
have a whole bunch of agents that
are readily available in the open-source setup.
All of these have a whole bunch of tests, both quality
and speed regression tests, as well.
And we've been fairly selective to make sure that we pick
state-of-the-art agents or methods within RL that have
proven to be relevant over longer periods of time.
Because maintaining these agents is a lot of effort,
and so we have limited manpower to actually maintain these.
So we try to be conservative on what we expose publicly.
And so looking at how agents are defined in their API,
the main things that we want to do with an agent
is be able to access different kinds of policies
that we'll be using, and then being
able to train given some experience.
And so we have a collection policy
that you would use to gather all the experience that you
want to train on.
We have a train method that you feed in experience,
and you actually get some losses out,
and that will do the updates to the model.
And then you have the actual policy
that you want to use to actually exploit things.
In most agents, this ends up being a greedy policy,
like I mentioned, where in the distribution method
we would just take the mode to actually get the best
action that we can.
And so putting it together with a network,
we instantiate some form of network that the agent expects.
We give that and some optimizer.
And there's a whole bunch of other parameters for the agent.
And then from the replay buffer, we can generate a data set.
In this case, for DQN, we need to train with transitions.
So we need like a time step, an action, and then
the time step that happened afterwards.
And so we have this num_steps parameter equal to 2.
And then we simply sample the data set and do some training.
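Put together, that wiring looks roughly like the sketch below; the hyperparameters are arbitrary examples, and `tf_env` and `replay_buffer` are assumed from the earlier sketches.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.utils import common

# Network + optimizer + agent.
q_net = q_network.QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()

# DQN trains on transitions, hence num_steps=2: (time step, action, next time step).
dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

for _ in range(1000):
    experience, _ = next(iterator)
    loss_info = agent.train(experience)
```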
And yeah.
And so normally, if you want to do this sequentially, where
you're actually doing some collection and some training,
the way that it would look is that you
have the same components, but now we
alternate between collecting some data with the driver
and the environment, and training on sampling
the data that we've collected.
So this can sometimes have a lot of different challenges
where this driver is actually executing a policy
and interacting with a Python environment outside
of the TensorFlow context.
And so a lot of the [? eager ?] utilities
have come in really, really handy for doing
a lot of these things.
And so mapping a lot of these APIs back into the overview,
if we start with the replay buffer and go clockwise,
we'll have some replay buffer that we
can sample through data sets.
We'll have the concept of an agent,
for example DqnAgent, that we can train based on this data.
This is training some form of network that we defined.
And the network is being used by the policies
that the agents can create.
We can then deploy these, either through saved models
or in the same job, and utilize the drivers to interact
with the environment, and collect experience
through these observers back into the replay buffer.
And then we can iterate between doing data collection
and training.
And then recently, we had a lot of help
with getting things to work with TPUs, and accelerators,
and distribution strategies.
And so the biggest thing here is that, in order
to keep all these accelerators actually busy,
we really need to scale up the data collection rate.
And so depending on the environments--
for example, in some cases in the robotics use cases,
you might be able to get one or two time steps a second
of data collection.
And so then you need a couple of thousand jobs just
to do enough data collection to be able to do the training.
In some other scenarios, you might be collecting data
based on user interactions, and then you
might only get one sample per user per day.
And so then you have to be able to scale that up.
And then on the distributed side,
all the data that's being collected
will be captured into some replay buffer.
And then we can just use distribution strategies
to be able to sample that and pull it in, and then
distribute it across the GPUs or TPUs to do all the training.
And then I'll give it to Sergio for a quick intro into bandits.
SERGIO GUADARRAMA: So as we have been talking about,
RL can be challenging in many cases.
So now we will go a little bit into this subset of RL,
what is called multi-armed bandits.
It simplifies some of the assumptions,
and it can be applied to a [INAUDIBLE] set of problems.
But they are much easier to train, and much,
much easier to understand.
So I want to cover this because, for many people who
are new to RL, I recommend starting with bandits first.
And then if they still don't work for your problem,
you go and look into a full RL algorithm.
And basically, the main difference
between multi-armed bandits and full RL is that
here you make a decision every time,
but it's like every time you make a decision,
the game starts again.
So one action doesn't influence the others.
So basically, there's no such thing
as long-term consequences.
So you can make a decision every single time, and that will not
influence the state of the environment in the future,
which means a lot of things you can assume
are simplified in your models.
And with this one, basically, you don't
need to worry about what actions you took in the past,
or how to do credit assignment, because now it's very clear.
If I take this action and I get some reward,
it's because of this action, because there's
nothing sequential anymore.
And also, here you don't need to plan ahead.
So basically, I don't need to think
about what's going to happen after I make
this action because it's going to have
some consequences later.
In the bandits case, we assume all the things are independent,
basically.
We assume every time you make an action,
you can start again playing the game from scratch
every single time.
This used to be done more commonly with A/B testing,
for people who know what A/B testing does.
It's like, imagine you have four different flavors of your, I
don't know, site, or problem, or four different options
you can offer to the user.
Which one is the best?
You offer all of them to different users,
and then you compute which one is the best.
And then after you figure out which one is the best,
then you serve that option to everyone.
So basically, what happens during the time
that you're offering these four options to everyone,
some people are getting not the optimal option, basically.
During the time you are exploring, figuring out
which is the best option, during that time some of the people
are not getting the best possible answer.
So that is called regret--
how much I could have done better
that I didn't do because I didn't give you the best
answer from the beginning.
So with multi-armed bandits, what it tries to do
is, as you go, adapt how much exploration it needs to do
based on how confident it is that the model is good.
So basically, it will start the same thing as A/B testing.
At the beginning, it will give a random answer to every user.
But as soon as some users say, oh, this is better, I like it,
it will start shifting and say, OK, I
should probably go to that option everybody seems
to be liking.
So as soon as you start figure out--
you are very confident your model is getting better,
then you basically start shifting and maybe serving
everyone the same answer.
So basically, the amount of regret,
how much time you have given the wrong answer, decreases faster.
So basically, the multi-armed bandit,
it tries to estimate how confident I am about my model.
When I'm not very confident, I explore.
When I become very confident, then I don't explore anymore,
I start exploiting.
One example that is typically used
for understanding multi-armed bandits is recommending movies.
You have a bunch of movies I could recommend you.
There's some probability that you may like this movie or not.
And then I have to figure out which movie to recommend you.
And then to make it even more personalized,
you can use context.
You can use user information.
You can use previous things as your context.
But the main thing is, the recommendation
I make today
doesn't influence the recommendation
I make tomorrow.
And so basically, if I knew this was the probability that you
like "Star Wars," I probably should
recommend you "Star Wars."
What happens is, before I start recommend you things,
I don't know what do you like.
Only when I start recommending you things and you
like some things and don't like other things,
then I learn about your taste, and then I
can update my model based on that.
So here, there are different algorithms in this experiment.
Here, lower is better--
this is regret.
It's like, how much did I miss by not offering you the optimal solution?
Some of them are basically very random,
and it takes forever-- they don't learn much.
Some of them just do this epsilon-greedy thing,
basically randomly give you something sometimes,
and otherwise the best.
And then there are other methods that use fancier algorithms,
like Thompson sampling or dropout Thompson sampling,
which are more advanced algorithms that basically give you
a better trade-off between exploration and exploitation.
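To make the regret idea concrete, here is a toy epsilon-greedy bandit in plain NumPy. It is purely illustrative and not the TF-Agents bandit API; the arm probabilities and epsilon value are made up.

```python
import numpy as np

# Toy epsilon-greedy bandit to illustrate regret.
true_probs = np.array([0.1, 0.5, 0.8, 0.3])   # unknown per-arm reward probabilities
counts = np.zeros_like(true_probs)
estimates = np.zeros_like(true_probs)          # running estimate of each arm's value
epsilon, regret = 0.1, 0.0

for t in range(10000):
    if np.random.rand() < epsilon:
        arm = np.random.randint(len(true_probs))   # explore a random arm
    else:
        arm = int(np.argmax(estimates))            # exploit the current best estimate
    reward = float(np.random.rand() < true_probs[arm])
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    regret += true_probs.max() - true_probs[arm]   # how much better the optimal arm was
```

Methods like UCB or Thompson sampling replace the fixed epsilon with an exploration amount driven by uncertainty estimates, which is why their regret curves flatten faster.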
So for all those things, we have tutorials,
we have a page on everything, so you can actually
play with all these algorithms and learn.
And I usually recommend, try to apply a bandit algorithm
to your problem first.
Because it makes more assumptions, but if it works,
it's better.
It's easier to train and easier to use.
If it doesn't work, then go back to the RL algorithms.
And these are some of the ones that are currently available
within TF-Agents.
Some of them I already mentioned.
Some of them use neural networks.
Some of them are more like linear models.
Some of them use upper confidence bounds.
So they try to estimate how confident I
am about my model and all those things
to basically get this exploration/exploitation
trade-off right.
As I mentioned, you can apply it to many of the recommender
systems.
You can imagine, I want to make a recommendation,
I never know what you like.
I try different things, and then based on that,
I improve my model.
And then this model gets very complicated
when you start giving personalized recommendations.
And finally, I want to talk about a couple of things.
Some of them are about the roadmap, like where
TF-Agents is going forward.
Some of the things we already hit, but for example,
adding new algorithms and new agents.
We are working on that, for example, bootstrapped
DQN, I think, is almost ready to be open-sourced.
Before we open-source any of these algorithms, what we do
is we verify them.
We make sure they are correct, we get the right numbers.
And we also add to the continuous testing,
so they stay correct over time.
Because in the past, it would happen to us
also like, oh, we are good, it's ready, we put it out.
One week later, it doesn't work anymore.
Something changed somewhere--
who knows-- in our code base, in the TensorFlow code base,
in TensorFlow Probability.
Somewhere, something changed somewhere,
and now the performance is not the same.
So now we have this continuous testing
to make sure they stay working.
So we plan to have this leaderboard and pre-trained
model releases, and to add more distributed support,
especially for replay buffers, distributed collection,
and distributed training.
As Oscar was mentioning at the beginning,
we're also thinking in the future about adding new environments,
like Unity or other environments that people are interested in.
This is a graph that I think is relevant for people
who are like, OK, how much time do you actually
spend doing the core algorithm?
You can think of this as the blue box.
Basically, that's the algorithm itself, not the agent.
And I would say probably 25% of the total time
goes into the actual algorithm and all those things.
All the other time is spent on other things within the team.
The replay buffer is quite time-consuming.
When we did the migration from TF 1 to TF 2,
it took a really good chunk of our time
to make that migration.
Right now, you can run our library in both TF 1 and TF 2.
So we spent quite a bit of time to make sure that is possible.
All the core of the library you can run.
Only the binary is different, but the core of the library
can run in both TF 1 and TF 2.
And on usability also, we spent quite a bit of time--
like how to refine the APIs,
do I need to change this, how easy is it to use,
all those things.
And we still have a lot of work to do.
So we are not done with that.
And tooling.
All this testing, all this benchmarking,
all the continuous evaluation, all those things--
we had to build this tooling around it to basically make
it successful.
And finally, I think, for those of you who
didn't get the link at the beginning,
you can go to the TensorFlow Agents GitHub.
You can get the package via pip install.
You can start learning by using our Colabs
and tutorials, with DQN on Cartpole.
The Minitaur that we saw at the beginning,
you can go and train yourself.
And the Colab is really good.
And you can use it to solve important problems.
That's the other part we really care
about: making sure we are production quality.
The code base, the tests, everything we do--
you can deploy these models and all these things,
so you can actually use them to solve important problems.
We usually use games as examples,
because they're easy to understand and easy
to play around in.
But in many other cases, we really apply it to more real problems.
And actually, it's designed with that in mind.
We welcome contributions and pull requests.
And we try to review as best as we can with new environments,
with new algorithms, or new contributions to the library.
[MUSIC PLAYING]