TensorFlow and deep reinforcement learning, without a PhD (Google I/O '18)

  • Hello.

  • Welcome to this TensorFlow session about reinforcement learning.

  • So this is Yu-Han.

  • I'm Martin.

  • And today we would like to build a neural network with you.

  • Do you know about neural networks?

  • A serious question: if I say softmax or cross-entropy, raise your hand if you know what that is.

  • Like half of the room.

  • Okay.

  • Very quick primer.

  • This is a neural network.

  • Okay: layers of neurons.

  • Those neurons always do the same thing: they compute a weighted sum of all of their inputs, and then you can arrange them in layers.

  • The neurons in the first layer do a weighted sum of, let's say, pixels, if we are analyzing images. Then the neurons in the second layer will be doing weighted sums of the outputs from the first layer.

  • And if you're building a classifier, let's say here we want to classify those little square images into airplanes and non-airplanes.

  • You will end up on the last layer, which has as many neurons as you have classes.

  • If those weights are configured correctly (we'll get to that), one of those neurons will have a strong output on certain images and tell you this is an airplane, and the other one will have a strong output that tells you this is not an airplane.

  • Okay, there is one more thing in this.

  • There are activation functions.

  • So, a little bit more detail about what a neuron does, for those who like it.

  • You have the full transfer function right here.

  • So you see, it's a weighted sum, plus something called a bias.

  • That's just an additional degree of freedom.

  • And then you feed this through an activation function.

  • And in neural networks, that is always a non-linear function.

  • And to simplify, for us here only two activation functions count.

  • So for all the intermediate layers, the hidden layers, it's ReLU, the rectified linear unit.

  • That's the simplest function you can imagine.

  • You have it on the graphic.

  • Okay, just this.

  • It's nonlinear.

  • We love it.

  • Everyone uses that.

  • Let's not go any further.

  • On the last layer, though, if we are building a classifier, typically what you use is the softmax activation function, and that is an exponential followed by a normalization.

  • So here I have a different classifier that classifies into 10 classes, and you have your weighted sums coming out of your 10 final neurons.

  • And what you do in softmax is that you elevate all of them to the exponential, and then you compute the norm of that vector of 10 elements, and you divide everything by that norm.

  • The effect of that, since the exponential is a very steeply increasing function, is that it will pull the winner apart.

  • I made a little animation to show you like this.

  • This is after softmax; this is before softmax.

  • So you see much more clearly which of those neurons is indicating the winning class.

  • Okay, that's why it's called softmax.

  • It pulls the winner apart, like max, but it doesn't completely destroy the rest of the information.

  • That's why it's a soft version of max.
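  • To make that "exponential followed by a normalization" concrete, here is a minimal sketch of softmax in plain NumPy; this is just an illustration of the formula described above (the normalizer is the sum of the exponentials, i.e. the L1 norm of that all-positive vector), not the speakers' code.

```python
import numpy as np

def softmax(weighted_sums):
    """Exponentiate, then normalize so the outputs sum to one."""
    exps = np.exp(weighted_sums - np.max(weighted_sums))  # subtract the max for numerical stability
    return exps / np.sum(exps)

# Ten weighted sums coming out of ten final neurons (made-up values).
logits = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 0.2, 2.5, 0.8, 1.2, 0.1])
probs = softmax(logits)
print(probs)        # the winner (index 3) is pulled apart from the rest
print(probs.sum())  # 1.0, so the outputs can be read as probabilities
```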

  • Okay, so just those two activation functions; with that, we can build the stuff we want to build.

  • So now, coming out of our neural network, we have our, let's say here, 10 final neurons producing values which have been normalized between zero and one.

  • We can say those values are probabilities.

  • Okay, the probability of this image being in this class. How are we going to determine the weights in those weighted sums?

  • Initially, it's all just, you know, random.

  • So initially, our network doesn't actually do anything useful.

  • We do this through supervised learning.

  • So you provide it images which you have labeled beforehand.

  • You know what they are.

  • And on those images, your network is going to output this set of probabilities.

  • But you know what the correct answer is, because you are doing supervised learning, so you will encode your correct answer in a format that looks like what the network is producing. It's the simplest encoding you can think of.

  • It's called one-hot encoding, and basically it's a bunch of zeros with just a single one at the index of the class you want. So here, to represent the six, I have a vector of zeros with a one in the sixth position. And now those two vectors are very similar.

  • I can compute a distance between them, and the people who have studied this tell us: in a classifier, don't choose just any distance, use the cross-entropy distance.

  • Why? I don't know, they are smarter than me, I just follow.

  • The cross-entropy distance is computed like this.

  • So you multiply, element by element, the elements of the vector with the known answer on top by the logarithms of the probabilities you got from your neural network, and then you sum that up across the vector. This is the distance between what the network has predicted and the correct answer.

  • That's what you need if you want to train a neural network: you give it that, and it's called an error function, or loss function.

  • From there on, TensorFlow can take over and do the training for you.

  • We just needed an error function.
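  • As a concrete illustration of that distance, here is a minimal sketch of the cross-entropy between a one-hot label and predicted probabilities; the numbers are made up, and the minus sign is the standard convention so that a better prediction gives a smaller distance.

```python
import numpy as np

def cross_entropy(one_hot_label, predicted_probs):
    """Multiply the known answer, element by element, by the logs of the
    predicted probabilities, sum across the vector, and negate."""
    return -np.sum(one_hot_label * np.log(predicted_probs))

label = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])  # "this is a six", one-hot encoded
good  = np.array([.01, .01, .01, .01, .01, .01, .90, .01, .01, .02])
bad   = np.array([.10, .10, .10, .10, .10, .10, .10, .10, .10, .10])
print(cross_entropy(label, good))  # ~0.1: prediction close to the known answer
print(cross_entropy(label, bad))   # ~2.3: prediction far from the known answer
```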

  • So in a nutshell, those are the ingredients that I want you to be aware of.

  • You have neurons; they do weighted sums.

  • You have only two activation functions that you can use: the ReLU activation function on intermediate layers, or, if you are building a classifier, softmax on the last layer.

  • And the other function that we are going to use is the cross-entropy error function that I had on the previous slide.

  • Okay, all good. Now this bit of code, this is how you would write it in TensorFlow.

  • There is a high-level API in TensorFlow called layers where you can instantiate an entire layer at once.

  • You see, I instantiate the first layer here; it has 200 neurons and is activated by the ReLU activation function.

  • It's this layer here, and this also instantiates the weights and biases for this layer in the background. You don't see that; it's in the background.

  • I have a second layer here, which is this one, with just 20 neurons, again with the ReLU activation function.

  • And then I do this little final layer. It's a dense layer again, with two neurons, and even if you don't see it on the screen, it is activated by softmax. You don't see it because I use its output in the cross-entropy function, which has softmax built in.

  • So it is a softmax layer; it's just that you don't see the word softmax in the code.

  • Finally, this here is my error function, which is the distance between the correct answer, one-hot encoded, and the output from my neural net.

  • Once I have that, I can give it to TensorFlow: take an optimizer, ask it to optimize this loss, and the magic will happen.

  • So what is this magic? TensorFlow will take this error function, differentiate it relative to all the weights and all the biases, all the trainable variables in the system, and that gives it something that is mathematically called a gradient.

  • And by following this gradient, it can figure out how to adjust the weights and biases in the neural network in a way that makes this error smaller, that makes the difference between what the network predicts and what we know to be true smaller.

  • That's supervised learning.
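  • For reference, here is a minimal sketch of what a classifier written with that layers API and the cross-entropy loss can look like in TensorFlow 1.x; the layer sizes and variable names are illustrative, not the exact code on the slide.

```python
import tensorflow as tf  # TensorFlow 1.x style, as used in the talk

images = tf.placeholder(tf.float32, [None, 20 * 20])  # flattened input pixels (illustrative size)
labels = tf.placeholder(tf.float32, [None, 2])        # one-hot correct answers: plane / not plane

# Two hidden layers; tf.layers.dense creates the weights and biases in the background.
hidden1 = tf.layers.dense(images, 200, activation=tf.nn.relu)
hidden2 = tf.layers.dense(hidden1, 20, activation=tf.nn.relu)
logits = tf.layers.dense(hidden2, 2)  # softmax is applied inside the loss below

# Cross-entropy between the one-hot labels and the network output (softmax built in).
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

# Pick an optimizer and ask it to minimize the loss: this is the training step.
train_op = tf.train.AdamOptimizer(0.001).minimize(loss)
```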

  • So that was the primer.

  • Now why?

  • What do we want to build today?

  • We would like, with you, to build a neural network that plays the game of Pong.

  • And just from the pixels of the game: it's notoriously difficult to explain to a computer the rules of the game and the strategies and all that.

  • So we want to do away with all that, just take the pixels, and find some learning algorithm where it will learn to play this game.

  • And of course, that is not a goal in itself.

  • Because figuring out how to program a paddle to win Pong is super easy.

  • You just always stay in front of the ball, you know, and you win all the time.

  • And actually, we will train against such a computer-controlled agent.

  • The goal here is to explore learning algorithms, because this has applications hopefully way beyond Pong.

  • So this looks like a classifier.

  • We just learned how to build a classifier.

  • You know, we have the pixels, we will build a neural network, and it has three possible outcomes: for a given position, either you want to go up, stay still, or go down.

  • Okay, so let's try to do this by the book, as a classifier.

  • So we have a single intermediate layer of neurons, activated with ReLU of course, and then a last layer with three neurons, activated by softmax.

  • We use the cross-entropy loss, so you have the function here: it's the distance between the probabilities that this policy network (it's called a policy network) predicts for going up, staying still, or going down, and the correct move.

  • Uh, but what is the correct move?

  • I don't know.

  • That's right, Martin, you've hit on the hard part right here.

  • Unlike in the supervised learning problem, we don't know what the correct move to play is here.

  • However, the environment requires that we keep making moves of going up, staying still, or moving down, for the game to progress.

  • So what we're going to do is sample a move, which means picking one of the three possible moves of moving the paddle up, staying still, or going down, randomly. Not completely randomly, though: we're going to pick the move based on the output of the network.

  • So, for example, if the network says the output probability of going up is 0.8, then we should make it so that there's an 80% chance we choose to move up as our next move.

  • All right, so now we know how to play the game.

  • The policy network gives us probabilities; we roll a loaded dice and pick from that, and we know what next move to play.
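  • That "loaded dice" can be as simple as sampling with the policy's probabilities as weights; a minimal, purely illustrative sketch:

```python
import numpy as np

ACTIONS = ["up", "still", "down"]

def sample_move(policy_probs):
    """Roll a loaded dice: pick a move with the probabilities given by the policy network."""
    return np.random.choice(ACTIONS, p=policy_probs)

# If the network says P(up)=0.8, P(still)=0.1, P(down)=0.1,
# roughly 80% of the sampled moves will be "up".
print(sample_move([0.8, 0.1, 0.1]))
```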

  • But initially, this network is initialized with random weights, so it will be playing random moves.

  • How is that?

  • How does that inform us about the correct move to play?

  • I need to put the correct move in my formula.

  • That's right.

  • So in this case, we really want "correct move" to mean moves that will lead to winning.

  • We don't know that until someone has scored a point.

  • And so that's what we're going to do: only when someone, either our paddle or the opponent's paddle, has scored a point do we know whether we have played well or not.

  • So whenever somebody scores, we're going to give ourselves a reward: if our paddle scores the point, we give ourselves a +1 reward point.

  • And if the opponent's paddle scores, then we give ourselves a -1 reward point. And then we're going to structure our loss function slightly differently than before.

  • Over here, once again, you see something that looks very much like the cross-entropy function we saw before.

  • Now, in the middle here is the main difference.

  • There, instead of the correct label of a supervised learning problem, we're just going to put the sampled move, the move we played, the move that we happened to play. And well, some of those moves were good moves and some were bad moves.

  • And so that's why every per-move loss value is multiplied by the reward out front.

  • This way, moves that eventually lead to a winning point will get encouraged, and moves that lead to a losing point will be discouraged over time.
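  • One way to write down the loss being described here (this notation is ours, not the slides'): the per-move cross-entropy against the sampled move, weighted by the reward that move eventually earned,

$$ L = -\sum_{t} R_t \, \log p_\theta(a_t \mid s_t) $$

  • where $s_t$ is the observed frame difference at step $t$, $a_t$ is the move that was actually sampled, $p_\theta(a_t \mid s_t)$ is the probability the policy network assigned to it, and $R_t$ is the (discounted) reward credited to that move.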

  • Okay, so you do this for every move.

  • I can see how, with this little modification, it could lead to some learning.

  • But putting my mathematician's hat back on,

  • I see a big problem here.

  • You have the sampled move: that is a picking operation, a sampling operation.

  • You pick one out of three, and that is not differentiable. And to apply, you know, minimization and gradient descent and all that, the loss function must be differentiable.

  • You're right, Martin, you see the hard question here.

  • The sampled move here actually does depend on the model's weights and biases.

  • But the sampling operation is not differentiable.

  • And so we're not going to differentiate that.

  • Instead, we're going to look at the sampled moves as if they were constants. We play many games, across many, many moves, to get a lot of data, treating all of these sampled moves as if they were constant labels, and only differentiating the probabilities that are output by the model, in blue here on the screen. Those probabilities directly depend on the model's weights and biases.

  • And we can differentiate those with respect to the weights and biases. This way, we still get a valid gradient and can apply gradient descent techniques.

  • Oh, okay.

  • So you kind of cheat: the part that is problematic, you just regard it as constant.

  • You're going to play many games with the same neural network.

  • You accumulate those played moves, and you accumulate those rewards whenever you know that a point was scored.

  • And then you plug that in, and you only differentiate relative to the predicted probabilities.

  • And yes, you're right: that still gives you a gradient that depends on the weights and biases, so we should be able to do that.

  • Okay, I get it.

  • This is clever.

  • So this will actually train, probably very slowly, here.

  • We want to show you the minimum amount of stuff you need to do to get it to train.

  • But in the minimal amount, there are still two little improvements that you always want to do.

  • The first one is to discount the rewards.

  • So probably, if you lost the point, you did something wrong in the three, five, seven, ten moves right before you lost that point.

  • And probably before that, you bounced the ball correctly a couple of times, and that was correct; you don't want to discourage that.

  • So it's customary to discount the rewards backwards through time with some exponential discount factor, so that the moves you played closest to the scoring point are the ones that count the most.

  • You see it here: we discounted them with a factor of 1/2, for instance. And then there is this normalization step.

  • What is that?

  • Yeah, that's very interesting.

  • In experiments, we noticed that putting in a normalization step really helps training and makes learning faster.

  • There are multiple ways to think about it; the way I like to think about this is in the context of playing a game.

  • You see, at the beginning of training, the model only has its randomized weights and biases, so it's going to make random moves most of the time.

  • It's only going to make the right moves once in a while and score a point, but that's really by chance, by accident, and most of the time it's going to lose points.

  • And we really want to find a way to naturally put more weight on those very rare winning moves, so that it can learn what the correct moves are. Performing this reward-normalization step here naturally gives a very nice boost to the rare winning moves, so they do not get lost in a ton of losing moves.
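  • A minimal sketch of the two tweaks just described, discounting rewards backwards through time and then normalizing them; the discount factor and the helper name are illustrative.

```python
import numpy as np

def discount_and_normalize(rewards, gamma=0.5):
    """Propagate each scored point backwards with an exponential discount,
    then normalize so rare winning moves are not drowned out by losing ones."""
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:        # a point was scored here: restart the running sum
            running = 0.0
        running = rewards[t] + gamma * running
        discounted[t] = running
    # Normalization step: zero mean, unit variance across all collected moves.
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

# Example: five moves, the last one lost the point.
print(discount_and_normalize(np.array([0.0, 0.0, 0.0, 0.0, -1.0])))
```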

  • Okay, so let's go.

  • Let's train this.

  • We need to play the game and accumulate enough data to compute everything that we need for this loss function.

  • Okay, so those are the sampled moves; we need the rewards.

  • And we also need those probabilities, which the network can compute for us from the pixels.

  • So we are going to collect the pixels during game play.

  • That's what our collected dataset looks like. Okay, you have one column with the move you actually played, one column where you would like to see the probabilities predicted at that point (but you are actually going to store just the game board and run the network on it to get the probabilities), and a last column with the rewards.

  • You see, there is a plus one or minus one reward on every move that scored a point.

  • And on all the other moves, you discount that reward backwards in time with some exponential discount.

  • So that's what we want to do.

  • And once you have this, maybe you notice it in the formula, you just multiply those three columns together and sum all of that.

  • That's our loss.

  • That's how the loss is computed.

  • Let's build it.

  • You implemented this, this demo.

  • So can you walk us through the code?

  • Let's do that.

  • Um, so up here you see a few TensorFlow placeholders.

  • You should really think of them as the function arguments that are required to compute values from the model.

  • And so we have three placeholders for the inputs. The one for the observations, remember, really means the difference between two consecutive frames of gameplay.

  • Actually, yes, because we didn't say that: in the game of Pong, you don't really train from the raw pixels, because you don't see the direction of the ball from just the pixels.

  • You train from the delta between two frames, because there you see the direction of the ball.

  • That's the only wrinkle we have here which is specific to Pong.

  • All the rest is vanilla reinforcement learning that applies to many other problems.

  • That's right.

  • And the actions placeholder is just going to hold all the sampled moves, the moves that the model happened to decide to play, and the rewards placeholder will collect all the discounted rewards.

  • With those inputs, we're ready to build the model.

  • The network is really simple, like the one Martin showed before: it has a single dense hidden layer with the ReLU activation function, 200 neurons, followed by a softmax layer.

  • You don't see me calling the softmax function here, and that's really because the next step, the sampling operation, already takes in the logits, the values you get before you call the softmax function, and it can perform multinomial sampling, which means it outputs a random number, 0, 1 or 2, for the three classes, based on the probabilities specified by the logits; it has the softmax built in.

  • So this is a softmax layer.

  • Okay, well, you don't see it in the code, but it is. Okay.

  • And just a parenthesis, for those not familiar with TensorFlow: TensorFlow builds a graph of operations in memory.

  • So that's why, out of tf.multinomial, we get an operation, and then we will have an additional step to run it and actually get those predictions out.

  • And placeholders are the data you need to feed in when you actually run a node, to be able to get a numerical result.
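  • Putting the pieces just described together, here is a minimal sketch of the policy network in TensorFlow 1.x; the shapes and variable names are illustrative (the talk's actual code lives in the tensorflow-without-a-phd repository).

```python
import tensorflow as tf  # TensorFlow 1.x style

# Placeholders: the "function arguments" fed in when the graph is run.
observations = tf.placeholder(tf.float32, [None, 80 * 80])  # difference between two consecutive frames
actions      = tf.placeholder(tf.int32,   [None])           # sampled moves actually played (0, 1 or 2)
rewards      = tf.placeholder(tf.float32, [None])           # discounted, normalized rewards

# Policy network: one ReLU hidden layer of 200 neurons, then 3 output logits.
hidden = tf.layers.dense(observations, 200, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 3)  # no explicit softmax: it is built into the ops below

# Multinomial sampling takes the logits directly and returns 0, 1 or 2 per observation.
sample_op = tf.multinomial(logits, num_samples=1)
```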

  • Okay, so we have everything we need to play the game.

  • Still nothing to train.

  • So let's do the training part.

  • For training, we need a loss function.

  • So, our beloved cross-entropy loss function, computing the distance between our actions, the moves we actually played, and the logits, which are from the previous screen: that's what the network predicts from the pixels.

  • And then we modify it to use the reinforcement learning paradigm: we modify it by multiplying this per-move loss by the rewards, by the per-move rewards.

  • And now, with those rewards, moves leading to a scoring point will be encouraged, and moves leading to a losing point will be discouraged.

  • So now we have our error function.

  • TensorFlow can take over.

  • We pick one of the optimizers in the library and simply ask this optimizer to minimize our loss function, which gives us a training operation.

  • And when we run this training operation on the next slide, feeding in all the data we collected during game play, that is where the gradient will be computed.

  • And that's the operation whose run will modify the weights and biases in our policy network.
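  • Continuing the sketch above (logits, actions and rewards are the tensors defined there), the modified loss and the training operation might look like this, again in illustrative TensorFlow 1.x style:

```python
# Cross-entropy between the moves we actually played and the network's logits.
# The sparse_* variant takes integer class labels, so the sampled moves are used directly.
per_move_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=actions, logits=logits)

# Reinforcement learning twist: weight each move's loss by its discounted reward.
loss = tf.reduce_sum(rewards * per_move_loss)

# Pick an optimizer; asking it to minimize the loss gives us the training operation.
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(loss)
```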

  • So let's play this game.

  • This is what you need to do to play one game to 21 points.

  • So, the technical wrinkle is that in TensorFlow, if you want to actually execute one of those operations, you need a session.

  • So we define a session, and then, in a loop, we play a game.

  • So first we get the pixels from the game state and compute the delta between two frames.

  • Okay, that's technical.

  • And then we run session.run on this sample operation; remember, the sample operation is what we got from our tf.multinomial sampling function.

  • So that's what decides the next move to play.

  • Then we use a Pong simulator here, from OpenAI Gym.

  • We can give it this move to play, and it will play the move and give us a new game state,

  • give us a reward if a point was scored, and give us information on whether this game to 21 points is finished.

  • Okay, that's what we need.

  • Now.

  • We simply log all that stuff.

  • We log the pixels.

  • We log the move we played.

  • We log the reward if we got one. So we know how to play one game, and we will then play many of those games to collect a large backlog of moves.

  • Right?

  • That's right.
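  • Here is a minimal sketch of that game-playing loop with OpenAI Gym; it reuses the sample_op and observations placeholder from the earlier sketch, and the preprocessing and action mapping are rough, illustrative choices rather than the talk's exact code.

```python
import gym
import numpy as np
import tensorflow as tf

def preprocess(frame):
    """Rough preprocessing: crop the playing field, downsample, keep one channel, flatten."""
    return frame[35:195:2, ::2, 0].astype(np.float32).ravel() / 255.0

env = gym.make("Pong-v0")
obs_log, action_log, reward_log = [], [], []

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    frame = preprocess(env.reset())
    previous_frame = np.zeros_like(frame)
    done = False
    while not done:                          # one game, first to 21 points
        delta = frame - previous_frame       # the frame difference shows the ball's direction
        action = sess.run(sample_op, {observations: [delta]})[0][0]
        # Map the network's 0/1/2 onto Gym's Pong actions (an illustrative mapping).
        next_frame, reward, done, info = env.step(int(action) + 1)
        obs_log.append(delta)                # log the pixels (frame delta)
        action_log.append(action)            # log the move we played
        reward_log.append(reward)            # log the reward, if we got one
        previous_frame, frame = frame, preprocess(next_frame)
```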

  • Playing games, in reinforcement learning or in our experiment, really is the way of collecting data.

  • Now that we have collected a lot of data, not from playing one game only but maybe 10 games, why not, we can start processing the rewards as we planned to.

  • We discount the rewards, so that moves that did not get any reward during gameplay now carry a discounted reward, based on whether or not they eventually led to winning or losing points, and on how far they are from the moves that actually won or lost the point.

  • And then we normalize the rewards, as we explained before.

  • Now we're ready.

  • We have all the data in place.

  • We have observations.

  • That's the differences between game frames, the actions that we actually happened to play, and the rewards, so that we know whether those actions were good or bad.

  • And then we call the training op, the training op we saw a couple of slides before, which we obtained from the optimizer, and this is going to do the heavy lifting: compute the gradients for us and modify the weights just slightly.

  • So we play 10 games, we modify the weights slightly, then go back and play more games to get even more data, and we repeat the process. The expectation, or the hope, is that the model will learn to play a little better every time.

  • I'm a bit skeptical.

  • Do you really think this is going to work?

  • Well, sometimes we do see the model play a little bit worse for a while.

  • So we'll see.
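  • Tying the earlier sketches together, the overall training loop amounts to something like this; play_games is a hypothetical helper wrapping the game loop shown above, and discount_and_normalize is the earlier sketch.

```python
# Outer training loop: play a batch of games, process the rewards, take one training step.
for iteration in range(2000):  # the number of iterations is illustrative
    obs_log, action_log, reward_log = play_games(sess, num_games=10)  # hypothetical helper
    processed_rewards = discount_and_normalize(np.array(reward_log))
    sess.run(train_op, {observations: obs_log,
                        actions: action_log,
                        rewards: processed_rewards})
```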

  • Okay, Live demo.

  • I like them all.

  • Let's go.

  • Uh, let's run this game. This is a live demo.

  • I'm not completely sure that we're going to win, but we shall see.

  • So brown on this side is the computer controlled paddle.

  • It's a very simple algorithm: it just stays in front of the ball at all times, so there is only one way to win.

  • Its vertical velocity is limited, so you have to hit the ball on the very side of the paddle to send it at a very steep angle.

  • And then you can overcome the vertical velocity of the opponent.

  • That's the only way to score.

  • On the right, in green, we have our neural-network-controlled agent, and it's slightly behind right now, so we'll see if it wins.

  • And if you want, I want this side of the room to cheer for brown, and this side of the room to cheer for the AI.

  • Okay?

  • It's very even right now.

  • One.

  • Yeah.

  • Go, go, go!

  • AI is winning. AI is winning.

  • I'm happy because this is a live demo.

  • There is no guarantee that the AI will actually win.

  • One thing that is interesting here, actually, is that this is learning from just the pixels.

  • So initially, the AI had no idea of even which game it was playing, what the rules were; it had no idea even of which paddle it was playing.

  • Okay?

  • And we didn't have to explain that.

  • We just gave it the pixels.

  • And on scoring points, we gave a positive or a negative reward, and that's it; from that, it learns.

  • And you see those emerging strategies.

  • Like what I said: hitting the ball on the side and sending it back at a very steep angle is the only way of winning.

  • And it picked that up.

  • We never explained it; it's just an emerging strategy.

  • This is looking good. Close.

  • Yeah, it's looking good. It's a bit close, but it's looking good.

  • Do you think it will win?

  • Okay, next point.

  • I want a loud cheer when the AI wins, because I hope this is going to work.

  • Hey, 20!

  • Okay, one more.

  • One more.

  • Yeah.

  • Done! Good job, Yu-Han.

  • This is fantastic.

  • All right, So what was going on during game play?

  • Actually remember how this network was built?

  • Uh, right here.

  • So, the neurons in this very first layer have a connection to every pixel of the board.

  • They do a weighted sum of all the pixels on the board.

  • All right, so they have a weight for every pixel, and it's fairly easy to represent those weights on the board and see what those neurons are seeing on the board.

  • So let's try to do that right here.

  • Yep: we pick eight of those 200 neurons, and here we visualize, superimposed on the board, the weights that have been trained in. To my untrained eye, there isn't much to see in here.

  • Maybe you can enlighten us.

  • Yeah, there is something interesting, Martin.

  • We're looking at what the model sees, or rather, what the model cares about when it sees the game pixels that we gave it.

  • You see that one of the neurons apparently has learned pretty much nothing: it's still looking like the white noise of the initial random initialization, and it probably doesn't contribute much to the game play. But for all the other neurons, you see some interesting patterns.

  • There, we can see a few things.

  • The model seems to put a lot of weight on, it cares a lot about, where the opponent's paddle is and where it's moving, and it cares about the ball's trajectory across the game board. And, quite interestingly, it also cares a whole lot about where its own paddle is, on the right, because, like Martin pointed out before, at the beginning of learning the model didn't even know which paddle it was playing.

  • So it has learned that this is an important piece of information for it to be able to play the game well.

  • And when you think about it, this is really consistent with the pieces of information we human beings would rely on to be able to play the game well.

  • So we wanted to show this to you, not to show off our prowess at Pong,

  • although it worked, but mostly to explore training algorithms. You see, with most neural networks you do supervised training, and in nature, well...

  • Sometimes when we teach people, it looks like supervised training: in class, the teacher says, this is, you know, the Eiffel Tower, and the pupils say, okay, that's the Eiffel Tower, and so on.

  • But if you think about a kitten jumping on a fur ball and missing it and jumping again until it catches it, there is no teacher there.

  • It has to figure out a sequence of moves, and it gets a reward from catching the ball or not catching it.

  • So it looks like in nature there are multiple ways of training our own neural networks, and one of them is probably quite close to this reinforcement learning way of teaching.

  • Yu-Han, you built this model.

  • Are there some other thoughts it inspired in you?

  • Yeah, I have a technical insight, maybe, as a takeaway message for us.

  • In running the experiments, we see that there is some stuff that is not differentiable; particularly, in this case, sampling a move from the probability output of the network, and playing a game and getting a reward back.

  • Yes, those factors really depend on the model itself, but in a non-differentiable way.

  • And so, even with powerful tools like TensorFlow, you wouldn't be able to very naively write the loss function and run gradient descent training.

  • But there are ways to get around it, and that's exactly the technique we were showing today.

  • So what you're saying is that reinforcement learning can solve many more problems than just Pong, and it's a way of getting around some non-differentiable step that you find in your problem.

  • You could say that.

  • That's great.

  • That's great.

  • So where is this going?

  • We want to show you a couple of things from the lab because this has had mostly lab applications and then one last thing.

  • What is this?

  • This is very interesting.

  • Everyone.

  • What we're witnessing here is a human expert at pancake flipping trying to teach a robotic arm to do the same thing.

  • And there's a model in the back controlling the robotic arm, whose output controls the joint movements, or the motors in it: what angle and what speed to move toward.

  • And the goal of this is to flip a pancake. And the pancake is not just any regular pancake; rather, it is instrumented with sensors, so that it knows whether it has been flipped,

  • whether it landed on the floor, or on the table, or back in the frying pan. It doesn't seem to be working; it's trying.

  • That's right. Um, in another experiment...

  • So what is the reward here?

  • The reward probably is that if you flip the pancake correctly, then you get a positive reward, and otherwise a negative reward.

  • OK. Speaking of reward functions,

  • in another experiment, as I was going to say,

  • the experimenters set the reward to be a small amount of positive reward for every moment the pancake is not on the floor, and the machine learned.

  • But what it learned was to fling the pancake as high as possible, to maximize pancake airborne time.

  • Ah, I like that: that way you maximize the reward, just because it takes a long time before the pancake lands on the floor.

  • That's kind of funny, but actually it's a nice illustration of the fact that you can change the learned behavior by changing your loss function.

  • Exactly: the loss function, that's how you teach the model the stuff you want it to learn.

  • Yeah, cool.

  • We know how to play Pong and flip pancakes.

  • That's significant progress.

  • DeepMind also published this video: they used reinforcement learning, and they built those skeletal models here.

  • The neural network is predicting the power to send to the simulated muscles and joints of these models, and the reward is basically a positive reward whenever you manage to move forward, and a negative reward when you either move backward, or fall into a hole, or just crumble to the ground. The rest is just reinforcement learning as we have shown you today.

  • So all off these behaviors are emerging behaviors.

  • Nobody taught those models.

  • And look, you have some wonderful emerging behaviors.

  • It's coming in a couple of seconds.

  • Look at this jump after this.

  • Those are nice jumps, but there is a much nicer one.

  • In a second, you will see a really athletic jump, with the model swinging its arms to get momentum, then lifting one leg and cushioning the landing, right here.

  • Look at this.

  • This is a fantastic athletic jump; it looks like something from the Olympics, and it's a completely emerging behavior.

  • Martin, what is it with this funny way of walking around? Why don't we see people doing this?

  • You could! But well, you know, there are multiple ways of running.

  • Probably the loss function didn't have any factor discouraging, you know, useless movements.

  • Again, by modifying the loss function, you get different behaviors. And actually, one last one, not this one.

  • This one is kind of funny. Yes, it's still flailing around, but I like the look here.

  • It's figured out how to run sideways.

  • This is fantastic.

  • Yes, there are two ways of running, and it did figure out how to move sideways.

  • This one you have probably seen: this is move 78 in game four of AlphaGo versus Lee Sedol.

  • And that's the one move that Lee Sedol played

  • that was called the "God move".

  • And he's world famous for just that.

  • He played that one brilliant move and managed to win one game against AlphaGo.

  • He lost the match 4 to 1, but even that is fantastic.

  • And AlphaGo also uses reinforcement learning, not exactly in the same way as here.

  • It wasn't entirely built out of reinforcement learning, because for turn-based games, the algorithm for winning is actually quite easy.

  • You just play all the possible moves to the end and then pick the ones that lead to positive outcomes.

  • The only problem is that you can't compute that; there are far too many of them.

  • So you use what is called a value function.

  • You unroll only a couple of moves, and then you use something that looks at the board and tells you whether this is good for white or good for black.

  • And that's what they built using reinforcement learning.

  • And I find it interesting because it kind of emulates the way we humans solve this problem.

  • This is a very visual game.

  • We have a very powerful visual cortex.

  • When we look at the board, go is a game of influence.

  • And we see that in this region black has a strong influence in this region.

  • White has a strong presence, so we can kind of process that.

  • And what they built is a value function that does kind of the same thing and allows them to unroll the moves to a much shallower depth.

  • Because after just a couple of moves, their value function built using reinforcement learning tells them, This is great for white or good for black.

  • So these are results from the lab.

  • Let's try to do something real.

  • So some applications.

  • What if we build a neural network, a recurrent neural network?

  • Okay, that's a different architecture of neural network.

  • But all you have to know is that it still has weights and biases in the middle.

  • Okay.

  • And this neural network, well, recurrent neural networks are good at producing sequences.

  • So let's say we build one that produces sequences of characters, and we structure it so that those characters actually represent a neural network.

  • Yeah, a neural network is a sequence of layers, so you can figure out a syntax for saying: this is my first layer, this is how big it is, and so on.

  • So why not produce a sequence of characters that represents a neural net?

  • What if we then train this generated neural network on some problem we care about, let's say spotting airplanes in pictures? So this will train to a given accuracy.

  • What if we now take this accuracy and make it a reward in a reinforcement learning algorithm? So this accuracy becomes a reward, and we apply reinforcement learning, which allows us to modify the weights and biases in our original recurrent neural network so that it produces a better neural network architecture.

  • It's not just tuning the parameters; we're changing the shape of the network so that it works better for our problem.

  • For the problem we care about, we get a neural network that is generating a neural network for our specific problem.

  • It's called neural architecture search, and we actually published a paper on this, and I find it a very nice application of a technology designed initially to beat Pong.

  • So, Martin, you're saying we have neural networks around to build other neural networks.

  • That's right.

  • So, Yu-Han, to finish: you built this demo.

  • Can you tell us a word about the tools that you used?

  • Yeah, definitely.

  • So we used TensorFlow for the model itself, and TensorBoard for tracking the metrics during the course of training.

  • I really didn't want to run the training on my laptop, although I probably could, so I used Cloud Machine Learning Engine for the training. The model that was playing the game in the live demo we saw before took maybe about one day of training.

  • And so, with ML Engine, you have this job-based view, and you can launch 20 jobs with different parameters and just let them run.

  • And it's just practicality, right?

  • It will tear down the whole cluster when every job is done.

  • And so on. That's right.

  • Worry-free.

  • Yeah, I used that managed service a lot for that as well.

  • There are many other tools in Cloud for doing machine learning; one we just launched is AutoML Vision.

  • That's one where you do not program: you just put in your labeled data and it figures out the model for you.

  • And now you know how it does that. And also, this uses lots of CPU and GPU cycles.

  • So Cloud TPUs are useful when you're doing neural architecture search, and they are available to you as well.

  • So thank you.

  • That's all we wanted to show you today.

  • Please give us feedback.

  • We just released this code to GitHub, so you have the GitHub URL here.

  • If you want to train a Pong agent yourself, go and do it.

  • You can take a picture now, if you want; the URL is still on the screen.

  • If you want to learn machine learning, I'm not going to say it's easy, but I'm not going to say it's impossible, either.

  • We have this series, a relatively short series of videos, code samples, and codelabs, called TensorFlow Without a PhD, that is designed to give you the keys to the machine learning kingdom.

  • So you go through those videos (this talk is one of them), and it gives you all the vocabulary and all the concepts, and we are trying to explain the concepts in a language that developers understand, because we are developers.

  • Thank you very much.
