  • What's going on, everybody, and welcome to part six of the reinforcement learning tutorial series, as well as part two of the deep Q-learning (DQN, deep Q-networks) tutorials.

  • Where we left off, we're basically ready to add the train method to our agent, then incorporate our agent into the environment and start that iterative training process.

  • So I've switched machines.

  • I'm on a Paperspace Ubuntu machine now, and I've pulled up the exact versions that I'm using.

  • Also, people ask a lot why I'm on Paperspace.

  • It's just nice virtual machines in the cloud, kind of a cloud desktop, but also high-end GPUs.

  • It's great for doing machine learning in the cloud if you want.

  • I'll put a link in the description, a referral link.

  • It's like a $10 credit.

  • You can definitely check them out.

  • It wasn't really meant to be a sponsor spot; I really do use Paperspace, I use the heck out of it.

  • Anyway, I've got plenty of high-end GPUs locally, it's just very convenient to use Paperspace.

  • But anyway, I've pulled up the exact versions of things I'm using.

  • TensorFlow 2.0 is on the way, it's just not quite ready.

  • So anyway, still using TensorFlow 1.

  • These are the exact versions; you can feel free to follow along on a different version.

  • It's just that if something goes wrong and you want to match my exact versions of things, you can do this.

  • In case you didn't know, to install an exact version you would do something like pip3 install tensorflow-gpu==1.13.1, for example.

  • That's how you install an exact version of something.

  • Okay, uh, let's go ahead and continue.

  • So I am in... actually, I think I've got it up already.

  • Yeah.

  • Cool.

  • Um, thanks, Harrison.

  • Any time, bro.

  • So what we're gonna do is come down to our agent class, the DQNAgent.

  • And I'm just gonna go to the very bottom here.

  • I'm gonna make a bunch of space and come up here.

  • And now what we're gonna do is just add the new train method, so def train, and then we're gonna pass here self, terminal_state, and then step.

  • Okay, So the first thing we want to do is do a quick check to see should we actually train?

  • So recall we're gonna have this replay memory, and then from the replay memory, which should be quite a large memory, we're gonna grab a mini batch that's relatively small, but still a batch size of decent size.

  • So typically with neural networks we like to train in batches of, like, 32 or 64, something like that.

  • Um, so we're gonna do the same thing here, so we want 32 or 64 to be pretty small compared to the size of our memory.

  • So in this case, I want to say our max memory size is 50,000, and then the least amount that we're willing to train on is 10,000.

  • And the reason why we want to do that is we don't want to wind up overfitting.

  • So we want to use this replay memory principle, I guess.

  • But what we don't want to do is have a replay memory so small that every step is trained on the exact same thing.

  • So we'd effectively be doing, like, you know, hundreds of epochs on the exact same data.

  • That's not what we want.

  • So anyway, what we're going to say is: if len of self.replay_memory is less than the minimum replay memory size.

  • If that's the case, we'll just return.

  • We don't actually want to do anything here.

  • Otherwise, if it's enough, then let's go ahead and get our minibatch, which will be a random.sample of self.replay_memory, and the sample size that we want is the minibatch size.

  • We need to do both: import random and set the minibatch size, so I'm gonna go to the tippy top here.

  • I'm going to import random, and then I'm gonna come down here.

  • Minibatch size, we'll set that to 64, and then I'll go back to the bottom here.
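
Just to make that concrete, here's a rough sketch of what this first part of the train method looks like (the constant names and values follow what was just described above, so treat the exact naming as an assumption):

    import random

    MIN_REPLAY_MEMORY_SIZE = 10_000  # minimum transitions stored before we train at all
    MINIBATCH_SIZE = 64              # how many transitions we sample per training call

    # inside the DQNAgent class
    def train(self, terminal_state, step):
        # if the replay memory is still too small, training would just keep fitting
        # the same handful of samples (overfitting), so do nothing
        if len(self.replay_memory) < MIN_REPLAY_MEMORY_SIZE:
            return

        # grab a random minibatch of transitions from the (much larger) replay memory
        minibatch = random.sample(self.replay_memory, MINIBATCH_SIZE)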

  • Awesome.

  • So now that we've done that, we want to get our Q values.

  • So don't forget as well, one of the things... uh, let's go here.

  • pythonprogramming.net.

  • Let me type DQN.

  • We're actually gonna use some stuff from the other one, too, so I'll just pull it up.

  • But recall the following image.

  • So we actually want to get... so this bit is kind of handled elsewhere; both the learning rate and some of this logic are handled by the neural network.

  • But we still need the reward, the discount, and that future value.

  • So we still want to use this little bit.

  • So to do that, we need to know current Q values and future Q values.

  • So, uh, minibatch.

  • Okay, so what we're gonna say here is current_states is equal to the numpy array.

  • And there's really a list comprehension here.

  • We're gonna say transition[0] for transition in minibatch.

  • Um, okay, cool.

  • And then what we want to do here is normalize it.

  • So again, if you don't know why we're normalizing, check out the basics tutorials.

  • But basically, it's just images.

  • So any time you have RGB images, it's like 0 to 255... and actually, normalize is the wrong word.

  • I've probably been saying that wrong the whole time anyway; what we're actually trying to do is scale it, because the images are pretty much normalized already, so instead we're really trying to scale them between zero and one.

  • That's just the best way for convolutional neural networks to learn on image data.

  • It's just useful, really.

  • It's the best for any machine learning.

  • You generally want to scale between zero and one, or negative one and one, anyways.

  • Enough on that.

  • So current states; then what we want to grab is the current Q values list, and that is gonna be equal to self.model.predict. Now pay attention: that's self.model.

  • So it's that model, the crazy one that changes every step, self.model.predict, and we want to predict on current states.

  • And then we want to do the same thing with the future values.

  • So we're going to say new current states.

  • This is after we take a step; it's going to be equal to the numpy array.

  • And again we're gonna say transition... um, three, I want to say?

  • Is that right?

  • Yes.

  • Transition[3] for transition in our minibatch, and again we want to divide by 255, and I'll explain these index values in a second if you don't understand those or you forgot; it should be up here somewhere, or actually I think we're gonna have to define that in our environment.

  • Anyway, I'll explain that in a moment.

  • Um, yeah.

  • So, coming back down here.

  • Okay, so now what we're gonna say is future Q values.

  • I'll explain it.

  • Basically, bottom line, the future Q values list is equal to self.target_model.

  • So now we're using that target model, the one that doesn't change so crazily, and we .predict against the new current states.
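
As a sketch, those two prediction steps look roughly like this (each transition stored in replay memory is a tuple of (current state, action, reward, new state, done), which is explained just below):

    import numpy as np

    # still inside train(): current states live at index 0 of each transition;
    # divide by 255 to scale the RGB image data into the 0-1 range
    current_states = np.array([transition[0] for transition in minibatch]) / 255
    current_qs_list = self.model.predict(current_states)            # the "crazy", every-step model

    # the states after taking the action live at index 3 of each transition
    new_current_states = np.array([transition[3] for transition in minibatch]) / 255
    future_qs_list = self.target_model.predict(new_current_states)  # the slowly-updated target model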

  • Now we're gonna separate things out into our typical X's and y's.

  • So these will be our feature sets.

  • These will be our labels, or targets; so the X's will be, like, literally the images from the game.

  • And then this will be the action that we decide to take.

  • Um, which... that's still gonna be in the environment.

  • We'll get there anyway.

  • It'll be like up, down, left, right, diagonals, and all that stuff.

  • Okay, so now what we're gonna say is for index, and then we're gonna have this giant tuple.

  • And this is what's contained in this minibatch of things.

  • So you've got the current state.

  • You've got the action.

  • The reward, the new current state (don't put the "s" there), and then whether or not we're done with the environment.

  • So that's exactly where these indexes are coming from.

  • So for current states, we're grabbing the current state from the minibatch.

  • And then down here, we're grabbing that new current state: 0, 1, 2, 3, right.

  • We're grabbing that from the minibatch, and then we're predicting on it based on the state itself.

  • But then with this information, what we can do is calculate that last step of our formula from before, right?

  • We can calculate this little bit here to create our Q values.

  • So, for index, current state... done; apparently my mouse has decided to stop working.

  • Ah, what do we want to do?

  • The first thing to say is, if not done, then we still want to actually perform our operations.

  • So what we're gonna say is, max future Q is equal to the np.max of the future Q values list for whatever index we're actually at right now.

  • And then we're gonna say the new Q is equal to the reward plus the discount (DISCOUNT, which I don't think we have yet, since it didn't try to autocomplete) times max future Q.

  • Else, we're just going to say new Q equals reward.

  • So if we are done, then we need to set the new Q to be whatever the reward is at the time, because there is no future Q.

  • Right?

  • We're done.

  • Okay, so now we need to go up to the top and set our discount.

  • Okay, so going up to the top here, let's set, uh, DISCOUNT equals 0.99. Then we'll go back to the bottom here, and now what we want to do is actually update these Q values.

  • So recall, this is how our neural network's gonna work.

  • You've got input, then you've got the output, which is your Q values.

  • So let's imagine the scenario where we got this output; the action that we would take, given no epsilon, let's say would be this one, right.

  • 0, 1... at index one.

  • We take this action, because the Q value is 9.5.

  • It was the largest Q value.

  • But then let's say we don't like that action.

  • We ended up actually degrading it a little bit.

  • And we actually want that new Q value to be 8.5.

  • Well, to update that... for a neural network, like, in our table we just updated it, right?

  • You just modify that one value.

  • But in the neural network, we output these four values.

  • So what we end up having to do is we update this 9.5 to 8.5, like, in a list.

  • And then we re-fit the neural network to instead be 3.2, 8.5, 7.2, 1.3.

  • So that's the next thing that we need to do.

  • So coming down here, what we want to say, still in our for loop, is current Qs equals the current Q values list at the index that we're at as we're iterating... and I miss Sublime Text.

  • Uh, we're iterating over... yes, these two values: index, and then you should be able to guess what I meant to have here.

  • And that was an enumerate, um, and also... whew.

  • Maybe today's not a day for tutorials for me.

  • Jeez. Um, and then minibatch.

  • Probably going off of the screen here; see if I can help out here.

  • There we go.

  • So: for index, (current state, action, reward, new current state, done) in enumerate(minibatch).

  • I guess I got stuck there because I was explaining these and then pointing out the transitions.

  • Anyway, now it's a valid loop.

  • So, if not done... if anybody knows whether Atom just has a built-in kind of syntax checker, let me know.

  • Post a comment below.

  • Because this is what comes on the Paperspace machines.

  • And I just can't be asked to always be throwing Sublime Text on there anyway.

  • Current Qs.

  • Okay, so now we have our current Qs.

  • We fixed our enumerate.

  • Um, now what we want to say is: the current Qs for the action that we took is now equal to that new Q value.

  • Once we have that, we're ready to append to our features and our labels.

  • So X.append, um, so that would be the current state.

  • And then y.append current Qs; so again: features, labels.

  • This is the image that we have, and then the current Qs are the Q values, so cool.

  • So, just in case that's not getting through: X's and y's. All right.
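
Putting that whole loop together, it looks roughly like this (a sketch; DISCOUNT is the 0.99 constant set earlier):

    import numpy as np

    DISCOUNT = 0.99

    X = []  # feature sets: the images (current states) from the game
    y = []  # labels: the full list of Q values, with the taken action's Q updated

    # still inside train()
    for index, (current_state, action, reward, new_current_state, done) in enumerate(minibatch):
        if not done:
            # not a terminal state: reward now plus discounted best Q of the next state
            max_future_q = np.max(future_qs_list[index])
            new_q = reward + DISCOUNT * max_future_q
        else:
            # terminal state: there is no future Q, so the new Q is just the reward
            new_q = reward

        # take the predicted Q values for this state and overwrite only the Q value
        # for the action we actually took, like updating a single cell in a Q-table
        current_qs = current_qs_list[index]
        current_qs[action] = new_q

        X.append(current_state)
        y.append(current_qs)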

  • So, uh, now that we've got all that, we can pop out of this for loop, and then we're going to say self.model.fit.

  • And we've got another really long line here: np.array of our X values divided by 255, and a numpy array of y.

  • And then our batch size will be equal to exactly what our minibatch size was.

  • Um, and then I think this should be valid.

  • I would never normally do this, but I'm just running off the screen.

  • I think that will still run because we're inside the parameters here. Um, and then we're gonna set verbose to zero.

  • Then we're going to... let me make this a line break as well.

  • Then we're going to say shuffle is false, because we've already grabbed a random sampling.

  • There's no benefit to shuffling again.

  • Callbacks will be equal to that custom callback that we wrote.

  • So self.tensorboard.

  • Um, and then we're gonna pass that callback if terminal state... well, actually, if terminal_state, else None.

  • So we will log with that callback if we're on our terminal state.

  • Otherwise, we pass nothing.

  • So now what we're going to do is come down, still in line with what we were just writing there.

  • And we're going to say if terminal_state: if that's the case, we're going to self.target_update_counter += 1.

  • So if we've made it to the end here... what we're trying to do here is determine if it's time to update our target model yet, so the next thing we'll just throw in right underneath this.

  • So, um, adding a comment: determine if we want to update the target model yet.

  • Okay, so we'll keep adding to that.

  • And then: if self.target_update_counter, if that is greater than UPDATE_TARGET_EVERY.

  • So this is yet another, uh, constant that we've got to define. If we've hit that number, or actually if we've exceeded that number, what we want to do is self.target_model.set_weights.

  • And we just want to set the weights to self.model.get_weights.

  • So we just copy over the weights from our main model, and then self.target_update_counter needs to be reset, and we set that to zero.

  • So now let's go ahead and set UPDATE_TARGET_EVERY.

  • And, um, I'm trying to look and see what I have... I think it was a five.

  • Yeah.

  • Five.

  • So we'll go up and we'll set that to five.
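
Roughly, the fit call and the target-model update look like this (a sketch, following the pieces just described; self.tensorboard is the custom callback written in the previous part):

    import numpy as np

    UPDATE_TARGET_EVERY = 5  # terminal states between copies of the weights to the target model

    # still inside train(): fit on the whole minibatch at once, scaling images to 0-1,
    # and only fire the TensorBoard callback when we're on a terminal state
    self.model.fit(np.array(X) / 255, np.array(y),
                   batch_size=MINIBATCH_SIZE, verbose=0, shuffle=False,
                   callbacks=[self.tensorboard] if terminal_state else None)

    # count finished episodes toward the target-model update schedule
    if terminal_state:
        self.target_update_counter += 1

    # once we've exceeded UPDATE_TARGET_EVERY, copy the weights from the
    # frequently-trained model into the target model and reset the counter
    if self.target_update_counter > UPDATE_TARGET_EVERY:
        self.target_model.set_weights(self.model.get_weights())
        self.target_update_counter = 0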

  • Okay, once we have that, um, now we have to bring in our environment and our blob class.

  • So those are two things I'm just gonna copy and paste in.

  • I don't think there's any benefit, but we'll run through them just in case.

  • But I'm gonna come down.

  • Let's see... whoops.

  • That's our modified TensorBoard.

  • We already wrote that... I'm in the wrong spot, I assume.

  • I saw ModifiedTensorBoard.

  • I was like, oh, I'm just gonna go to the bottom, I guess.

  • And then, let's see, where is... so that's our agent class that we've already written.

  • ModifiedTensorBoard.

  • Yes.

  • So this we're gonna grab; I think we'll just grab all the way to here, and we'll just paste that above, and I'll talk about that here in a moment.

  • So I'm grabbing all of BlobEnv and the code that comes right after that, so basically everything from BlobEnv to ModifiedTensorBoard.

  • We're gonna grab that.

  • Come over here and paste it.

  • Okay, so we have a blob environment.

  • We're saying the size is 10; really nothing too fancy here.

  • I don't really think anything's been updated there.

  • The reset... like, all these kinds of methods here; everything's been kind of turned into a method.

  • If you still wanted to move the food and enemy in this case, we also have to... I think we need that to be self.enemy, self.food.

  • I'll just throw that in there before I forget: that's self.enemy, self.food.

  • Um, but we're not moving them for the basic one anyways, so this is all good: whether or not we want to render it, and then get_image.

  • This is just so we can actually pull an exact image from our environment, because we're going to use the image as input, as opposed to before, where we were just using the image to display it and look at it.

  • Now the image is actually the input values.

  • So we're not doing the delta to the food or to the enemy.

  • But you could; you really could.

  • It's just that I couldn't really decide which way I wanted to do it for this tutorial.

  • Actually, it's easier to learn that; in this case, this model would perform way better if we were doing that.

  • But the problem is, it's not really what we're using DQNs for, uh, most of the time, so I decided not to do that.

  • But you can.

  • And maybe if you want to, um, you should try that: change it, like, make the input be those deltas as the observation. See if you can do that.

  • Um, that should be a pretty easy change if you understand what's going on anyway.

  • So that's our BlobEnv, then.

  • We're just kind of defining the blobs.

  • And here we're setting some initial things, and then we're setting the random seed to one, uh, just so, as we do this more and more,

  • hopefully our results will be comparable to each other when we change hyperparameters and stuff.

  • Um, if you wanted to train multiple models on the same machine, you can use this code as well.

  • Uh, then finally we're gonna make this directory called models if we need it.

  • So, just a bunch of helper stuff.

  • Really nothing specific to DQNs.

  • So I just didn't really see any benefit for us to write out BlobEnv by hand.

  • You're welcome.

  • So, uh, the other thing would be... let me check: we've got BlobEnv, we've got ModifiedTensorBoard, we've got the DQNAgent.

  • So the other thing we don't have is the actual Blob class, so I'm also gonna take that.

  • I will take that.

  • Copy that.

  • I'm gonna throw that down here again.

  • We've already kind of written it up, so I'm actually gonna... cut that.

  • I mean... oh, it should work.

  • I just tried to copy it from my main computer... uh, here.

  • So this will take our Blob class, and then we'll check some of these other values, but we still have some code to write at the very end here.

  • Paste.

  • So this is our blob class again.

  • We already had a blob class, so I didn't really see any point going over this again.

  • Um, the only difference here is we have this new __eq__.

  • So, uh, we're just doing another operator overload; this is how we can check to see if two blobs are on top of each other now.

  • So, cool.
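
That __eq__ overload is just a coordinate comparison, roughly like this (a sketch of only the relevant part of the Blob class, which starts each blob at a random spot on the grid):

    import numpy as np

    class Blob:
        def __init__(self, size):
            self.size = size
            self.x = np.random.randint(0, size)
            self.y = np.random.randint(0, size)

        def __eq__(self, other):
            # two blobs are "on top of each other" when their grid coordinates match
            return self.x == other.x and self.y == other.y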

  • And then here I've added all nine movement possibilities.

  • So not only the diagonals, but also don't move, and then move up, down, left, and right.

  • Again, not complex code.

  • There's no reason to belabor that.

  • Um, here we do movement.

  • Um, this keeps it in bounds.

  • None of this other stuff has really changed.

  • Nothing DQN-specific that's really worth laboring over.

  • Now, uh, the last thing that we need to do is actually write our iteration and all that.

  • So what I'm gonna do is come to the bottom here... shoot.

  • I can't remember if we actually created our agent or not.

  • So, DQN... it doesn't look like we did, so come to the very bottom here, and I'm going to say agent equals DQNAgent.

  • So we've got our agent now.

  • Um, and then also, I believe we've already got our environment created.

  • It should be, like, right after the environment class, if I recall right.

  • Just gotta check.

  • Yeah.

  • So we've got an environment.

  • We've got our agent.

  • Now we're ready to start iterating over things.

  • And in fact, I think maybe even before we do that, let's make sure we have all of these.

  • And I kinda... I think I'm just gonna copy-paste these as well.

  • Um, and then we'll talk about those.

  • I just want to make sure I'm not missing something, so in fact, I'll just do this.

  • So: replay memory size 50,000, the 64, the 5, the model name, discount.

  • Cool.

  • So I'm gonna get rid of that.

  • And then let's see if we've got anything new here.

  • Memory fraction?

  • We're actually not using that at the moment.

  • But that has to do with that code that was commented out.

  • 20,000 episodes.

  • That should be fine.

  • Let's see here, epsilon.

  • We start at one.

  • That's fine.

  • This is a decent decay for 20,000 steps.

  • Minimum epsilon is fine.

  • This is for our stats aggregation; show preview.

  • False.

  • We can set that to True if we want to actually see the visuals of everything running.

  • Okay, so coming back down to the bottom here... that's a lot of scrolling; I need, like, an end-of-page key.

  • But, um, now we're ready to actually iterate over everything.

  • So what we're gonna say now is: for episode in tqdm of the range of 1 to however many episodes we decided we wanted, which was 20,000, plus one.

  • And then we're gonna say ascii equals True.

  • This is for our Windows fellows.

  • Our unit is going to be "episodes". Cool.

  • So now what we're going to say is agent.tensorboard.step is going to be equal to whatever episode we're on.

  • Then we're going to set the starting episode reward.

  • I can't type... episode_reward, we'll set that to be zero.

  • We're going to say we're currently on step number one; current state is going to be equal to env.reset.

  • So we're kind of matching the syntax of... hope that wasn't off the screen.

  • Um, we're kind of matching the syntax from, um, OpenAI Gym.

  • Let me make more space here before that happens again.

  • Hopefully that wasn't off screen.

  • I'll go back and check anyway.

  • It wasn't really too complex code, at least.

  • Uh, okay, current state, we set that to the reset; we're going to say done is equal to False, and then we're gonna iterate over all this.

  • So, while we're not done, the next thing is: if np.random.random is greater than whatever epsilon is.

  • So this is very similar to everything we've done before, so we're just going to say action equals np.argmax of agent.get_qs for whatever our current state is.

  • Um, otherwise, so else: action equals np.random.randint between zero and env.ACTION_SPACE_SIZE, in all caps.

  • Okay, so once we've done that, we're gonna continue on our theme of similarly syntaxed things; we're gonna say new state, reward, and whether or not we are done is equal to env.step(action).

  • And then we're going to say... I really just wonder if I screwed up typing episode somewhere else... anyway, episode reward plus equals whatever the reward is.

  • Whatever the reward is, and then afterwards, if SHOW_PREVIEW.

  • So if that is the case, and not episode modulo AGGREGATE_STATS_EVERY, we'll do env.render, which shows the environment, if we want that.

  • Then, every step, we have to add that to the replay memory.

  • So we're going to say agent.update_replay_memory.

  • And we're going to update that with all of those... what, five things.

  • Uh, current state.

  • We've got action.

  • Reward, new state, and whether or not we're done there.

  • So then agent.train with done and then step, and then current state equals new state, and step plus equals one.
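
Here's roughly what that whole episode loop looks like once it's assembled (a sketch; constants like EPISODES, SHOW_PREVIEW, AGGREGATE_STATS_EVERY, epsilon, and env.ACTION_SPACE_SIZE are the ones discussed above):

    import numpy as np
    from tqdm import tqdm

    for episode in tqdm(range(1, EPISODES + 1), ascii=True, unit='episodes'):
        # keep the custom TensorBoard callback's step in sync with the episode number
        agent.tensorboard.step = episode

        episode_reward = 0
        step = 1
        current_state = env.reset()   # matching the OpenAI Gym style syntax

        done = False
        while not done:
            # epsilon-greedy: exploit the network's Q values most of the time,
            # otherwise take a random action to keep exploring
            if np.random.random() > epsilon:
                action = np.argmax(agent.get_qs(current_state))
            else:
                action = np.random.randint(0, env.ACTION_SPACE_SIZE)

            new_state, reward, done = env.step(action)
            episode_reward += reward

            # optionally render the environment so we can watch it play
            if SHOW_PREVIEW and not episode % AGGREGATE_STATS_EVERY:
                env.render()

            # every single step: store the transition and train the network
            agent.update_replay_memory((current_state, action, reward, new_state, done))
            agent.train(done, step)

            current_state = new_state
            step += 1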

  • Okay, so now, yeah, I think now we're gonna copy the rest.

  • There's really nothing... again, nothing DQN-related.

  • And this is also code that we've already written.

  • Um, I'm just trying to keep this short and applicable to what we're hopefully trying to learn.

  • So, uh, this is where we are now, and then all we want to do is add the reward to the episode rewards.

  • We are going to aggregate various stats, so we're gonna grab the average, the minimum, and the maximum reward, and then with those things we're actually gonna throw those into TensorBoard rather than creating our own,

  • um, you know, matplotlib chart.

  • But we could use matplotlib.

  • That's totally fine.

  • So, like, if you're following along and converting this to, like, I don't know, some other machine learning framework... first of all, how dare you?

  • But, uh, you could use matplotlib.

  • And then here, min reward.

  • Basically, what we're trying to do is find a good reason to save a model.

  • So if a model scored really well based on... right now, we're just keying in on one thing.

  • But you could change it, or you could make it based on multiple things.

  • So first, let me just copy this over now.

  • So, I'll copy that.

  • I'm gonna come over here, where we just were, paste, and cool.

  • So we add that. And then, so here, min reward: if we scroll up to the top,

  • that was one of the things that we just kind of threw in.

  • And I think, yes, it was negative 200.

  • So this is, like... if the model hits an enemy, it's a negative 300, and it's taking steps.

  • So if the model gets a negative 200, that mostly just means it didn't get to the food, but it also didn't hit an enemy.

  • So that's still, like, somewhat positive.

  • Let's say.

  • Uh, so if the worst episode still never hit an enemy, we're saying, hey, that's a good thing.

  • But you could also, like... you could take min reward and set it to negative 100 instead; or you could say, rather than min reward being greater than or equal to that, we could actually use average reward.

  • So if the average reward is greater than negative 100, or zero, then let's save that model.

  • So then we save that model.

  • Okay, um, otherwise we're just decaying epsilon.

  • That's the same as what we did before.
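
A sketch of that end-of-episode bookkeeping (assuming the custom ModifiedTensorBoard callback from the earlier part exposes an update_stats helper, that the episode rewards list is called ep_rewards, and that MODEL_NAME, MIN_REWARD, MIN_EPSILON, and EPSILON_DECAY are the constants discussed above):

    import time

    ep_rewards.append(episode_reward)

    # every AGGREGATE_STATS_EVERY episodes, compute stats over the recent window
    if not episode % AGGREGATE_STATS_EVERY or episode == 1:
        recent = ep_rewards[-AGGREGATE_STATS_EVERY:]
        average_reward = sum(recent) / len(recent)
        min_reward, max_reward = min(recent), max(recent)

        # log to TensorBoard instead of building our own matplotlib chart
        agent.tensorboard.update_stats(reward_avg=average_reward, reward_min=min_reward,
                                       reward_max=max_reward, epsilon=epsilon)

        # save the model when even the worst recent episode cleared the threshold;
        # you could just as easily key this on average_reward instead
        if min_reward >= MIN_REWARD:
            agent.model.save(f'models/{MODEL_NAME}__{min_reward:.2f}min__{int(time.time())}.model')

    # decay epsilon each episode, but never let it drop below the minimum
    if epsilon > MIN_EPSILON:
        epsilon *= EPSILON_DECAY
        epsilon = max(MIN_EPSILON, epsilon)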

  • Okay, so we've covered a lot of things.

  • Um, I did copy and paste some things, and I know someone's gonna be like, "you made fun of people for copying and pasting."

  • What I meant before was that people were copying and pasting from each other; in this case, the things I copied and pasted were the stuff that we've already written, or the things that don't matter.

  • So, um, feel free to let me know if you like that or dislike that; it would have taken probably two more videos for us to rewrite the environment.

  • Um, and to do... what was the other thing we copied and pasted?

  • Well, also the TensorBoard, BlobEnv, and we really only made a couple of changes.

  • That would've been, like, all those stupid elifs... like, come on, you're welcome.

  • So, anyways, let me know if you like that or dislike that, but I think that made the most sense.

  • So now we're ready to actually train.

  • So what I'm gonna do is I'm gonna start training, see if we hit errors, try to debug through those errors.

  • I'm sure I made a lot of typos. And then, um, I'll kind of just, like, pause the recording, and then I'll show you training through.

  • And then we can kind of look at how the model is doing and stuff like that and talk about it.

  • And, you know, we'll really have a good time together.

  • So now what we're gonna do is, first, let's open a terminal.

  • Let's go ahead and run the DQN part two .py file.

  • No, no, there's no way... we don't have... of course, the one thing... so we forgot TensorFlow. Great.

  • And I think what I'm gonna do, actually, because we're probably missing a few of the other imports,

  • is I'm actually just gonna copy the imports from here.

  • Um, I'm sure we're missing a few, so I'm gonna copy that.

  • Up here... uh, deep reinforcement learning is really complicated, guys.

  • There are a lot of moving parts, in case you didn't know.

  • Okay, so, um, yeah, I'm trying to think if there's anything here that might be confusing... I've explained all that.

  • So, cool.

  • So now let's go back to where I was.

  • I'm gonna close that because I don't want it getting in my way anymore.

  • And let's try and rerun part two yet again.

  • Okay?

  • I don't quite understand.

  • Activation.

  • Linear.

  • Oh, it's a typo: activation.

  • Okay, um, so that was... ah, it was a long time ago, guys.

  • I know you might watch these videos back to back, or think it was, like, one day apart, but it's been almost a week, probably, since I filmed that other video.

  • I can't find it... come on, let's find it.

  • Let's go.

  • Okay.

  • Activation, activation.

  • What am I missing here?

  • Activation, activation.

  • Linear.

  • Okay, cool.

  • Let's try again.

  • I wonder how many errors we're actually gonna hit here.

  • Step... action equals np.argmax, agent.get_qs, current state.

  • Um, did we actually pass... that's totally possible... current state to get_qs?

  • Um, let's see if that's actually get_qs... what are we missing?

  • Um, I don't think that's true.

  • get_qs.

  • self, state.

  • We shouldn't... you shouldn't need a step there; if I threw that in, it's a mistake.

  • get_qs, state, right. Okay, because that's just a predict.

  • So I probably threw in step.

  • I was maybe thinking of train, or looking at train.

  • Maybe.

  • Um, anyway, all we're trying to do there is make a prediction, and to make a prediction all we need is a state; there would be no need for step there.

  • So let's try again.

  • What are we on, four? Five?

  • That's a lot of errors.

  • Oh my gosh... DQNAgent has no attribute model_predict.

  • I don't see it.

  • Typo there: self.model... I think it's probably self.model.predict, that would be my guess.

  • Um, self.model_predict.

  • Let's find where we wrote that.

  • Okay, get_qs.

  • I'm guessing that's it.

  • Let me just check.

  • Let's see.

  • def get_qs: self.model...

  • Wow.

  • Even in the... that can't be right.

  • I'm just gonna change this.

  • But actually, that's so weird.

  • In the text-based version, it even says that: self.model_predict.

  • That's got to be... there's gotta be a mistake.

  • I mean, I just can't imagine that would be correct.

  • Um, who knows; I'm not certain.

  • I'm getting pretty tired.

  • This stuff is so dense.

  • Let's try that again.

  • Let me go back to the end of this... there's no way... please just start training.

  • I beg you.

  • self.model.predict... so it was... it wasn't get_qs, it was model_predict.

  • Yeah, it wasn't get_qs.

  • Interesting.

  • Okay, so it's actually training.

  • Thank the Lord.

  • Um, yeah, interesting... self... that's so funny.

  • I wonder how that made it in.

  • Maybe it got copied up to this, I guess, because I just typo'd it in the previous tutorial and then I copied it to here. I'm surprised nobody picked that up in the previous tutorial, to be honest; that one's already been released for the current channel members, about 100 people.

  • So I'm surprised nobody noticed it and was like, what's model_predict?

  • But I guess maybe they thought that was coming.

  • I don't know. Anyway.

  • Okay, cool.

  • So while that's training, shout out to (speaking of channel members) some of my newly upgraded channel members at different months... I'm clicking the wrong thing.

  • There we go.

  • Fuba44 and Jonas Gessler, both of you guys are working on month number 12, which is crazy.

  • Frank Lloyd Jr.

  • Three months. IATI, five months; Krystle, four months; and Kessie Ueki, seven months.

  • Thank you all very much for your support.

  • It allows me to do things like this that take me so freaking long, when the ad revenue will never pay for this video.

  • So I really, really appreciate all of your support, because this stuff is super cool, but I could not do it if I didn't have some sort of outside support.

  • Because advertisers don't care about these videos.

  • Not enough, mostly because not enough people care about these videos.

  • It's very difficult as a niche channel; like, programming is already a niche.

  • Nobody wants to watch programming; pretty much every one of my viewers is, like, super unique.

  • Like, you guys are watching this, I assume, most of you, for fun.

  • So you're already a pretty rare individual.

  • So anyways, it's hard out there for a sentdex.

  • So what I'm gonna do is I'm gonna let this keep training.

  • I'm gonna let it... I think we'll let it go through 20,000 steps, and we'll see where we are.

  • I might run it again.

  • So sometimes, as the model learns and does, like, a cycle through epsilon, it's useful to do it again.

  • So reset epsilon back and let it decay again.

  • And I've found that to be true also with learning rate.

  • So just with any neural network.

  • So even if you could let it continue training, or decay it slower.

  • So, like, let's say you did 20,000 steps, you decayed the learning rate, um, at a certain rate, and then you did it again for another 20,000 steps.

  • Now you've got 40,000 total.

  • Well, you could have decayed at a certain rate that would have gotten you to the same ending learning rate in one pass.

  • Often it's actually better to cycle it.

  • So do 20,000, reset, 20,000 again.

  • It's better to do that than to do one kind of pass through.

  • It kind of helps get you out of these local traps.

  • So I'm finding the same thing to be true for epsilon.

  • So rather than, let's say, your end goal is 100,000 steps, you could have a decay such that epsilon decays smoothly all the way through to 100,000 steps.

  • So it ends at a pretty low, you know, let's say less than 1% epsilon, but doesn't get there quickly; it gets there, you know, right on time at 100,000 steps.

  • I find it's actually better, or so far I've found it's actually better, to instead do that as, say, 20,000 steps times five cycles; you'll end up with higher accuracies, higher rewards, that kind of thing.
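
To make that concrete, a cyclical schedule is just resetting epsilon between runs instead of decaying it once over the whole budget; something like this (purely illustrative numbers):

    MIN_EPSILON = 0.001
    EPSILON_DECAY = 0.99975   # tuned so epsilon bottoms out near the end of each cycle
    CYCLES = 5
    STEPS_PER_CYCLE = 20_000

    for cycle in range(CYCLES):
        epsilon = 1.0   # reset exploration at the start of every cycle
        for step in range(STEPS_PER_CYCLE):
            # ... pick an action, step the environment, train, as in the main loop ...
            epsilon = max(MIN_EPSILON, epsilon * EPSILON_DECAY)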

  • So anyway, kind of cool.

  • So maybe I'll get to show you guys that here; we'll see. So, questions, whatever.

  • If you're feeling fuzzy, it's normal.

  • Ask questions: either post them in the comments, or come join us at discord.gg/sentdex.

  • Otherwise, I'll see you guys... for you, in a few moments; for me, probably, like, four hours.

  • So see you then.

  • OK, a moment or two has transpired since the recording, but I didn't want to make this a separate video.

  • So here are the results of a 2x256 convnet.

  • And basically what I did was, like, three passes, so this would be, like, our first pass through the data, uh, and you can see the average started at negative 180 and kind of made its way up to negative 73.

  • And then basically, I loaded where that model left off.

  • Ran it again.

  • So I just kind of re-decayed the epsilon again.

  • So there we go, this time getting to a peak of about negative 50.

  • Uh, and then I ran it again, and this time got it to an average of about... what's this one... negative 37.58.

  • Uh, and then we could try to keep continuing to do that.

  • It's just taking a lot of time to train these, and this isn't necessarily, like, the coolest model in the world.

  • OK, it's not the coolest environment, so I don't really want to spend too much time going over this.

  • I've already taken quite a while to do this series.

  • Like, you guys don't understand: just doing the last two videos has taken well over a week, probably two weeks' worth of work.

  • I mean, it's just taking a lot.

  • So, um, anyway, s Oh, stop it here.

  • And we're gonna move on to another R l algorithm.

  • So But what you can see it's actually pretty cool.

  • Some of some of these stats, like you can just see as you continue training, everything got better.

  • Like you can see on the first past, even the minimum everything.

  • Except for pretty much word, Max.

  • Like, pretty quickly, it starts to max out.

  • But here you can kind of see the minimum even improves, which is pretty good, because it, um I guess I guess we're still not above negative 300.

  • Like if we could get men to be above negative 300 that we mean, we never hit an enemy, which would be really cool anyway.

  • Cool.

  • So that's one.

  • I have another tensor board, and this is a different model.

  • And what this one is is with many enemies as well as movement is turned on.

  • So I had 10 enemies, one food, one player and I allowed movement to be on for all of the enemies.

  • And this was the training.

  • I was really hoping to see something a little better than this mostly because over here I was kind of expecting that we would even we would be better than this s.

  • So then I started to wonder.

  • Well, you know, part of the problem is, if you were to really look at this image, most of the images, like black and or just nothingness.

  • So it's like, not useful in any way to the model.

  • So then I started to think maybe that's the problem.

  • What if I had a lot of enemies then we'd have, like, mortar work with per each convolution that happens in each window.

  • Ah, but that didn't really solve the problem either.

  • So on, if anything, it just made it harder.

  • So you see, here.

  • It really never did any better than, like, probably the best average was negative 1 20 which is pretty good.

  • I mean, that's still out.

  • Most of the time, it doesn't hit an enemy at least, um, And if it doesn't hit an enemy, uh, I guess it didn't exhaust steps either.

  • So I mean, at some point, it's still eventually on average, getting to the food.

  • But I don't really think it's, uh it's that great.

  • And like I said, I'd rather do a cool environment, something like Starcraft or or or we could make blob M cooler.

  • But, uh, this is just taking way way too long, and I just don't want to necessarily die on this hill.

  • I want to explore some other hills and they figure out which one I really want to do.

  • Battle on.

  • Um, so anyway, um, I think that's it.

  • I'm trying to think if there's anything else I really want to say other than this, if you wanted, um, one thing you could do is trying to remember now.

  • So I have Deke, you and tutorial.

  • But these air, actually from a different directory, I'm sure remember, if this is D Q.

  • And I think it's in D.

  • Q and stuff.

  • Let me check real quick.

  • Open up, deke and stuff.

  • Let's go to logs.

  • Yeah, OK, so So we could kind of watch, uh, the model if we want and so I can go up here.

  • The last time I did this, Daniel made fun of me for this being the way that we test her.

  • Like we just watched the model, but I think it's fine.

  • Um, So what we'd want to do is load in the best ish model so past three, and then we could load in.

  • Well, the one that's here is actually Okay, so the average 25 max 250.35 is the average.

  • Well, like this one better.

  • And then the men Negative.

  • Whoa, whoa!

  • Wow.

  • Taking a 4 54 men dame.

  • So it almost took 700 steps and Oh, this is a negative seven.

  • This is a positive who stick with positive.

  • Okay, this is probably when I grab so it's fine.

  • So load in that model weaken set, render show previewed a true aggregate.

  • I think preview is when we aggregate also load model.

  • I don't do this we didn't even do in the tutorial.

  • But just so you can see, let me just say this, uh, this is just in the model, um, methods.

  • Let me just search for load model here.

  • Let's just go to next.

  • Next, right?

  • So if load model is not none, so if we said it to none, it'll create a fresh model.

  • Otherwise, if we set something there, it's gonna load the model, and then it just loads the model.

  • Okay?

  • Nothing van See you going on there?

  • Um and I think that should be everything that you would need to know as far as changing toe from what I just covered in the tutorial anyway, so we could load model aggregate every show preview.

  • True.

  • Okay, let me then come into here.

  • Open in terminal is do Python blob world CNN nine moves load.

  • Dupuy.

  • So hopefully this trying to think if I aggregate steps every I feel like I might be missing something as far as getting it to pop up every game because that was 50.

  • What episodes?

  • And it's been just like, today's a fresh day and haven't looked at this code.

  • Um, let's go check.

  • So show preview.

  • Let's just copy Paste.

  • Let's show preview may go down to the code here ever and not Episode 1000.

  • So instead, let's actually say module 01 So we'll see every single episode, Thank you very much.

  • So this is just a bunch of episodes.

  • I mean, at least he's not hitting the enemy, but he's pretty stupid things, doesn't it?

  • Oh, Absalon is high.

  • Okay, look it.

  • Let's fix that real quick.

  • Uh, prior, Why you want to stand alone script for this, But, uh, you know, whatever.

  • Uh, absalon, we will just set to be zero one more time, okay?

  • Yeah, he's much quicker now cycling through and just getting food.

  • He's not even hitting the enemy at all.

  • I've been seeing him hit yet.

  • That time, At least you didn't get the food.

  • Likes the corners.

  • Boom, boom, boom, boom.

  • Okay.

  • Yeah, So I mean, this one's not terrible.

  • I mean, it's pretty good.

  • He really likes to go into that corner.

  • That's kind of funny.

  • Uh, nice.

  • Okay.

  • Uh, all right.

  • So that's all I'm gonna do here.

  • We're going to get into some of the other reinforcement learning models, and like I said, I'd like to do something more complex, and it kind of bums me out that it was taking so long.

  • But hey, that's deep learning for you; it does take a while to train models.

  • So, um, I would like to do something far more advanced, but I want to figure out which thing I want to do.

  • Which reinforcement learning algorithm do I want to use for that?

  • So, still on the hunt; we could do some cool stuff with DQNs for sure, but I definitely want to check out A3C for sure for that, maybe NEAT, stuff like that.

  • So, anyways, if you've got something you want me to look at, let me know.

  • Uh, leave a comment.

  • Come join us in the discord.

  • Whatever.

  • Uh, otherwise, it is a new day, So I will shout out some of my other channel members.

  • Eric Gomez at five months, Robo at five months, Alexis Wong at seven, Uday Bingley, and Ment Annoying, all at seven months as well.

  • Thank you guys very much for your support.

  • It allows me to spend a week on a stupid algorithm.

  • Actually, it's really cool.

  • I like that I can do this.

  • I hope that I can continue being able to do tutorials like these that I will never make the ad revenue back on.

  • So thank you guys very much for your support.

  • Uh, till next time.
