
  • What is going on everybody, and welcome to part three of the reinforcement learning, as well as Q-learning, with Python tutorial series.

  • In the last video, we got the mountain car to make it up the hill using Q-learning, which is cool.

  • But we did it with a bunch of parameters, or constants, that I just set because I knew they would work.

  • But in the real world, you're gonna have to figure these values out.

  • And the way that you're going to do that can't possibly be watching every episode.

  • And watching it doesn't really work because, you know, even with 25,000 episodes, showing every 2,000th one is still not a statistically significant sampling of how your model is doing, so we have to find a better way.

  • And one thing that we can do, just a simple metric to track, is simply the reward.

  • So a proper episode-reward tracking system is probably enough for most basic operations.

  • Now, for really complicated environments, you might need to do a little more, but you're not gonna use basic Q-learning for a complicated environment anyway.

  • So with that.

  • Let me just show you guys one way you could track these metrics over time.

  • Now, there's a million ways you could do this.

  • It's just a simple programming task, honestly, but it is something we definitely need to do.

  • So I'm just gonna show one example.

  • And then after that example, I'm gonna show you guys something kind of cool with the Q-tables at the end, just as a slight bonus.

  • So, anyway, the first thing we're gonna do here is just come underneath the Q-table.

  • Also, let's change episodes.

  • Let's just do 2,000 episodes and show every 500 episodes.

  • That will just help us get through this a little quicker here.

  • So the first thing that we're gonna do, just underneath the Q-table,

  • is create a couple of new lists, a couple of new values.

  • I meant to just zoom in a little bit there.

  • So first we're gonna have episode rewards, ep_rewards.

  • This will just be a list that contains each episode's reward.

  • And then we're gonna have an aggregate episode rewards dictionary, aggr_ep_rewards, and this dictionary is gonna track the episode number, basically; then we're gonna have the average, then we'll have the min, and then we'll have the max.

  • Okay, so this is good.

  • This will just be a dictionary that tracks episode number.

  • So this would just serve as, like, the x... yeah, the x-axis for a graph.

  • Basically, the average: this will be the trailing average.

  • So you can almost think of it as, not a moving average, but the average for any given window.

  • So every 500 episodes, for example, this will be the average.

  • So as our model improves, the average should go up.

  • Minimum: this is just gonna track, for every window we show, what was the worst model we had.

  • So basically, it's just: what's the worst?

  • It's not a hard concept. Max: what was the best one?

  • So why do we want these?

  • Well, the average might actually be going up, but the minimum, or the worst-performing model, is still in the dumps.

  • And so you might have cases where you actually prefer that the worst model is still somewhat decent rather than having the highest average, or something like that.

  • So this is just barely getting into it.

  • You know, what you're actually gonna be looking for, or trying to optimize for, is gonna vary depending on what kind of task you're attempting to achieve.

  • So I'm just gonna keep it fairly simple here and just go with that.
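A minimal sketch of the tracking setup being described, assuming the variable names used in this series (ep_rewards, aggr_ep_rewards) and the episode settings just mentioned:

```python
# Sketch of the tracking setup (variable names assumed from the series).
EPISODES = 2000       # total training episodes
SHOW_EVERY = 500      # aggregate/display stats every 500 episodes

ep_rewards = []       # one total reward per episode
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}  # per-window stats
```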

  • So the next thing we're gonna do is come down into the episode loop here, and we're gonna track episode_reward; we're going to say that equals zero.

  • Then we're gonna come down to the iteration of steps, which is here, and then we get a reward here, and what we want to do is add that reward.

  • So episode_reward += reward.

  • Uh, then what we want to do?

  • What's your deal? I guess because we're just not using it yet.

  • Yeah, I meant to define episode... wait.

  • Whoa — I always seem to typo episode_reward: ep... eps... episode_reward.

  • Okay, coming down here.

  • episode_reward += reward.

  • Phenomenal.

  • Okay, so that will add to episode_reward every step.

  • And then basically, at the end of the episode, what we'd like to do is append episode_reward to ep_rewards.

  • So we'll come to the very end of this loop.

  • Come back here.

  • And the first thing that we're going to say is ep_... ep_rewards.

  • I don't know if my keyboard's dying or what.

  • I'm pretty sure I didn't make that typo.

  • Uh, yep: ep_rewards.append... although it's a little hard to explain with the keyboard issue, but anyway, ep_rewards.append.

  • We want to append that episode_reward.
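Roughly, the in-loop tracking being typed out here looks like the sketch below; env, q_table, and get_discrete_state are assumed from the previous part of the series, and the Q-update/epsilon logic is omitted to keep the focus on reward tracking:

```python
import numpy as np

# Sketch: accumulate reward within an episode, then store the total.
# env, q_table, and get_discrete_state are assumed from part 2 of this series.
for episode in range(EPISODES):
    episode_reward = 0
    discrete_state = get_discrete_state(env.reset())
    done = False
    while not done:
        action = np.argmax(q_table[discrete_state])
        new_state, reward, done, _ = env.step(action)
        episode_reward += reward                       # add this step's reward
        discrete_state = get_discrete_state(new_state)
    ep_rewards.append(episode_reward)                  # store this episode's total
```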

  • So that's the total reward. At the very end, what do we want to do?

  • So then the next thing we're going to say is: if not episode % SHOW_EVERY — that's the same as saying if episode % SHOW_EVERY == 0.

  • So you can kind of shorthand this just by saying if not episode % SHOW_EVERY; basically it just means: every SHOW_EVERY episodes, do this thing.

  • So I can't tell you how many times in python I need to perform a task like this, and it's kind of unfortunate that this is the like industry standard.

  • It would be kind of nice to have some sort of statement that just says, like, "every."

  • Like, wouldn't that be a nice statement in Python?

  • "Every" — just something like that.

  • Anyway, Make it happen, guys.

  • Uh, if not episode % SHOW_EVERY.

  • So now what we want to do is we actually want to build, uh, we're gonna work on our dictionary.

  • So basically, the first thing we're gonna do is calculate the average reward, and that's going to be equal to the sum of our ep... ep_rewards.

  • There we go.

  • It'll be the sum of ep_rewards[-SHOW_EVERY:], and then divided by...

  • And I know some people are gonna be like, why don't you just divide by SHOW_EVERY? Well, in this case, a slice like [-SHOW_EVERY:] just means, like, the last, let's say, 500.

  • But if the list is only 300 long and we take the last 500, it's still only gonna be 300 long.

  • Either way, we shouldn't run into that as a problem.

  • But just in case, I'm going to instead say the len of ep_rewards[-SHOW_EVERY:]. Cool, so that gives us our average reward, and now we're ready to actually populate that dictionary.

  • So that's aggr_ep_rewards.

  • And then basically, we're gonna have, um, four things that we're gonna do.

  • First of all, the episode: that's just gonna be equal to... or, not equal to — we're gonna say append — episode.

  • And then I'm gonna take this.

  • I'm gonna copy-paste, paste, paste: average, min, and max.

  • And so here, we're going to append the average. Then here,

  • I'm gonna copy and paste this ep_rewards[-SHOW_EVERY:] thing, paste, paste.

  • And this one will be... was that min?

  • So the minimum of that window, and then here we're going to say max.

  • Cool, beautiful.

  • So now we've built this dictionary.

  • The next thing that would be useful, possibly, is to print an f-string.

  • So, just for this specific episode, we could print, like, all of these things, so we could say episode:, and then average:, min:, and then max:.

  • And then we'll just copy and paste into here.

  • Cut.

  • Copy, paste.

  • Thank you.

  • Then min.

  • Max: copy this, paste, and then do the same thing for max.

  • Here: copy.

  • Paste. Beautiful.
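Put together, the aggregation-and-print block being built here might look like the sketch below (placed inside the episode loop, right after the append; names assumed as before):

```python
    # Sketch: every SHOW_EVERY episodes, aggregate stats over the last window and print them.
    if not episode % SHOW_EVERY:
        window = ep_rewards[-SHOW_EVERY:]             # the same [-SHOW_EVERY:] slice discussed above
        average_reward = sum(window) / len(window)
        aggr_ep_rewards['ep'].append(episode)
        aggr_ep_rewards['avg'].append(average_reward)
        aggr_ep_rewards['min'].append(min(window))
        aggr_ep_rewards['max'].append(max(window))
        print(f"Episode: {episode}, avg: {average_reward}, min: {min(window)}, max: {max(window)}")
```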

  • Okay, so that will give us some metrics as it's training.

  • So we can kind of see how things are going besides just watching the occasional replay.

  • Because, like I said, that's just probably not gonna be enough.

  • Now, at the very end, we're gonna graph it.

  • So at the very tippy top, we're gonna import matplotlib.pyplot as plt, and then we're gonna come down to the bottom again, and we can close the environment.

  • That shouldn't be a problem.

  • We'll do it here: plt.plot — and then we want to plot basically all three of these combinations.

  • So basically, the x will always be the episode.

  • And then we'll do avg, and then we're gonna say label equals "avg."

  • And then I'm gonna copy this, paste, paste, and then we'll do min/min, we'll do max/max, and then we'll do plt.legend, and we'll say the location here is 4.

  • And then finally, plt.show().

  • And for the location, you can kind of pick where you want the legend to go, and I'm gonna put it at 4, which just means the lower right.
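As a sketch, the plotting section at the bottom of the script would be roughly:

```python
import matplotlib.pyplot as plt   # in the video this import goes at the very top of the script

# Sketch: plot the per-window average, min, and max rewards against the episode number.
plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards['avg'], label="avg")
plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards['min'], label="min")
plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards['max'], label="max")
plt.legend(loc=4)   # loc=4 places the legend in the lower right
plt.show()
```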

  • So in theory, everything should be going up over time.

  • So hopefully the lower right isn't in the way; if it's in the way, we didn't do very well anyway. So we'll save that, and what I want to do is open up a command line.

  • We'll run python on our Q-learning part three script — that'll get it at least started.

  • I'm gonna move this, like, here, so we can hopefully see the metrics.

  • "Shoot" is what I was going to say.

  • List index out of range.

  • Rewards, rewards... do we not append... er, ep_rewards.append(episode_reward).

  • We are appending.

  • We must have done something stupid.

  • So: episode_reward equals zero, episode_reward += reward, and at the very end, ep_rewards.append(episode_reward).

  • Ah, Then hold on.

  • Um, I'm just not seeing what the issue is here.

  • List index.

  • Oh — minus SHOW_EVERY.

  • Okay, I see it.

  • I see it.

  • So coming back down here.

  • So the issue was the minus SHOW_EVERY colon, the [-SHOW_EVERY:] slice.

  • Ah, simple error.

  • Let's just make sure we didn't make that anywhere else.

  • Okay, I think we're good.

  • So let's try one more time.

  • Cool.

  • Okay, so everything's at negative 200 here.

  • Not too, too shocking.

  • So while we wait for this to do its, what, 2,000 episodes, I'm going to give a shout-out to my most recent brand-new channel members, Mike Smith, A Jay's Sheeni.

  • Santa knew Bo, Mick and Harv demand.

  • Thank you, guys for your support.

  • Welcome to the club.

  • You guys are awesome.

  • So it looks like we are almost done at 1500.

  • Aw, that stinks.

  • He, like, almost made it to that flag.

  • Uh, okay, hopefully, at this point, we'll get a beautiful graph.

  • Wow.

  • Look at that.

  • Look at that.

  • Was that average or max?

  • I can't even tell. I think that's average.

  • Or... that's max.

  • Rather — of course it's max.

  • It's the one that's on top.

  • I was just testing you guys, to be honest.

  • Okay, so we can see here that things are improving.

  • Now, we only did 2,000 episodes, so it's no surprise that, you know, the max episode was doing pretty good, but the minimum is still gonna be negative 200 — like it just never made it to the flag.

  • Now we could continue.

  • I'm just not gonna waste your time and, like, you know, graph a super long run and just wait for things to iterate through.

  • I've actually already done it.

  • So this is... well, there is a video I was gonna show you guys, and I'll still show it to you; I'll just leave it in a different tab.

  • Instead, we're gonna pop over, uh, here, and I'm gonna just full screen this thing, and I'm gonna scroll down.

  • So this is kind of basically what we were just looking at, and then I went a little longer.

  • Um, actually, I'm not even sure.

  • What did I change here?

  • Oh, this was just with an epsilon change.

  • So one of the things that you would use these graphs for is, like, changing things: how does changing the epsilon decay value change things? How do the start and end epsilon-decaying episode numbers change things? All that kind of stuff.

  • And then also, you can train a model for a little bit without an epsilon,

  • then add the epsilon.

  • How does that change things?

  • So you could do all kinds of stuff like this totally up to you what you want to do.

  • So then, even after 10,000 episodes, you can see the minimum — like I said, it got in the way because it didn't do very well.

  • But there is a little tick up right here where the minimum was at minus 200.

  • But, um — then again, this was back to changing the epsilon decay value.

  • You can see how that changed some things.

  • Ah.

  • Then here I changed the discrete observation size from 20 to 40, and we can see here that at least the improvement is, like, super linear.

  • Like here.

  • You can see it.

  • It goes up, and then it kind of flatlines, plateaus.

  • Whereas here, it definitely is constantly improving, to the point of continuing past — or at least it looks like it's gonna continue past — any other point we've ever had.

  • And for the same number of episodes.

  • So then I'm like, okay, let's train it for a really long time.

  • So we did 25,000 episodes, and we can see that maybe, at least for these settings, the sweet spot is around 20,000, because we can see the worst agents were still at least making it to the flag.

  • Because if they don't make it to the flag, it's going to be a minus 200.

  • So as long as they eventually make it to the flag, that's pretty good.

  • So from about, you know, just before 20,000 to, I don't know, maybe 22,000 or something — so for 2,000 episodes, or more than 2,000, probably 3,000 — it never failed.

  • It was successful every single time.

  • And then, for whatever reason, it starts to come back down again.

  • So anyway, so you can kind of see how things changed.

  • And then from here, you could start tweaking, like, the epsilon value, or you could reintroduce epsilon.

  • Because by this time, the epsilon is not there anymore.

  • So maybe we want to reintroduce epsilon, because maybe the agent is finding itself in a position it's never been in before.

  • So that value has, like, never been tweaked.

  • Right?

  • So, stuff like that. Okay, and then here is where we start talking about saving the Q-tables.

  • So basically, it's just np.save, and then you can just save the Q-table as whatever you want.

  • So, for example — well, yeah, you could save per episode, but that would be a lot of Q-tables.

  • So instead, what I'm doing here is that modulo thing.

  • So you could decide, you know, hey, I want to save... so, for example, just take this.

  • I cut this.

  • Come over here.

  • You know, basically, you could save a Q-table every SHOW_EVERY, so you could paste this in.

  • You could say, okay — we don't have the qtables directory here, but I made one in the text-based version.

  • Make a qtables directory.

  • You save with the episode number, and then you save the Q-table there.

  • And you could just do this so you could do this every episode.

  • You could do this every SHOW_EVERY, or you could say if not episode... colon — sorry, modulo 10 — okay, and then you could instead save that.

  • So in this case, it would save the Q-table every 10 episodes.

  • So it's really up to you how you want to do that.
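As a rough sketch of that idea — the qtables directory and filename pattern here are assumptions, and numpy is assumed imported as np:

```python
    # Sketch: inside the episode loop, periodically save the Q-table so good
    # intermediate agents aren't lost (directory and filename pattern assumed).
    if not episode % 10:            # every 10 episodes; SHOW_EVERY would also work
        np.save(f"qtables/{episode}-qtable.npy", q_table)
```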

  • And you don't actually need to do that.

  • But you would do that if — like, if you graph it, for example: after I trained this model, it was clear that by 25,000...

  • Well, it kind of went back to having some failures.

  • So what I would have wanted to do, immediately after seeing some of these failures, is be like: no, I want the model at 20,000 — like, that's the one I want — or I want the one at, like, 20... I don't know, 21,000.

  • It looks like I want that one.

  • If I only saved the Q-table at the very end — too bad.

  • Now I gotta retrain, I guess, all the way to 20,000.

  • And who knows if it's gonna look like that again?

  • Because there's a high degree of randomness here.

  • So that's one reason you might actually want to save the Q-table fairly frequently.

  • You probably don't want every episode, but maybe every 10 or every 100.

  • But the other reason you can save the Q-table, um, is... I'm curious what happened here.

  • Go for 100.

  • I'm not sure what changed here, like what's different about this one?

  • Um, than this one.

  • Unless it's just like doing it again.

  • This one looks really good, though.

  • Anyways, I'm not sure what changed for this picture. Anyway, so once you have the Q-tables, this is what I did to generate graphs of the Q-table.

  • So in this case, you can see I'm saying i = 24999,

  • so basically episode 25,000, because they start at zero.

  • Ah.

  • Then I just graph the Q-table, basically.

  • So what we have here is the Q-table.

  • Don't forget I made the observation space a 40 by 40.

  • So for action zero, this is all of your Q-values... well, not Q-values, but basically:

  • what I did is, if the Q-value is the max Q-value across the possible actions, I marked that combination with a green dot; otherwise, it's a red dot. So you can see it here over time.

  • I also probably could have just kept it green dots only, but whatever.
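One way to draw that kind of chart — a sketch, assuming a saved 40x40x3 Q-table under the qtables/ filename pattern from above — is to scatter a green dot wherever an action's Q-value is the argmax for that state and a red dot otherwise:

```python
import numpy as np
import matplotlib.pyplot as plt

i = 24999                                        # which saved episode to plot (filename pattern assumed)
q_table = np.load(f"qtables/{i}-qtable.npy")     # shape (40, 40, 3): position bin x velocity bin x action

fig, axes = plt.subplots(3, 1, figsize=(12, 9))  # one panel per action
for x, x_vals in enumerate(q_table):
    for y, y_vals in enumerate(x_vals):
        best = np.argmax(y_vals)                 # action with the highest Q-value for this state
        for action, ax in enumerate(axes):
            is_best = (action == best)
            ax.scatter(x, y, c="green" if is_best else "red",
                       marker="o", alpha=1.0 if is_best else 0.3)
for action, ax in enumerate(axes):
    ax.set_ylabel(f"Action {action}")
plt.show()
```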

  • Anyway, my initial plan actually was to create a 3D chart and have, like, cubes, basically, and I started to do that, and then it actually wasn't as cool as I thought it was gonna be.

  • Then I was like, Well, I could just graph like each of the actions, and this is kind of what I came up with.

  • So this is at the very end, and you can actually see here.

  • Action zero.

  • Action zero was, I think, push left; action one, if I recall right, is just do nothing; and action two is push right.

  • So you can see in this kind of cluster here, it's pretty much always a push left, and some of these that aren't push left are probably just still left over from the random initialization — we just haven't hit that combination yet is probably what's going on there.

  • Um, and then here again, I can't imagine many circumstances where you wouldn't want to be pushing one way or the other.

  • I mean, there are probably some — like, as you come up, maybe you only want to go so high up the hill or something, like when you're swinging back the other way.

  • So maybe that's it.

  • I really don't know.

  • Ah.

  • Anyway, you can see how it creates these little clusters, which is kind of cool.

  • And then here, this is i = 1.

  • So basically, after one episode, this is what it looks like.

  • So obviously it's pretty random looking because it was randomly initialized.

  • So basically, you go from here to here, which I kind of wanted to see happen.

  • So I have all the Q-tables.

  • So then I use the following to basically iterate over all of the Q-tables and save the charts.

  • So I just made a Q-table charts directory and saved every single Q-table, basically from here to here, in its graph form.

  • Uh, and then the video code is here.

  • It's the same code I used to make the deep dream stuff.

  • If you haven't seen those, just YouTube "deep dream sentdex" and you'll find it.

  • Um, Anyway, it's the same code from that.

  • And then we get a video of the Q-table over time, which is pretty darn awesome.
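The exact video code isn't shown in the transcript, but one way to stitch the saved chart images into a video — a sketch using OpenCV, with the chart directory, filename pattern, frame rate, and frame size all assumed:

```python
import cv2  # pip install opencv-python

# Sketch: write every saved Q-table chart PNG into a video, in episode order.
fourcc = cv2.VideoWriter_fourcc(*"XVID")
out = cv2.VideoWriter("qlearn.avi", fourcc, 60.0, (1200, 900))  # fps and size must match the chart images

for i in range(0, 25000, 10):                    # assumes charts were saved every 10 episodes
    frame = cv2.imread(f"qtable_charts/{i}.png")
    out.write(frame)

out.release()
```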

  • So this is it over time and it goes kind of quick.

  • I don't really know, because the whole video, apparently, was only 13 seconds, but I think I did 90 frames a second or something like that.

  • But it's not really something I want to stare at for the longest time.

  • But it's kind of cool how you can see that it obviously starts random, and then it starts, like, in the middle area of the whole Q-table and then slowly expands out, which is kind of cool.

  • Because I guess that would be, like, the cart was in a slightly off-center position to start.

  • And so I'm thinking left really is left on these graphs, and right is probably actually right on the graphs.

  • I could be totally mistaken, but that's almost what it looks like.

  • And then, as it continues to learn, it continues to kind of fill out that entire space.

  • But one thing you'll note is it never actually touches any of these, like, fully-right positions, because the victory position was, I want to say, like 0.5, but the observation space for that x-axis went to 0.6 or 0.7 or something like that, and I think that's why these never get touched: the car never goes to those positions.

  • So it's kind of cool to see, because you can actually see what's happening in the example on the graph, which is just kind of cool.

  • And also just the fact that it slowly grows out — I think it's slowly growing out because it learns to push itself further and further by building momentum.

  • And I just think it's super cool to see in graph form.

  • It's like staring at the Matrix and seeing the picture.

  • So anyway, there you go.

  • Um, yeah, so feel free to make your own and change the visualization if you want.

  • This is just one way to do it.

  • I still kind of want to make the 3D table thing, but I don't think I will.

  • So anyway, I think, yeah, that's everything.

  • For now, if you want, go to the text-based version of the tutorial, look at the video there, or you can make your own video, change colors, do all kinds of cool stuff.

  • That's it for now.

  • In the next video, what we're gonna do is build our own Q-learning environment.

  • So immediately after I learned Q-learning, the first thing I wanted to do was make my own environment.

  • I didn't really intend to do a tutorial on that, just because I didn't really see why.

  • But everybody's asking about it, and it's, like, so obvious — like, the first thing I wanted to do was build my own Q-learning environment.

  • So I guess I shouldn't be surprised that lots of people want to see us make our own.

  • So that's what I'll do in the next tutorial, and the benefit there is that, if it's your environment, you can make certain tweaks and changes and, like, slowly add or remove complexity.

  • You know, you have full control over the environment.

  • Whereas, like, with mountain car, I can't really make that environment too much more complicated.

  • I could go with a different environment, maybe, or something like that, but you still don't have a lot of control.

  • If you can make your own environment — plus, you can start to understand how you could take some real-world examples, or maybe a game that you want to make on your own, or whatever.

  • So anyway, that's what we'll be doing.

  • The next video is actually making our own environment and then doing Q-learning on that environment.

  • My hope — my expectation — is that I can do it all in one video, so that's gonna be the goal.

  • So we'll see.

  • Questions, comments, concerns, whatever.

  • Feel free to leave them below.

  • If you wanna hang out, ask questions, and get hopefully quicker answers — probably better answers, because it's usually not me that answers — join the Discord channel: it's discord.gg/sentdex.

  • Shout out to all the mods and helpers in there.

  • You guys are awesome.

  • Uh, and if you like the content, subscribe.

  • Otherwise I will see you guys in the next video.
