Normalizing and creating sequences Crypto RNN - Deep Learning w/ Python, TensorFlow and Keras p.9

What's up, everybody?
And welcome to another deep learning with Python, TensorFlow and Keras tutorial series.
In this video, we're gonna be continuing along on our little mini project of implementing a recurrent neural network to predict the future price movements of a cryptocurrency based on the sequences of that currency's prices and volumes, along with another three cryptocurrencies' prices and volumes.
Uh, I think I got that right.
All right, so let's get into it.
So far, we've got the data.
We've merged all the data, and we've created targets.
Now, the next things that we have to do: we do have to make sequences out of that.
We have to balance, normalize, scale, all that fun stuff.
But the next thing I want to focus on is, uh, out of sample.
So before we start shuffling and doing all these things, we need to immediately separate out our out of sample data.
Now, in the case of temporal data, time series data, and really a lot of sequential data, I would argue you can't just shuffle and then take a random 10%.
The reason you can't do that is this.
So our sequences are, you know, 60 minutes long, and we've got 60 units in each sequence, right?
And we're predicting out three minutes.
The problem is, if we were to shuffle this data and then just pick a random, you know, 10%, the out-of-sample samples would all have very, very close examples in sample.
And it would be relatively easy for the model, as it overfits on the in-sample data, to also just have that overfitting pour over into the out-of-sample data.
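To make that leakage concrete, here is a minimal sketch that only counts window positions, no prices involved. The row count `N_ROWS`, the 10% hold-out, and the names `starts` and `leaky` are assumptions for illustration, not anything from the video's code. A window that starts one minute before or after another shares 59 of its 60 rows with it.

```python
import random

SEQ_LEN = 60    # sequence length used in the video
N_ROWS = 200    # hypothetical number of one-minute rows

# Identify each window by its start row; window s covers rows s .. s + SEQ_LEN - 1.
starts = list(range(N_ROWS - SEQ_LEN + 1))

# Shuffle the windows and hold out a random 10% as "out of sample".
random.seed(0)
random.shuffle(starts)
n_test = len(starts) // 10
test_starts = starts[:n_test]
train_starts = set(starts[n_test:])

# A held-out window is "leaky" if a training window starts one minute
# earlier or later, meaning the two share SEQ_LEN - 1 rows.
leaky = [s for s in test_starts
         if (s - 1) in train_starts or (s + 1) in train_starts]

print(f"{len(leaky)} of {n_test} held-out windows share "
      f"{SEQ_LEN - 1} rows with a training window")
```

With a random hold-out, near-duplicates of the held-out windows almost always sit in the training set, which is why the split here is done by time instead.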
So instead, what you have to do with sequential data in most cases is take a chunk of the sequences and separate them away.
Now, in the case of time series data, I would take that a step further and say, really, out of sample needs to be a chunk of data in the future.
So in the case of our data, we want to take, like, let's say, the last 5% of the historical data and separate that out as our out-of-sample data.
So that's what we're going to do next.
And I think that's the most realistic way to do out of sample testing.
Aside from maybe doing like a true forward test, like using all your data and then testing it forward. In this case, this is basically the exact same thing.
It's as if we built the model, you know, 5% of time ago, and then forward tested it.
So anyways, uh, that's what we're gonna do here.
So I can't tell you how many times I've seen, you know, finance stuff or even just any sequential stuff.
But it's usually with finance that people do, you know, time series data where either they don't do out of sample.
And they're like, wow, look how good my model is fitting.
And it's like, whoa, you know, with a big enough neural network and enough epochs, you will absolutely overfit your data to a perfect or almost perfect fit.
But they don't do true out of sample, or they screw up on the out of sample by doing a method like I just said, so it's really important to keep that in mind.
All right, so let's go ahead and say, like, our last_5pct of time.
That's gonna be equal to... in fact, let me just do this.
Let's first take times.
We're just going to say times is the sorted...
They should be in order, but this is really important that they are in order, so we're going to sort them: sorted, and then main_df.index.values. So .index just references the index, and .values converts it to a NumPy array.
So those are the times.
Now what we want to find is where's the threshold of, like, the last 5% of times, like, what's the actual Unix time that is the threshold of 5%?
So the way that we're gonna grab that is we're gonna say last_5pct is equal to times, and then we want to get the index value, so the index value should be something like, you know, let's say that it should be something negative.
I forget how many things we have.
Let's say if you had 100,000, it would be about the index of negative 5,000, right?
Or it would be the index of 95,000.
Something like that. So the last 5% will be times at the index of negative (a negative index, not minus) int of the value of 0.05 times the len of really anything. It could be main_df, but we'll say times, so that will give us the threshold.
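As a rough sketch of that threshold lookup: the toy `main_df` below, with 100 one-minute rows indexed by Unix timestamps, is a made-up stand-in for the merged data frame from the earlier parts, and the name `last_5pct` is assumed from the spoken description.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for main_df: 100 rows, one per minute, indexed by Unix time.
index = 1_528_968_660 + 60 * np.arange(100)
main_df = pd.DataFrame({"close": np.linspace(1.0, 2.0, 100)}, index=index)

# Sort the timestamps so that a negative index really means "most recent".
times = sorted(main_df.index.values)

# Negative index: the timestamp at which the final 5% of rows begins.
last_5pct = times[-int(0.05 * len(times))]

print(last_5pct)
```

One thing to watch: with fewer than 20 rows, `int(0.05 * len(times))` is 0, and `times[-0]` is just the first timestamp, so the lookup only makes sense on a reasonably long history.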
So let's just go ahead and print that out: last_5pct.
I'm gonna comment that out now.
And let's just run that real quick and see what we got.
Okay, so we actually have a timestamp now.
So now we can actually separate out our validation data or out of sample data and our training data, and we're gonna do that.
Now, before we do anything else, we want to separate that out. That can kind of create a problem with normalization and stuff like that.
I'm not gonna worry about that right now, or scaling, but I'm not gonna worry about that.
It's far more important that we get this right than anything else, and we can figure out a better way later.
But we definitely want to make sure we get this right.
So we're gonna say validation_main_df equals main... oops, that's main_df, where main_df.index is, we'll use greater than or equal to, last_5pct, so our validation data is basically anywhere where the timestamp is greater than or equal to this value.
So it should be the last 5%.
And then we want to do basically the exact same thing for the actual training data, which will retain the name main_df: main_df equals main_df where main_df.index is less than last_5pct.
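The split described here might look roughly like the following. The toy `main_df` is again a hypothetical stand-in, rebuilt so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for main_df: 100 one-minute rows indexed by Unix time.
index = 1_528_968_660 + 60 * np.arange(100)
main_df = pd.DataFrame({"close": np.linspace(1.0, 2.0, 100)}, index=index)

times = sorted(main_df.index.values)
last_5pct = times[-int(0.05 * len(times))]

# Out-of-sample data: every row at or after the threshold timestamp.
validation_main_df = main_df[main_df.index >= last_5pct]

# Training data keeps the name main_df: every row strictly before the threshold.
main_df = main_df[main_df.index < last_5pct]

print(len(main_df), len(validation_main_df))
```

Because the split happens on the raw rows before any sequences are built, no training sequence can contain a row from the validation period.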
So now we've split up the data, and the next thing that we'd like to be able to do is now, you know, we need to create our sequences.
We need to balance, we need to scale, and probably other things.
I'm not thinking of them off the top of my head yet.
And so what we want to do is make a function to do all those things, because we're gonna do all those things on both of these.
Basically, we want to make some code to do all those things we need to do on both of these.
So we need to make a function.
So what we want probably is a function that's going to allow us to do something like this: train_x, train_y equals some sort of function.
I'm gonna call it preprocess_df, and then you just pass it a df, so for train_x, train_y, main_df gets passed.
And then we'll do the exact same thing for validation: validation_x, validation_y.
And we want to pass the validation_main_df.
So that's what we're hoping to be able to do.
So I'm gonna comment those out for now.
And now we're just gonna go to the top here, and we are going to start creating our preprocess_df function.
So define preprocess_df.
It's going to take in a df, a data frame, as a parameter.
And I think the first thing that we'll do is scaling.
And to do that, we're going to: from sklearn import preprocessing.
If you don't have this, it's pip install sklearn.
I think it's sklearn.
It might be scikit-learn, actually.
I don't know what the... let's see, let me pull this down.
I don't think I need that to be running. Let me do pip install sklearn.
Okay.
Yeah.
So, pip install sklearn.
I didn't know if it was scikit, you know, dash learn, or what.
That's what the package is called.
Anyways, okay, so we get the preprocessing, and now we'll come back down to this data frame, or the function that we're gonna build.
And first, we're gonna say df equals df.drop, and what we want to drop is that future column.
We don't need it.
It carries no significance to us anymore.
We only needed that column to generate the target.
And you definitely wouldn't want to accidentally leave in the future column, because that would allow the network to learn really, really quickly how to predict the future movement.
So we want to drop that.
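A small sketch of that drop, on a made-up frame. The column names and values are placeholders; only `future` and `target` are named in the video.

```python
import pandas as pd

# Hypothetical frame with the same kinds of columns: features, future, target.
df = pd.DataFrame({
    "BTC-USD_close": [100.0, 101.0, 99.0],
    "BTC-USD_volume": [5.0, 6.0, 4.0],
    "future": [99.0, 102.0, 98.0],   # only needed to derive target
    "target": [0, 1, 0],
})

# Drop the future column so the network never sees the answer as an input.
df = df.drop(columns=["future"])

print(list(df.columns))
```

Using the `columns=` keyword avoids any ambiguity about which axis is being dropped.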
The next thing that we're gonna go ahead and do is we're gonna iterate over the columns, and we want to scale those columns.
So what we're gonna do is: for col in df.columns.
What do we want to do?
Well, as long as col is not target, because we don't actually want to...
Are you angry at me?
We don't need to scale, we don't need to normalize, we're not gonna do any of that to target.
Target is done.
It's a completed column.
We don't need to do anything more to it. Now, for the other columns, we do need to do some stuff.
One of the things we need to do is: df[col] now is going to become df[col].pct_change().
And the reason why we want to do that is this normalizes all the data.
So whether the price moved from 20,005, or it moved from, you know, a few pennies to a few more pennies, percent change normalizes that, because, you know, Bitcoin is at a different price point than Litecoin, which is at a different price point than Ethereum, and so on.
So this helps us to do that.
Same thing is true with the volume.
The volumes are different in terms of magnitude, but their actual movements, you know, the trends of their movements, are what we're trying to analyze for their relationship to other prices and stuff.
So anyways, we normalize everything with a percent change.
Then we're gonna do a df.dropna, inplace equals True.
Percent change will definitely spawn at least one NaN at the very beginning.
But just in case, these things love to creep in, and they're gonna cause a heck of a lot of trouble for us, so we're gonna drop them if they appear. Then we're gonna say now df[col] is equal to preprocessing.scale.
And what do we want to scale?
We want to scale df[col].values.
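Putting the loop together, a sketch on toy data could look like this. The column names and numbers are invented; the calls are pandas `pct_change` and `dropna` plus scikit-learn's `preprocessing.scale`, as named in the video.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical frame: two feature columns plus the finished target column.
df = pd.DataFrame({
    "close": [100.0, 101.0, 99.0, 102.0, 103.0, 101.5],
    "volume": [5.0, 6.0, 4.0, 7.0, 6.5, 5.5],
    "target": [0, 1, 1, 0, 1, 0],
})

for col in df.columns:
    if col != "target":                      # target is already final, leave it alone
        df[col] = df[col].pct_change()       # relative move instead of raw price/volume
        df.dropna(inplace=True)              # pct_change leaves a NaN in the first row
        df[col] = preprocessing.scale(df[col].values)  # zero mean, unit variance

df.dropna(inplace=True)                      # final safety net against stray NaNs

print(df)
```

Two details worth knowing: `preprocessing.scale` standardizes each column to zero mean and unit variance rather than squeezing it into the zero-to-one range, and because `dropna` runs inside the loop, one leading row is lost per feature column, which is negligible on minute data.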
So what is it?
So preprocessing is from sklearn.
There is some preprocessing and scaling with Keras.
I have used it.
I don't know how it works.
In general, I've been relatively unhappy with how it scales the values.
For some reason, I don't see it actually perfectly scaling from zero to one, which is the target.
I am sure there's a way to do it.
It's just that this works just as well.
So I'm gonna use that. I'm aware it exists in Keras, and I think the cool thing with Keras is you can, you know, with multi-axis data, you can scale as well.
And I don't think you could do that with scikit-learn.
But again, I've not tried.
Anyway, many ways to skin a cat. You also could scale like, it's like the current value divided by min minus max, I think.
Or something.
Whatever it is, there's a really simple formula that you could scale with.
So however you want to do it, just scale the data.
Try to get it to be between zero and one.
So this is how I'm going to do it for now.
Um, if that bothers you, feel free to do it a different way.
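For reference, the simple formula being reached for here is most likely min-max scaling, (value - min) / (max - min), which does land exactly between zero and one. The helper name `min_max_scale` below is made up for this sketch; it is not what the video uses.

```python
import numpy as np

def min_max_scale(values):
    """Map values linearly onto [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has no range to scale; return zeros instead of dividing by 0.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
print(scaled)  # [0.   0.25 0.5  1.  ]
```

scikit-learn offers the same transformation as `preprocessing.MinMaxScaler`, if a library call is preferred over a hand-rolled helper.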
Okay, so now that we've done that, at the very end, let's do df.dropna again, just in case.
So if this creates, you know, any NaN, which it definitely does, the percent change, we're gonna drop it, and then this here, just in case.
For whatever reason, if that creates a not-a-number, we're going to make sure we drop it again.
Okay, let's do... I think we've got time.
We'll stuff in the sequential data as well.
So now we're gonna say sequential_data is equal to an empty list.
And then we're gonna say prev_days is equal to, and we're gonna use a deque, maxlen (actually, it should be one word, maxlen) equals whatever the sequence length is. We need to go ahead and import deque.
So I believe it's from collections import deque.
Now, deque is a fancy little dude.
What it does is, so prev_days is a deque with a maximum length of 60.
So it's best you could think of it like a list.
And all you do is you just keep appending to this list.
But as this list reaches the length of 60, as it gets new items, it just pops out the old items, and so we don't have to write that logic.
It just does it for us.
So what we want to do is we're gonna wait until prev_days at least has, you know, 60 values, and then from there, we're just gonna keep populating it, and that's gonna be our sequence.
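A tiny demonstration of that sliding behaviour, with the length shortened to 3 (instead of 60) so the output stays readable. The integers stand in for rows of features.

```python
from collections import deque

SEQ_LEN = 3  # 60 in the video; kept small here so the output is easy to follow

prev_days = deque(maxlen=SEQ_LEN)

for minute in range(5):
    prev_days.append(minute)   # once full, appending silently evicts the oldest item
    print(list(prev_days))

# Printed, one line per step:
# [0]
# [0, 1]
# [0, 1, 2]
# [1, 2, 3]
# [2, 3, 4]
```

The first two steps show why the code waits until the deque holds a full `SEQ_LEN` items before treating it as a sequence.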
So now what we're gonna do is: for i in df.values.
So what is df.values going to be?
Probably what I should do is, let's print df.head(), just so everybody knows, df.head() as well as: for c in df.columns, print(c).
I just wanna make sure everybody knows, because at this point, we're starting to get a little crazy.
Ah, preprocess_df, we need to at least run it.
I'm just gonna come down here.
Hopefully I don't forget to undo this.
I just think hopefully this will help to kind of visualize what's happening here.
I do this.
So we've got our target.
Everything's been converted to percent change as well as been scaled.
And then these are all the columns that we're working with.
And these are the columns in order.
So for input data to measure what the target is, we're gonna use all of these, which is Litecoin prices, Ethereum prices, Bitcoin prices, Bitcoin Cash prices and so on.
And then we've got a target.
So the sequences will be sequences of price and volume for all of those, but they can't contain the target, because again, that would be cheating.
So what we're gonna do is come back up to where we were working and in fact, let me.
I guess I'll leave that for now.
Uh, but I'm gonna delete this stuff.
So what we want to do is iterate over the columns.
So for i in df.values.
So df.values just converts your data frame to a list of lists, so it won't contain time anymore.
But it's still in the order of the index, and it is going to contain target, so we need to be careful.
So for i: i is just the row of all the columns, so, you know, Bitcoin close, Bitcoin volume, Ethereum close, Ethereum volume, and so on.
So what we want to say is, as we iterate over this data, for i in df.values.
What we want to say is prev_days.append.
And what we want to append is a list, because the sequence is gonna be a sequence of lists.
What we want there is n for n in i, up to negative one.
What's going on here: n for n is each value in that list of lists, which again is each of the columns, right?
So the close, the volume, close, volume, but it also includes target.
So we're gonna say n for n in i up to the last index.
So we are not taking target.
Hopefully, that's as clear as possible.
So that gives us prev_days. Now, if prev_days is equal to the sequence length that we require.
So if len of prev_days equals 60, and actually not 60, but SEQ_LEN, whatever that is; that is 60 right now.
We're gonna say sequential_data.append.
And now we're gonna append our Xs and our ys.
So our features and our labels: that's gonna be np.array, so it's gonna be a NumPy array of prev_days.
So it's that sequence of 60 features, or 60 feature sets.
I guess we could call it that, so append that.
And it really does need to be a list.
So it's the features and then the label.
So the label is pretty simple.
It's just i at negative one.
So that's the current label right now.
What's the current label?
So based on the last 60 minutes of all that data, we're hoping the model could predict that label, either a zero or a one.
Okay, so we build sequential_data, and when we're done, we're going to do a random.shuffle of sequential_data, and do we have random? We do not, so we'll import random.
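Here is the sequence-building loop as a sketch on a toy, already-scaled frame. The frame contents and `SEQ_LEN = 3` are assumptions for readability; the loop structure follows the description above, with `target` assumed to be the last column.

```python
import random
from collections import deque

import numpy as np
import pandas as pd

SEQ_LEN = 3  # 60 in the video

# Hypothetical preprocessed frame; target must be the last column.
df = pd.DataFrame({
    "close": [0.1, 0.2, 0.3, 0.4, 0.5],
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "target": [0, 1, 0, 1, 1],
})

sequential_data = []                  # will hold [features, label] pairs
prev_days = deque(maxlen=SEQ_LEN)     # sliding window over the most recent rows

for i in df.values:                   # i is one row: every feature, then target
    prev_days.append([n for n in i[:-1]])   # keep the features, leave out target
    if len(prev_days) == SEQ_LEN:           # only emit once the window is full
        sequential_data.append([np.array(prev_days), i[-1]])

random.shuffle(sequential_data)       # safe now: the out-of-sample rows were split off earlier

print(len(sequential_data), sequential_data[0][0].shape)
```

Five rows with a window of three give three sequences, each a `SEQ_LEN` by number-of-features array paired with the label of its last row.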
Let's run that.
See if we hit an error.
We do: NumPy not defined.
We're gonna need that as well.
If you don't have NumPy installed, do pip install numpy. We will try again.
Just a warning, this step could take a little bit.
You're building sequences, but it's not... Okay, so I believe we're done.
I don't think I'm gonna continue until the next video; this was probably pretty long.
Okay, so now we have our sequences.
We've got our targets.
We are closing in on the ability for us to feed this through a model.
We still have a few more things we need to do.
Uh, but we're getting close.
So this one's a little dense.
So if the other ones have been kind of dense, if you've got questions on whatever's happening here, or if you think there's a better way to do something, please, by all means.
Let's talk about it in the comments section, let me know.
Otherwise I will see you guys in the next tutorial, where we're hopefully going to get closer.
We probably won't train the model in the next one; it's probably the tutorial after the next one.
But we'll continue trudging along, so anyways, questions, concerns, whatever, feel free to leave them below.
Otherwise I will see you guys in another video.