Machine Learning with Scikit-learn - Data Analysis with Python and Pandas p.6


What is going on, everybody?
Welcome to part six of the Python data analysis and data science with pandas tutorial series.
In this last installment, we're going to be talking about applying machine learning to a pandas DataFrame.
So basically, what is the typical workflow when doing machine learning with pandas datasets?
Generally, if I'm going to do machine learning, I do all my data preprocessing in pandas first anyway, so it's just convenient to know.
Okay, well, what's the next step?
Once we've got our features, how do we feed it through a model?
So that's what we're doing here.
We're gonna be grabbing a new data set and we're gonna grab the diamonds data set.
So go ahead and download that.
Put it in your data sets directory.
So what's our objective gonna be?
Well, we're going to be predicting the price of diamonds.
So basically, this table contains a bunch of values here.
We've got the carat, cut, color, clarity, depth, and table.
I don't know what that means, but it's table percent.
We've got the price, and then x, y, z, which are the length, width, and depth of that diamond.
So the curiosity is: can we take all of those values except for price, feed those through a regression model, and predict the price of that diamond, so that in the future, when we get diamonds and we don't know how much to pay for them, we could just run them through a model?
So typical regression task here.
Nothing too fancy.
And the question is, are these features all descriptive enough to give us the price of this diamond?
Probably; we shall find out.
You can use another dataset if you'd like.
This one has almost 54,000 rows, which is quite a few samples that we can actually work with here, which is pretty important.
For traditional ML, you're probably gonna want more than 10,000 rows, I would think.
And then, if you want to do, like, deep learning, more than 100,000 rows; we're just using regular machine learning models.
Nothing too fancy.
We're not going deep learning here.
But if you want to learn more about what we're doing, or more about machine learning, or just scikit-learn, or just basic ML models in general, you can check out this initial machine learning tutorial series here.
In this one, basically what we do is we go through starting with regression, which is what we're gonna do here, talking about how regression works.
We do an applied example using scikit-learn.
And then we actually write a regression model ourselves.
And really, we do that with all of them.
So we do that with regression, K nearest neighbors, support vector machines, and so on.
So the idea is to learn about the model, learn how to apply it with scikit-learn, and then learn how to write it ourselves from scratch.
We do use NumPy and stuff like that, but we're not using some sort of machine learning library.
So it's a pretty cool series.
Then, if you wanna learn about deep learning, check this one out.
So anyway, if you want to learn more about that kind of stuff and parameters, because there's a lot of parameters here that we're gonna be working with, and if you feel like you're still kind of in the gray area as far as what all this stuff means, you can check that series out.
So now, let's say you are a complete amateur.
You can check out this choosing the right estimator chart.
Basically, if you just Google choosing the right estimator, you will find this.
This is from scikit-learn, by the way.
And this is just how you can pick a specific estimator, or classifier; it's like your model.
How do you pick the right one for what you're working with?
So for us, we have more than 50 samples.
We do not want to predict a category.
We do want to predict a quantity.
So we're doing regression, because we want to predict price, which is a regression task as opposed to classification, where you're trying to predict, you know, one out of five classes; in this case it's not classification, because it could be any price, right?
We're just trying to come up with some sort of calculation for price.
Do we have less than 100,000 samples?
If no, so if we have a lot of samples... that's kind of confusing.
If you have more than 100,000 samples, go with SGD Regressor.
Otherwise, you should go either here or with support vector regression with a linear kernel.
So we'll probably use SVR with a linear kernel.
And again, if you don't know what that is, you just click on it, right? And this will tell you.
Okay, here's what you need to do.
So, okay.
I understand.
Right?
Okay.
From sklearn import svm. Got it.
Here's your class.
Got it. Do a fit. Got it. Easy, right?
So let's get started.
So we're gonna import pandas as pd.
We are going to say df = pd.read_csv, and then datasets (that's not a capital D), datasets/diamonds.csv.
Then we also want to say index_col=0, because this dataset graces us with a... that's avocado... with a lovely index.
Not only is it useless, it's also in string form.
Fascinating.
Uh, okay.
So come back over here.
So we're gonna say index_col=0, because that way we don't generate duplicate indexes.
In this case, the index column is completely useless, but one thing you always want to take note of... let's go ahead and just do df.head() here.
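As a rough sketch of the loading step just described: the snippet below reads a tiny in-memory stand-in for the file so it runs without the download. The three rows are illustrative only, and the path in the comment assumes the file was saved as datasets/diamonds.csv.

```python
import io
import pandas as pd

# Illustrative three-row stand-in laid out like diamonds.csv;
# the first, unnamed column is the redundant row index.
csv_text = '''"","carat","cut","color","clarity","depth","table","price","x","y","z"
"1",0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43
"2",0.21,"Premium","E","SI1",59.8,61,326,3.89,3.84,2.31
"3",0.23,"Good","E","VS1",56.9,65,327,4.05,4.07,2.31
'''

# With the real file (assumed path): pd.read_csv("datasets/diamonds.csv", index_col=0)
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df.head())
```

Passing index_col=0 tells pandas to use that first column as the DataFrame index instead of keeping it as a feature column.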
Any time you're doing machine learning, it's really easy to cheat.
Even when you're trying not to cheat.
You're doing your very, very best.
It's still easy to.
So looking at this dataset, it's easy to cheat.
Anyway, looking at this dataset, basically all of the columns here are meaningful columns, except price.
Well, price is meaningful as well, but price is what we're trying to predict.
So when we go to build this model, we actually want to use all of the columns except for price.
But when we get to that point, I'm gonna point something out.
And if I forget, someone comment below, because it's important.
OK, so that's our data set.
Now, one thing with machine learning, and all of the data that you pass into your model: at the end of the day, basically all machine learning is linear algebra.
Okay, so everything has to be numbers.
We have to convert everything to numbers, and ideally they're meaningful numbers, because if they're not meaningful, they're useless.
So there are ways. We have cut, color, and clarity.
All of these need to be converted to numerical values.
In pandas, as well as probably scikit-learn, and I know it's in Keras and TensorFlow, there are always, like, to-categorical methods that you can call.
So, like in pandas, we can say df...
So, for example, what if we just had df["cut"].unique()?
Okay, there's not very many here, but you can imagine a scenario where there's a lot.
So this presents a problem.
We need to convert these to numerical values.
Well, one option you have is df["cut"]...
Anyway, I was trying to see if that had a meaningful order; I don't believe it does.
But anyways, df["cut"], and then we could say .astype.
And we can convert this type to categorical.
Actually, I think it's category, and then .cat.code.
Let me run that real quick.
Uh, codes.
There we go.
Okay, so this will take cut.
It will figure out, okay, how many uniques are there, and then it just assigns, you know, the first one it finds is zero, the second one is one, then two, three, and it just keeps doing that until it's reached the maximum number, and so it just assigns an arbitrary code to our cut.
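A minimal sketch of that category-code conversion, using a small made-up cut column rather than the real dataset. One detail worth knowing: when pandas infers the categories itself, it sorts them, so for strings the codes follow alphabetical order rather than order of appearance or quality.

```python
import pandas as pd

# Made-up sample of the "cut" column.
cut = pd.Series(["Ideal", "Premium", "Good", "Very Good", "Fair", "Good"])

print(cut.unique())

# Convert to a categorical dtype and read off the integer code per row.
codes = cut.astype("category").cat.codes
print(codes.tolist())
```

Here "Fair" gets 0 and "Very Good" gets 4 purely because of sorting, which is why these codes are fine as arbitrary class labels but do not preserve the quality ordering.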
The problem is, we're doing regression.
Likely linear regression.
And we would prefer our values here to have meaning behind them.
Because this isn't arbitrary, right?
Premium is better than fair, and so on, so we want to preserve that order.
So we're gonna preserve that order.
So just know this exists for later.
If you're doing classification, you just need arbitrary classes.
So when you're doing classification, you could totally use this to make your classes into codes.
But in this case for features, we want our features to be meaningful, so we're not gonna do that.
So instead, what I'm gonna do is create dictionaries for all these things.
And in fact, I'm gonna copy and paste these from the text-based tutorial, because there's nothing to glean here.
So, copy, paste.
Whoa, we're zoomed out a little bit there... a little far in there. And then copy and paste this one.
So these will just be dictionaries that we're gonna map.
How did I know this? Am I a diamond aficionado?
No.
I found these keys from the description here of the dataset.
They ordered them and stuff for us.
So, um, yeah.
So now, interestingly, it started this at zero. It shouldn't matter, but that's funny.
I kinda wanna fix it.
Let's fix it.
I don't know how I just noticed that now, but anyway, let me fix that real quick.
Four, five, six, and seven.
I don't really want to pass a value of zero to my regression model, if I could avoid it.
I'll fix that in the text-based version.
Okay, so now that we have these, we just want to map them, and so I'm just gonna come in here and I'm going to say df["cut"] = df["cut"].map, and then we just map cut_class_dict.
Right?
That's it.
And then we're gonna do the exact same thing for clarity and color.
So I'm just gonna copy, paste, copy, paste, do this, copy, paste, paste.
And then it's just clarity_dict and then color_dict.
So it's just color... color, color, and color.
Awesome.
So after those, let's just check it with df.head() at the very end.
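The mapping step might look roughly like this. The dictionary contents are not spelled out in the video, so the orderings and values below are assumptions (worst to best, starting at 1 instead of 0), and the tiny DataFrame is made up; the clarity column would be mapped the same way with its own dictionary.

```python
import pandas as pd

# Assumed ordinal encodings, worst to best, starting at 1 rather than 0.
cut_class_dict = {"Fair": 1, "Good": 2, "Very Good": 3, "Premium": 4, "Ideal": 5}
color_dict = {"J": 1, "I": 2, "H": 3, "G": 4, "F": 5, "E": 6, "D": 7}

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "cut": ["Ideal", "Premium", "Good"],
    "color": ["E", "E", "J"],
    "price": [326, 326, 327],
})

# Series.map replaces each value with its dictionary entry.
df["cut"] = df["cut"].map(cut_class_dict)
df["color"] = df["color"].map(color_dict)
print(df.head())
```

Any value missing from a dictionary would come back as NaN, so a quick look at df.head() or df.isna().sum() after mapping is a cheap sanity check.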
Great.
We now have our dataset; basically, it's ready.
It's been converted.
We're ready to pass this through an actual model.
Sort of.
We'll talk about why not in a second. First we need scikit-learn.
Let's go to our favorite command prompt and do a pip install scikit-learn.
We will grab scikit-learn.
While we're doing that, I'm just gonna come back over, I think, and just start typing. So, cool.
So we're going to import sklearn, and then we're gonna go from sklearn import svm.
So, um, I want to run this, but I can't because it probably hasn't installed yet... it has.
Cool.
So, can I run that?
Awesome.
Okay, so what we want to do first: any time you've got data, you probably wanna shuffle that data, because we're often gonna train in, like, order.
And the latest thing is gonna be more biasing than the first thing that the model saw.
So it's usually a good idea to shuffle the data, especially if that dataset is sorted in any way.
And is this dataset sorted in any way? Well, we come over here, and if we scroll to the tippy top... this dataset appears to be ordered by price, so that's a problem.
And then also, it could become a problem later on.
So first I'm gonna shuffle this dataset just to get it over with.
So there are many ways to shuffle a DataFrame.
There's a way in pandas to do it by shuffling by index.
But I don't want to do that because that's ugly.
The way pandas has you do it, I just don't like.
In scikit-learn it's really simple, and we're using scikit-learn already.
So I'm gonna say df = sklearn.utils.shuffle(df). Done.
That DataFrame is shuffled.
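A small sketch of the shuffle on a made-up, price-sorted DataFrame. The random_state argument is an addition here so the result is repeatable; the video does not set it.

```python
import pandas as pd
from sklearn.utils import shuffle

# Made-up frame that, like the real dataset, is sorted by price.
df = pd.DataFrame({
    "carat": [0.2, 0.3, 0.5, 0.7, 1.0],
    "price": [300, 400, 900, 2000, 5000],
})

# Rows are permuted together, so each carat stays with its price.
shuffled = shuffle(df, random_state=0)
print(shuffled)
```

The original index labels travel with their rows, which is handy for confirming that only the order changed.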
Now we want to assign values for X and y. In machine learning, generally, capital X is your feature set.
Lowercase y (sometimes you'll see people use uppercase Y) is your labels.
So X, what is X? X is the feature set.
So this is the list of features that points to that label.
What's the label? Price.
So the list of features is basically everything except for price.
Right.
So this is pretty simple. In this case, we just do df.drop.
And we're just gonna drop that price column on axis 1, because we're not trying to drop rows.
We're trying to drop columns.
Then we convert that to values, and then the next thing is y:
this is just df["price"].values.
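In code, the feature/label split described here could be sketched as follows, on a made-up DataFrame whose categorical columns are already numeric.

```python
import pandas as pd

# Made-up, already-encoded rows.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23],
    "cut": [5, 4, 2],
    "price": [326, 326, 327],
})

# Features: every column except price (axis=1 means drop a column, not a row).
X = df.drop("price", axis=1).values
# Labels: the price column on its own.
y = df["price"].values

print(X.shape, y.shape)
```

Both X and y come back as NumPy arrays, which is what the scikit-learn estimators used later expect.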
Now, I did say that I wanted to point something out, so let's just run that real quick.
And then let's print.
Let's just look at X really quickly.
So look at how X compares to up here.
For example.
It appears that this is indeed Kera... Kera, I'm thinking Keras.
Anyway, this is carat, right?
Now, one thing to watch out for: imagine at the very beginning that we did not do index_col.
So if we hadn't done that, this index value here... of course, it's probably a string; pandas might have fixed that for us, but imagine it didn't.
And now one of your columns was actually gonna be the index, which is just an incremental number, one to whatever.
Would that be a problem, or would that just be noise?
Well, that would be a problem; that would inform the model of price.
Why?
Because this dataset is sorted by index, apparently, because the index increments by one.
And it was sorted by price.
So that is one way that you could, just unbeknownst to you, have cheated.
And that's the kind of stuff you gotta watch out for.
It's super difficult; you're gonna find this kind of stuff happens all the time.
It's really possible that I'm cheating in some way already on this dataset.
I didn't spend too long trying to make sure I don't, but if I do, let me know below.
But anyway, that's the kind of stuff you gotta watch out for, because it will get you.
So anyways, it appears that we've done it right.
One way you could always be certain is, instead of, you know, blacklisting just the price column, you could whitelist just the columns that you want to include.
But I don't feel like typing all those out, so that should be fine.
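To make the leak concrete, here is a toy sketch: a made-up frame sorted by price with a leftover row-number column (named row_id here purely for illustration), followed by the whitelist approach of naming the feature columns explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame sorted by price, with a leftover row-number column.
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5],
    "carat": [0.3, 0.25, 0.5, 0.45, 1.0],
    "price": [300, 350, 900, 950, 5000],
})

# The row number alone tracks price, so a model could "cheat" with it.
leak = np.corrcoef(df["row_id"], df["price"])[0, 1]
print(f"correlation between row_id and price: {leak:.2f}")

# Whitelisting: list the features you want instead of dropping the ones you don't.
feature_cols = ["carat"]
X = df[feature_cols].values
y = df["price"].values
```

With a whitelist, a stray column that sneaks into the DataFrame later never reaches the model unless someone adds it on purpose.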
Now, uh, actually... okay, now I also kind of wanna scale, so I think that's what we're gonna do now.
So, from sklearn import svm; we're also gonna import preprocessing.
In general, scaling often helps, sometimes only a little.
I've never seen it hurt.
So generally what you want to do is scale your data, and the target is usually between 0 and 1, but really the point of scaling is to simplify your data for a model.
At the end of the day, these models again are just linear algebra.
So the simpler we can make this problem, the more we can reduce this problem's complexity, the better.
So with scaling, we're just trying to bring the range of all these values to something a little more digestible by our model.
So I'm just gonna say now X = preprocessing.scale(X).
On your own time, you can comment this out and see how much of an impact it makes, or go to the text version of the tutorial.
I show the difference there.
These models are gonna take a little bit to train, and I don't wanna waste time.
So anyway, uh, so we've scaled our data.
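A quick sketch of what preprocessing.scale does, on a made-up feature array. Note that this particular function standardizes each column to zero mean and unit variance rather than squeezing values into the 0 to 1 range; scikit-learn's MinMaxScaler is the tool for the latter.

```python
import numpy as np
from sklearn import preprocessing

# Made-up features: carat and depth for three diamonds.
X = np.array([
    [0.23, 61.5],
    [0.50, 59.8],
    [1.20, 56.9],
])

# Standardize each column independently.
X_scaled = preprocessing.scale(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```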
Now what we want to do is we want to split apart training and testing data.
So I'm gonna say test_size = 200, because we've got, like, 54,000 rows.
We've got tons to work with.
Now, we're gonna say X_train, y_train, X_test, y_test.
So we want to split apart some number of our data that the model will never see.
So X_train and y_train: these are the things that our model is gonna fit against.
So we're gonna train it on that data, and then later, when we're done training, we're gonna test our model on this data because it's never seen this data.
If you tested it on data that you fit on, you're likely to perform much better than on data that it's never seen.
So the real true test is always out-of-sample data it's never seen.
Again, this is a really easy thing to screw up and cheat on and somehow get 100% accuracy, but not really, so keep that in mind.
So now X_train is going to be equal to X up to negative test_size.
So this is up to the last 200. Copy, paste; make sure that one is y, otherwise you'll not be happy.
And then this will be X for the last test_size.
So for the last 200: colon, don't forget that either.
That would be annoying.
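The slicing described here can be sketched on small made-up arrays, with a test_size of 3 instead of 200 so the result is easy to inspect.

```python
import numpy as np

# Made-up data: 10 samples, 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

test_size = 3

# Everything up to the last test_size rows is for training...
X_train = X[:-test_size]
y_train = y[:-test_size]

# ...and the last test_size rows are held out (note the trailing colon).
X_test = X[-test_size:]
y_test = y[-test_size:]

print(len(X_train), len(X_test))
```

Without the trailing colon, X[-test_size] would return a single row rather than the last test_size rows.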
Okay, so now we're ready.
Now we define our classifier, so clf is generally the standard for classifier; clf equals, and we've imported svm already, so we'll do svm.SVR, and then we'll say kernel="linear".
And again, if you want to see linear versus... I also run an RBF, uh...
Um, probably back.
Yeah.
I also run the... I do SGD regressor, I think, RBF, and linear kernel; I can't remember.
I might only do RBF, to be honest; I can't remember. Anyway, if you want to see some, you can check that out or just run it yourself.
See how you do.
Anyway, we're gonna use a linear kernel. Then, now we do the training, which is called fit.
So clf.fit, not "sit", fit.
And that's X_train, y_train.
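As a self-contained sketch of defining and fitting the model, the snippet below trains on small synthetic data so it finishes instantly; the synthetic features and coefficients are assumptions for illustration, not the diamonds data.

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in for scaled features and a roughly linear target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + 3.0

# Support vector regression with a linear kernel.
clf = svm.SVR(kernel="linear")
clf.fit(X_train, y_train)

# Predictions for the first two training rows.
print(clf.predict(X_train[:2]))
```

On the full dataset of roughly 54,000 rows the same fit call takes noticeably longer, which is why the video pauses here.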
So that process is gonna take a moment.
So I'm gonna go ahead and run that, and I'm just gonna glance over it.
Make sure I did everything right before I waste a whole bunch of time.
Preprocess, price values, blah blah blah, values here.
Scale, X_train, y_train up to the last 200.
And then this is the last 200.
Okay, great.
So now, once we fit a model, we'd like to know how good it did.
So you could do that with clf.score.
And we will score based on X_test, y_test. The score is going to be an R squared, the coefficient of determination, in the sense of basically zero is bad.
1.0 is great.
So, same thing with that.
So that's how you score a regression model.
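To show what that score actually measures, here is a small sketch that computes R squared by hand on made-up prices and compares it with scikit-learn's r2_score, which is the same quantity a regressor's score method reports. One caveat to the zero-is-bad rule of thumb: R squared can go below zero when a model does worse than always predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices.
y_true = np.array([300.0, 500.0, 1000.0, 4000.0])
y_pred = np.array([350.0, 450.0, 1200.0, 3500.0])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```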
In the case of classification, it will be quite literally: is it right, is it wrong?
0 to 100%, done.
So it's a little more simple there.
In this case, with regression, it's more of a factor of, like, how wrong are you.
And probably, in theory, that's even more accurate, to be honest.
So, clf.score, okay.
And the score is fine, but especially with regression... with classification, the score is, like, totally obvious: okay, so we screwed up 30% of the time.
With regression, you're almost never gonna be... you're not gonna be perfectly accurate.
If you are, the challenge was too easy or you cheated.
So the question of, like, R squared is kind of super vague, especially in the sense of diamonds.
Like, does that mean we're thousands of dollars off, or are we just, like, a few dollars off on each diamond?
If we're only, like, a few dollars off, that's an awesome model.
So what does that mean?
Well, we could just look at it. So instead what I'll do...
Okay, so we got our answer, and it says we did 0.874, which is actually super accurate.
That's pretty darn impressive, to be honest.
Um, cool.
So that's a good-looking score, or R squared.
But now we want to see it with our own eyes.
What did it actually predict?
So I will say: for X, y in list(zip...
You don't have to make this a list.
I'm converting it to a list so we can slice it.
Because I don't really want to show all of them.
I don't think... it's 200 things.
I guess we could show them all.
To be honest, let's just show them all; this is the only one we'll probably do anyway. So, zip.
And then we want to zip together X_test, y_test.
X, y; then we're gonna say print, and we'll make this an f-string: model.
And the model predicts clf.predict, and in predict you always pass a list, and it always returns a list, even if you just want the one thing.
So predict X, and then we'll say the zeroth, because we're only expecting one thing.
That's what the model predicts.
And then we'll just say actual, and actual is whatever y was. Cool.
So that should be 200 things.
So we'll just print that out, and we can see how quickly that went.
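The prediction loop described here might look roughly like this. The data is synthetic so the snippet runs on its own, and the loop variables are named features and actual (instead of X and y) to avoid shadowing the full arrays.

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in data with a held-out tail of 5 samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 10 + 5 * X[:, 0] + 2 * X[:, 1]
X_train, y_train = X[:-5], y[:-5]
X_test, y_test = X[-5:], y[-5:]

clf = svm.SVR(kernel="linear").fit(X_train, y_train)

for features, actual in zip(X_test, y_test):
    # predict expects a 2D input (a list of samples) and returns an array,
    # so wrap the single sample in a list and take element 0 of the result.
    predicted = clf.predict([features])[0]
    print(f"Model: {predicted:.2f}, Actual: {actual:.2f}")
```

Calling clf.predict(X_test) once returns all the predictions in a single array, which is usually faster than predicting one sample at a time.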
Uh, and then let's zoom in a little bit.
Okay?
So we can see that the model is, you know, close on the first two.
Here, the model suggests that we pay somebody to take the diamond off our hands.
That's probably not an ideal situation there.
Um, okay.
And then we continue on; most of them, I mean, we're, like, in the general ballpark.
But, I mean, some of these are, like, still pretty far off.
But you can see that, like, you know, this $12,000 diamond, the model knew:
okay, this is more expensive and more valuable than this diamond, and so on.
But we still are off quite a bit on some of these, like this negative.
So the question is, why is that happening?
I don't know.
Um, but generally, the way that we would overcome something like this just for the record, is you've got this model, and then generally what you're going to see is someone trains another model.
So maybe I will show the RBF kernel.
So get the heck over here.
Okay, so what I wanna do is I'll just take this: copy, copy, copy, copy.
Go down here.
Paste.
Kernel, let's do RBF. Fit.
Great.
We'll start that ball rolling, and we will come up here and we'll just run the score, and we'll do this, and then I'll print this here, so we'll see it at the tippy top when it's done.
I believe this will take a long time to run, though.
So I'm gonna pause, and then come back.
But otherwise, I think we're basically done here.
But I do want to see what happens.
So, in general, what's gonna happen is, when you actually use machine learning models in practice, chances are it's not a scenario where you've got, like, one model to rule them all.
That's not how it works.
You generally will do something more like a voting classifier or an ensemble of classifiers, so you'll have five, nine, thirty-three, ninety-one classifiers, right?
You have a ton of classifiers, and then they all will vote, or in this case, they will all make predictions, and you'll take the average prediction or something like that.
Also, like in the case of diamonds, no diamond should be in the negatives.
So if one of the classifiers predicted a negative, you just, like, toss that prediction out.
So in general, when you combine a bunch of classifiers, that almost always improves things, kind of like preprocessing.
I've never seen that actually hurt unless you use like a horrible classifier for some reason, and sometimes it only helps just a tiny bit.
But yeah, if you always want to seek out the best performance, that's a great way to do it.
Like, you could probably make a voting classifier, and scikit-learn has methods for doing that.
You could probably make a voting classifier and certainly get R squared to be way better.
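A minimal sketch of the average-and-discard idea, using hand-written hypothetical predictions from two models rather than trained ones. For regression, scikit-learn also ships a ready-made averaging ensemble, sklearn.ensemble.VotingRegressor, though it has no built-in step for discarding negative predictions.

```python
import numpy as np

# Hypothetical price predictions: one row per model, one column per diamond.
all_preds = np.array([
    [100.0, -50.0, 300.0],   # e.g. a model that occasionally goes negative
    [120.0,  80.0, 280.0],   # e.g. a model that never does
])

# A diamond can't have a negative price, so mask those predictions out...
valid = np.where(all_preds >= 0, all_preds, np.nan)

# ...and average whatever remains for each diamond.
ensemble = np.nanmean(valid, axis=0)
print(ensemble)
```

If every model predicted a negative price for the same diamond, nanmean would return NaN for it (with a warning), so a real pipeline would need a fallback for that case.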
Okay, anyway, quick shoutout to the channel's most recent members, zouk E v J.
Trevor and how Yang: thank you guys very much for your support.
You guys help me do what I really love to do.
So you guys are awesome for that.
All right, we're done.
So, 0.55; no, it definitely did worse in terms of R squared.
But if we zoom on in, enhance, we appear to not have any negatives, which is a perfect example of why we tend to use voting classifiers.
Because while this classifier is less accurate in terms of R squared, it's not doing this negatives nonsense.
So that carries some value, but yeah, you can see it's, like, way off on a lot of these.
Okay.
Anyways, you guys can play more with that.
If you want, see if you can make a classifier that's even better.
Or rather, a regression model that's even better.
Anyway, that's it for now.
If you guys have questions, comments, concerns, whatever, feel free to leave them below.
Otherwise, that's probably it for the series.
I might add some more later on, but probably not.
I don't see what else I need to add about pandas, to be honest; just check out the pandas docs to learn more at this stage.
So I will see you guys in another video.