  • What is going on, everybody?

  • And welcome to yet another installment of the data analysis and data science tutorials with Python and Pandas.

  • In this tutorial, we are going to be working with a new data set: the US minimum wage by state from 1968 to 2017.

  • So this is the least amount of money an employer can pay an employee, organized by state, and it has high and low values, both in that time's terms and also calculated out using the CPI, which is like the cost of living, let's say, for a really quick definition, and so they've got the values in, like, 2018 dollars.

  • So we'll probably be working with that; we'll work on the true low.

  • So the actual lowest minimum wage by state, or something like that, just to simplify things.

  • So, uh, anyway, this is the data set that we're gonna use.

  • So go ahead, download that.

  • Extract that, and I'll have that in the data sets directory.

  • So coming over here, let's go ahead and read it in.

  • So, uh, import pandas as pd, and df = pd.read_csv("datasets/Minimum Wage Data.csv").

  • Uh, Okay, so let's go ahead and read that in.

  • And boom!

  • We get this nasty, nasty error.

  • It's some sort of Unicode decode error, and we get blah, blah, blah.

  • So basically what we're seeing here is pandas wants to use UTF-8 encoding by default.

  • But for some reason, we have an encoding issue.

  • Now, I have not confirmed this, but I'm going to guess, if we look at Table_Data and we come over here and look at, uh, where is it?

  • Yes.

  • Oh, Table_Data: this is the scraped, unclean data from the U.S. Department of Labor.

  • So that's where this data is coming from.

  • So I'm going to, uh, guess that it has some sort of issue there.

  • Which is odd, because when you scrape, you're getting data in UTF-8, right? So I don't know.

  • Anyway, we have an encoding issue. So the first thing I would always assume is that, by default,

  • the encoding is almost certain to be UTF-8, but, um, if you have an issue like this, the next best thing to try is gonna be Latin encoding.

  • Let's try that, and sure enough, that works.

  • So that's what we'll go with for now.

  • But one thing we can do is just go ahead and save this so we don't have to deal with that ever again: say df.to_csv, and let's save that as datasets/minwage.csv with the encoding of utf-8.

  • Okay, cool.

  • So that is one quick example of, ah, why we might use pandas to convert things.
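
The encoding fallback described above can be sketched as follows. This is a minimal illustration, not the tutorial's exact notebook cell: the temporary file, its contents, and the `index=False` argument are assumptions added so the snippet runs on its own without the real download.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the downloaded file: a tiny CSV saved as Latin-1,
# containing a byte (0xE9, "é") that is not valid UTF-8 on its own.
tmp_dir = tempfile.mkdtemp()
raw_path = os.path.join(tmp_dir, "raw_latin.csv")
with open(raw_path, "wb") as f:
    f.write("State,Footnote\nAlabama,caf\u00e9\n".encode("latin-1"))

try:
    df = pd.read_csv(raw_path)  # pandas assumes UTF-8 by default
except UnicodeDecodeError:
    df = pd.read_csv(raw_path, encoding="latin")  # fall back to Latin-1

# Save a UTF-8 copy so later reads need no encoding argument.
# index=False is an addition here; it keeps the row index out of the file.
clean_path = os.path.join(tmp_dir, "minwage.csv")
df.to_csv(clean_path, encoding="utf-8", index=False)

df = pd.read_csv(clean_path)
print(df.head())
```

Once the UTF-8 copy exists, every later `pd.read_csv` call can use the default encoding.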

  • So next, what we're gonna do is, uh, df = pd.read_csv, and we will just read in this CSV.

  • So, uh, let's just read that in, and df.head(). Cool.

  • Okay, so this is the data that we're working with.

  • Um, and what we're interested in mostly is probably this low column. Now, uh, basically the objective here,

  • um, or at least the interesting thing, would be to do almost the same thing that we did in the last tutorial, but I'd like to see some sort of correlation.

  • So this is by state over time, a value that's changing, much like we had with avocados.

  • And in fact, you could carry on this tutorial with the avocado data set, up to this point at least, um, so our objective is gonna be kind of the same thing.

  • It's gonna be: okay,

  • we've got these columns of values, um, or this column that contains a bunch of values, and we actually want to spread those out as columns along the top.

  • Now, what we did in the last tutorial is totally fine.

  • But I'm gonna show you guys a new way to do that, the pandas way.

  • So, uh, let's see how that works.

  • So I'm gonna say gb, for group by, equals df.groupby, and then we're gonna group by state.

  • So I'm just gonna say "State" there, and then, uh, well, we could do a couple of things.

  • So this is going to create this GroupBy object.

  • So one thing we can say is gb.get_group, and then we can get a very specific group.

  • So let's just say, like, Alabama, just the first state that we see here.

  • get_group("Alabama"), uh, .set_index, and we'll set the index to be the year, and then we'll just print out the head.

  • Okay, so that's one way we can just grab a group.
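
A minimal sketch of grabbing one group, assuming the column names `Year`, `State`, and `Low.2018` from the tutorial's data set. The rows below are made-up toy values, not real minimum wage figures.

```python
import pandas as pd

# Toy stand-in for the minimum wage table (values are invented).
df = pd.DataFrame({
    "Year": [1968, 1969, 1968, 1969],
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Low.2018": [0.0, 0.0, 15.12, 14.33],
})

gb = df.groupby("State")  # builds a GroupBy object; nothing is computed yet

# Pull out the rows for a single state and index them by year.
alabama = gb.get_group("Alabama").set_index("Year")
print(alabama.head())
```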

  • But the other thing is, we can actually just iterate over the groups.

  • So, for example, we could do something very similar to what we did before, but do it in less code, basically.

  • So we're calling this act_min_wage, for actual minimum wage, equals pd.DataFrame().

  • So that's gonna be the same.

  • And we'll do that as well.

  • And then what we're gonna say is for name, group in df.groupby("State"); we could also have just used the gb we saved above.

  • But I'll just do this for the sake of clarity of exactly what's going on here.

  • Um, so that's how you will iterate.

  • So you'll get name, because for the thing we grouped by, it'll be that specific name.

  • And then you'll get group, and that will be basically your data frame.

  • So for name group.

  • Okay, so then all we have to ask at this point is if act_min_wage.empty; we're going to do something pretty similar.

  • But we're just gonna say act_min_wage = group.set_index, and we'll set the index as Year.

  • And then again, we want that to be a DataFrame, not a Series.

  • So we're gonna say [["Low.2018"]], and then we're gonna throw in a .rename.

  • And then this is how you can rename columns.

  • Um, so I'm gonna say rename, and then we're gonna say columns equals; I was hoping it would look a little better as we went off the screen.

  • But that's OK.

  • And then that would be a dictionary.

  • And inside the dictionary, you're gonna put, uh, the original name and then the new name you want it to have.

  • So the original name is Low.2018.

  • That's becoming a challenge to see.

  • It's unfortunate. Anyway, Low.2018.

  • And then what we want to change that name to is whatever the name is.

  • So it's just the state.

  • Okay, so now we've done the rename.

  • Let me make sure: .rename, columns.

  • So I need to close off our rename.

  • That's really hard.

  • This doesn't run off the screen as well as one would hope anyway.

  • Or if I do this, can we see it?

  • Yeah.

  • Okay.

  • So I'll just zoom out for now.

  • Uh, group.set_index("Year"), Low.2018, rename columns.

  • Cool.

  • Okay, so that's if it's empty.

  • Um, otherwise, what we would want to say, I'm gonna copy that line, and then we're just going to say else: act_min_wage = act_min_wage.join, and then that exact line.

  • Okay, hopefully that's my join.

  • It should have another, um, is it there? It is there.

  • Okay, we just still can't totally see it, but anyway, let me zoom out even further.

  • Um, just make sure you've fully closed all that off.

  • Okay, so now we'll come down here and do act_min_wage.head().

  • Whoo!

  • Okay, now, let me clean that up.

  • Actually, let's do that now and zoom back in.

  • And hopefully we don't have any super long lines anymore.

  • Um, that makes things hard.

  • So, okay, here we have, uh, everything, the actual, you know, low data, organized by state and then by year, which is pretty cool.

  • So, kind of the same thing we did with the avocados, just using groupby rather than using Python logic. Again,

  • um, you can always use your Python logic to do things.

  • But chances are, in pandas there's a built-in way, because doing something like this is a super common task.

  • You're not the first one to need to do this, so it probably exists.
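
The loop described above, iterating over the groups and joining one renamed column per state, could look roughly like this. The column names follow the tutorial; the rows are made-up toy values so the snippet runs without the real file.

```python
import pandas as pd

# Toy stand-in for the minimum wage table (values are invented).
df = pd.DataFrame({
    "Year": [1968, 1969, 1968, 1969],
    "State": ["Alaska", "Alaska", "Arkansas", "Arkansas"],
    "Low.2018": [15.12, 14.33, 1.12, 1.06],
})

act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    # Double brackets keep a one-column DataFrame (not a Series), so that
    # .rename(columns=...) can relabel the column with the state's name.
    column = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018": name})
    if act_min_wage.empty:
        act_min_wage = column
    else:
        act_min_wage = act_min_wage.join(column)  # aligns on the Year index

print(act_min_wage.head())
```

The result has one row per year and one column per state, which is the shape the later `describe` and `corr` calls work on.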

  • Um, all right, so now what are some other things that we can do, like with this data?

  • So one of the cool things that we can do right out of the gate with a new data set is, like, act_min_wage.describe(), and this just gives us a quick kind of rundown of some basic stats of our data frame.

  • So in this case, we can see things like count.

  • This is just how many rows of information we have. Mean: this is the average, right?

  • So this is the average minimum wage over the course of all these years in 2018 terms. Std: this is the standard deviation, so we can see which states have varied greatly.

  • Uh, this is the minimum value, and then these are, like, your percentiles, and then this is the maximum value.

  • So just really quickly, we get a lot of information with .describe().

  • Then maybe later

  • we could use this and maybe graph it in some way, so that you could just graph the standard deviation for all the states and then see immediately:

  • Okay, who's got the biggest standard deviation?

  • Something like that.

  • Um anyway, so that's one cool thing that we can do to get a lot of information from our data really quickly.
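
A small sketch of what `describe` returns, using an invented state-by-year table in place of the real `act_min_wage`. Picking the state with the largest standard deviation is one way to act on the idea mentioned above; the variable name `widest` is just for illustration.

```python
import pandas as pd

# Invented values standing in for the state-by-year table.
act_min_wage = pd.DataFrame(
    {"Alaska": [10.0, 12.0, 14.0], "Arkansas": [1.0, 1.0, 4.0]},
    index=pd.Index([1968, 1969, 1970], name="Year"),
)

stats = act_min_wage.describe()
print(stats)  # rows: count, mean, std, min, 25%, 50%, 75%, max

# Which state's values varied the most over the years?
widest = stats.loc["std"].idxmax()
print(widest)
```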

  • One of the other cool things that we have is correlation and covariance, just instantly built in for us, so we could do act_min_wage.corr().

  • And I'm gonna do .head() just so we don't print out, like, this massive table.

  • So with correlation, this would display all the states by all the states.

  • Right?

  • So, like, Alaska with Alaska, you know, this diagonal will be perfect correlation with itself.

  • Obviously,

  • we do have some not-a-number data, so we can look into that next and figure out, okay, what's going on there?

  • Um, but otherwise we get correlation.
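
A quick sketch of the built-in correlation and covariance, on invented values. It also shows one way NaN can appear in a correlation table: a column that never changes (here all zeros) has zero variance, so its correlation with anything is undefined.

```python
import pandas as pd

# Invented values; "Alabama" is all zeros to mimic a state with no data.
act_min_wage = pd.DataFrame({
    "Alabama": [0.0, 0.0, 0.0, 0.0],
    "Alaska": [1.0, 2.0, 3.0, 4.0],
    "Arkansas": [2.0, 4.0, 6.0, 9.0],
})

corr = act_min_wage.corr()  # pairwise Pearson correlation, states by states
cov = act_min_wage.cov()    # pairwise covariance, same layout

print(corr.head())
```

The diagonal entries for columns that do vary are 1.0, while every entry involving the constant column is NaN.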

  • What the... OK, probably gonna finish this tutorial shortly.

  • Um, I'm surprised I didn't lose power there.

  • That was really close lightning.

  • Anyway, um, where was I? So we got the instant correlation data here.

  • What perfect timing for my Franken mug, or Frankenstein's monster mug.

  • Anyways, so, uh, the one thing that is curious to us immediately when we see data like this is, like, what?

  • Why are we getting these NaNs?

  • Why are these zeros we want to see?

  • Like, Did we make a mistake, or is that just something inherent about our data?

  • So, um, so one thing that we can do is start to kind of look through our data and just kind of see, like, what's going on here?

  • So one thing we can say is df.head(), and immediately we can see Alabama has nothing.

  • It's got no values, and then even, like, the Table_Data just has this, like, "..." in it.

  • So probably something is not working.

  • And then what you could do is, I wonder if they put the link.

  • Yeah, they do put the link here, so you could just click on the link literally and come here and see.

  • You know, um, there's definitely "..." there. It's not like some JavaScript that we would have to click; it's just simply not there.

  • Um, so coming back here, df.head(). Um, so one thing that we could do is we could just check, like, okay, so if we say issue_df = df, where df,

  • um, let's just see.

  • Really,

  • a lot of these columns could work, but we'll just say Low.2018 is equal to zero.

  • So that's our issue_df.

  • Right, and then issue_df.head(), and we can see:

  • Okay, there's quite a few of these.

  • And then we could even find out just how many states are problems by doing issue_df, uh, "State", is it a capital?

  • Yes, capital. .unique().

  • And now we can see.

  • Okay, these are all the states that for whatever reason, we just can't get data on.
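
The filter just described, sketched on a toy table. Column names follow the tutorial; the rows and the zero pattern are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the minimum wage table (values are invented).
df = pd.DataFrame({
    "Year": [1968, 1969, 1968, 1969, 1968, 1969],
    "State": ["Alabama", "Alabama", "Alaska", "Alaska", "Florida", "Florida"],
    "Low.2018": [0.0, 0.0, 15.12, 14.33, 0.0, 0.0],
})

# Boolean mask: keep only the rows where the 2018-dollar low value is zero.
issue_df = df[df["Low.2018"] == 0]
print(issue_df.head())

# Distinct states that appear among the zero rows.
problem_states = issue_df["State"].unique()
print(problem_states)
```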

  • Um, so that's okay.

  • Uh, like I said, I don't know why there's an ellipsis here, but there just is. I don't know why we're not getting that data, but whatever.

  • So, um, coming on over here, um, I think we'll just have to move on from that.

  • So what we could say is we could just get rid of that data, like, that's no good.

  • We have no data, so this is gonna be super common in a lot of data sets where you've got data,

  • Um, but then you've got a lot of missing data for whatever reason.

  • And one thing we know for sure is all of the Alabama data is no good.

  • And that's the truth for all of these states.

  • All the Florida data.

  • Georgia, Illinois.

  • There's no reason for us to continue with those states in our data set.

  • It just doesn't make any sense.

  • So what we can say instead is we can import numpy as np, because we're gonna use np.nan, and NumPy should already be installed.

  • If it's not, if you get an error, you can pip install numpy.

  • But, um, you should have gotten NumPy when you installed pandas, so I don't think I need to say that, but just in case. Okay, so now what we want to do is act_min_wage, and then we're gonna replace all instances of zero with np.nan.

  • So we're just gonna say not a number.

  • I know zero is the number, but actually What's happening is we don't really have any data at all.

  • And, uh, whoever parsed this decided to call that zero rather than no data.

  • I think. I don't actually see zeros being reported here, so, um, yeah, I think we're better off just replacing that with not a number, because we actually don't even know.

  • And it definitely wasn't zero,

  • I don't think. Surely those states actually have a minimum wage.

  • Who knows?

  • I could learn something new.

  • Um, I'm pretty sure, like wasn't Texas in there?

  • Yeah.

  • Okay, so Texas had minimum wage.

  • So anyway, moving along.

  • Uh, okay, so we replace it with not a number, and then we could just say dropna, and then we'll say axis=1.

  • So if axis is one, that will get rid of columns.

  • So if a column contains not a number, it will get rid of that.

  • The default is axis=0, which actually means rows.

  • So if any NaN is in a row, it's going to get rid of the entire row, which obviously we don't want, because then we would lose all our state data.

  • So, actually, we just want to get rid of the columns that have NaNs, so axis=1.

  • So then we could say .corr().head().

  • And now we just have the states that actually have minimum wage data.
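
A sketch of the replace-then-drop step on an invented table. It contrasts `axis=1` (drop columns containing NaN) with the default `axis=0` (drop rows containing NaN), which here would throw away every year.

```python
import numpy as np
import pandas as pd

# Invented values; "Alabama" is all zeros to mimic a state with no data.
act_min_wage = pd.DataFrame(
    {
        "Alabama": [0.0, 0.0, 0.0],
        "Alaska": [10.0, 12.0, 14.0],
        "Arkansas": [1.0, 1.0, 4.0],
    },
    index=pd.Index([1968, 1969, 1970], name="Year"),
)

with_nans = act_min_wage.replace(0, np.nan)  # treat 0 as "no data"

cleaned = with_nans.dropna(axis=1)       # drop any column that contains a NaN
rows_dropped = with_nans.dropna(axis=0)  # default: drop any row that contains a NaN

min_wage_corr = cleaned.corr()
print(min_wage_corr.head())
```

With these toy values, `cleaned` keeps only the two states that have data, while `rows_dropped` ends up empty because every year has a NaN in the Alabama column.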

  • Now, the next thing that we could do, um, is we could check for, um, let me think, uh, well, so we don't actually know if all of these are, like, all zeros. So, for example, maybe in the 1960s Texas had no minimum wage, and then later it got it.

  • So maybe that's why it had NaNs; we don't really fully know.

  • So one thing we could say is, like, for problem in issue_df, and then "State", .unique().

  • And then we're just gonna ask if problem in min_wage_corr.

  • Um, oh, did I even define it?

  • I think I just... okay, yes, we just printed that out.

  • So what I'm gonna do instead is just, uh, take this, and then we'll say min_wage_corr equals that act_min_wage expression.

  • Okay, min_wage_corr.columns.

  • So if we find that problem, we'll just say print,

  • um, "we're missing something here" or something

  • like that, but we definitely shouldn't.

  • Because we should be dropping the entire column if it had any NaNs.

  • So the other question you could ask is like, how many zeros are in?

  • Uh, any of those.

  • So, like, one thing you could ask is, like, you could count the number of zeros in the state column or something like that.

  • And that would probably give you a better idea, because if it has any NaNs, we're dropping it.

  • But anyways, we've dropped them all.

  • Anything that had a NaN, we've dropped. Later,

  • maybe we should check and see, like, maybe later on they got a minimum wage.

  • I don't know.
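
Both checks mentioned above, the membership test against the correlation table's columns and the per-state zero count, can be sketched like this. The table is invented, and the `PartlyMissing` column is a hypothetical state with data for only some years, added to show why the zero count is more informative.

```python
import numpy as np
import pandas as pd

# Invented values; "PartlyMissing" is a hypothetical state with one zero year.
act_min_wage = pd.DataFrame(
    {
        "Alabama": [0.0, 0.0, 0.0],
        "Alaska": [10.0, 12.0, 14.0],
        "PartlyMissing": [0.0, 3.0, 4.0],
    },
    index=pd.Index([1968, 1969, 1970], name="Year"),
)

min_wage_corr = act_min_wage.replace(0, np.nan).dropna(axis=1).corr()

# States that have at least one zero anywhere in their column.
problem_states = [c for c in act_min_wage.columns if (act_min_wage[c] == 0).any()]

# None of them should survive into the correlation table.
leftover = [state for state in problem_states if state in min_wage_corr.columns]
for state in leftover:
    print("Missing something here:", state)

# How many zero years does each state have?
zero_counts = (act_min_wage == 0).sum()
print(zero_counts)
```

A state with only some zero years is dropped just the same as one with no data at all, which is what the zero count makes visible.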

  • Anyway, Cool.

  • So, um, trying to decide?

  • Um, in fact, we really we could just show that really quickly.

  • So, um, we could say grouped_issues = issue_df.groupby,

  • and we could group by State.

  • And then we could say, like, grouped_issues.get_group, uh, "Alabama".

  • And then .head() would just print a couple here, and we get... okay, so now we can see.

  • Okay.

  • The footnotes are NaNs; also, these are all zeros, right?

  • Because, well, we haven't replaced them with np.nan yet.

  • So one option we have, because it's been replaced with a zero, is we could actually just sum that entire group.

  • Right?

  • So if we said something like this, if we said, uh, grouped_issues.get_group("Alabama"), and then we said, uh, Low.2018, and then we said .sum(), we get zero, right, because the entire column just adds up to zero.

  • So we never get minimum wage data for Alabama, so we could do the exact same thing: we could iterate over this.

  • And it's for state, data in grouped_issues, huh?

  • What do we want to do? If data, uh, Low.2018, .sum() does not equal 0.0,

  • then we missed something.

  • Okay, none of those hit, so literally

  • all of them always had no data in them.

  • Okay?

  • So just a quick way that we could actually check that.
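
The sum check can be sketched as below, on invented rows. One caveat worth stating as an observation rather than a correction: `issue_df` was built from rows where `Low.2018` equals zero, so sums over its groups are zero by construction. The second loop, which groups the full table, is an assumed variant that could actually catch a state with data in some years; `PartlyMissing` is a hypothetical state name.

```python
import pandas as pd

# Toy stand-in for the minimum wage table (values are invented).
df = pd.DataFrame({
    "Year": [1968, 1969, 1968, 1969, 1968, 1969],
    "State": ["Alabama", "Alabama", "Alaska", "Alaska",
              "PartlyMissing", "PartlyMissing"],
    "Low.2018": [0.0, 0.0, 15.12, 14.33, 0.0, 3.0],
})

issue_df = df[df["Low.2018"] == 0]
grouped_issues = issue_df.groupby("State")

# Sum of the zero rows for one state.
alabama_total = grouped_issues.get_group("Alabama")["Low.2018"].sum()

# The check from the walkthrough: does any flagged group sum to non-zero?
missed = [state for state, data in grouped_issues
          if data["Low.2018"].sum() != 0.0]

# Assumed variant: group the full table, then look at the flagged states only.
flagged = set(issue_df["State"])
partly_missing = [state for state, data in df.groupby("State")
                  if state in flagged and data["Low.2018"].sum() != 0.0]

print(alabama_total, missed, partly_missing)
```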

  • So, um okay, I think I'm gonna stop it here.

  • And in the next tutorial, we're gonna talk about visualizing this correlation in, like, a big correlation graph, which is gonna end up sending us down another rabbit hole entirely.

  • But It'll be fun.

  • So anyways, that's all for now.

  • Questions, comments, whatever.

  • Feel free to leave them below, as always.

  • Thanks everybody for your support, subscriptions, donations, all that stuff.

  • And I will see you guys in the next video.


Groupby - Data Analysis with Python and Pandas p.3

Posted by 林宜悉 on January 14, 2021