  • Hello, everyone.

  • And welcome to my mini course on the essentials of data science.

  • This mini course provides a super basic look into data science: what it is and the three main components that make it up.

  • Data science is a very mainstream term that gets thrown around a lot, but its actual definition is quite vague.

  • This mini course is designed to help those of you who are curious about data science develop a better and more specific understanding of the topic.

  • There are definitely more advanced techniques within data science, such as machine learning.

  • But even these can be traced back to the three essential components that we'll cover.

  • Before we get straight into it, I thought I'd quickly introduce myself.

  • My name is Max and I work as a data scientist.

  • After getting my degree in physics, I found myself more and more drawn into the world of data science.

  • So instead of diving into the realm of physics research, I taught myself all the tools and techniques a data scientist needs and shortly after landed my dream data science job.

  • I've since also started teaching data science to others and have been fortunate enough to teach what is currently over 9000 students.

  • This course draws on the skills I've gathered and learned over the past five years in data science, so let's jump right into it.

  • So what is data science?

  • Well, you can summarize data science in different ways, but the main part of it is transforming data into information.

  • And this is a really big step, because a lot of people talk about data and big data and all of these things.

  • But data by itself isn't really that useful until you can turn it into information.

  • If you just have a bunch of numbers appearing somewhere, and there's so much of it, no one can make sense of that.

  • That's where you need a data scientist to take all of this vagueness and noise and extract information from it.

  • And that's what a data scientist does.

  • Now, how you get this information is through analyzing the data. A big part of that is cleaning things up and doing some processing on it.

  • Once you've cleaned things up a little bit, you analyze, and that is one of the ways you can get information out of your data.

  • Through this analysis you can continue on and see trends, patterns, and, hopefully, all types of correlations, and all of these things again build up into this component of turning data into information.

  • And then ultimately you also need to contextualize everything that you have.

  • Your computer can't do that for you.

  • Your computer can crunch the numbers and stuff.

  • But it's your responsibility to make sense of what's in front of you.

  • And even if you see something, you don't just blindly trust it.

  • You need to understand: where am I at?

  • Where am I coming from?

  • Where is this data coming from?

  • You need to be able to contextualize these things and then, of course, be able to understand as well as apply them.

  • And so once you have this data, it's great, but turning it into information that you can use and directly apply, that's where the real power lies.

  • And that's also kind of the role of a data scientist.

  • So that's pretty much what data science is.

  • And so what does a data scientist do?

  • Well, we already talked about this just a little bit, but let's go over it again with some more concrete examples.

  • A data scientist would, for example, get and process raw data and then convert it into something a little bit clearer.

  • You can imagine a data stream coming in.

  • You have a measuring device that is constantly measuring all sorts of data, and because nothing is really constant, everything will be fluctuating up and down.

  • A data scientist would have to take all of this data and clean it up a little bit, maybe reducing the fluctuation that isn't supposed to be there, that's just background stuff going on.

  • Then you put it into a format that you can easily plot against other things.

  • And with that we already get to the next point: once the data is cleaner, you can start doing some calculations on it and figuring out the core statistical components, like what the average value is.

  • What am I really dealing with? You're getting a first look at, and a first understanding of, what it actually is that you're tackling.
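
The clean-up and first-look steps described above can be sketched in a few lines of Python. This is only an illustration: the readings are made-up numbers, and a moving average is just one assumed way to damp fluctuation, not necessarily the method used in the course.

```python
from statistics import mean


def moving_average(values, window=3):
    """Average each run of `window` consecutive readings to damp fluctuation."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


# Hypothetical readings from a measuring device (made-up numbers).
readings = [10.2, 9.8, 10.1, 10.4, 9.7, 10.0, 10.3]

smoothed = moving_average(readings)   # cleaned-up version of the stream
overall_average = mean(readings)      # a first core statistic

print("smoothed:", smoothed)
print("average:", overall_average)
```

The smoothed list is shorter than the input because each output value needs a full window of readings.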

  • Once you have this kind of understanding, you can start to do some visualizations, which help you as a data scientist maybe see some trends or patterns already.

  • But visualizations are also really key because they let you show it to other people.

  • And they're a great means of communication.

  • So they help both us data scientists and others when we try to convey this information to them.

  • And then finally, you have to suggest some applications of the information. It's not really enough to just look at it and say, "Yeah, I see it goes up and down, and that's good."

  • But what does that mean?

  • How does this transfer into something useful?

  • And that's also one of the key roles of a data scientist: transferring information into knowledge.

  • So you've got this data-into-information step.

  • But you also need to transfer this information into knowledge, and those are two really powerful things that are worth a lot.

  • And that's pretty much what a data scientist focuses on.

  • And then you can go further and take this data and do machine learning with it or something.

  • If you really understand what's going on, or if you have some hypothesis of what could happen, you can take things a lot further.

  • But ultimately, turning data into information and then into knowledge, that's your role.

  • All right, so let's go into the essential techniques or the essential components of data science.

  • The first essential component, and we touched on this already, is statistics. We're going to cover this later on.

  • But let's just give a quick rundown.

  • In statistics, you need to understand the different data types that you can encounter.

  • Data can come in different ways, and we'll go into more detail on this later.

  • It's not just that you get a bunch of numbers; data can come in very many different ways depending on the field that you're in.

  • And so you need to be prepared and you need to kind of be aware that data may not always just be a direct number for you.

  • And then, of course, you need to understand some key statistical terms, like the different types of means, and also understand fluctuations in data.

  • And the reason this is important is because these key statistical terms give you an overview of how this data is behaving.

  • And depending on how the data is behaving, you may want to approach it differently.

  • So if you know that your data is very clean and there's very little fluctuation, then if you visualize things, or maybe fit some curves to it, you can probably trust what's going on.

  • But if you see there's a lot of fluctuation in your data, visualizing it is going to be much more difficult, because you just see jumps everywhere and you're not really sure which of this is actually true and which of this is caused by some interference somewhere, or by someone having messed with your system.

  • And so all of these things will kind of be hinted to you through statistical terms.

  • So it's good to be comfortable with these things and to be able to get some meaning out of them.

  • And then, finally, in statistics you need to be able to split up, group, or segment data points. When you have a big data set, you want to be able to split it up into smaller pieces, compare different regions, look into some things in more detail, and maybe isolate a few components because those are probably going to be important.

  • The rest I don't really care about that much.

  • So you need to be able to isolate and meddle with the data a little bit.
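
As a rough illustration of splitting a data set into groups and comparing them, here is a minimal Python sketch. The region names and measurements are invented for the example, and comparing group means is just one assumed way to contrast segments.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data set: (region, measurement) pairs, made up for illustration.
records = [("north", 4.0), ("south", 7.0), ("north", 6.0), ("south", 9.0), ("east", 5.0)]

# Split the data points up by region.
groups = defaultdict(list)
for region, value in records:
    groups[region].append(value)

# Compare the segments by their mean.
group_means = {region: mean(values) for region, values in groups.items()}
print(group_means)
```

Once the data is grouped like this, a single segment (for example `groups["north"]`) can be isolated and examined on its own.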

  • So these are the statistical components that we're going to look into.

  • All right, so the next big thing, and we've already talked about this too, is data visualization.

  • And we'll see why data visualization is a really key skill for data scientists.

  • And then we'll also be covering different types of graphs that you can use and how you can compare different numbers of variables.

  • For example, you can have one-variable graphs, where you only look at one thing and want to see how it changes. You have your typical two-variable graphs, which you probably know, where you have an x and a y axis.

  • There you can see how two variables relate to each other. Or you can have three-variable or even higher-variable graphs, where you plot three different things, or even more, next to each other, as long as it makes sense, so that you can compare multiple things at the same time.

  • And now we come to the other big thing that you're probably going to need as a data scientist, which is the ability to program. Not every data scientist can do this, but it is really essential, in my opinion, to your role as a data scientist, because knowing how to program is going to make your life so much easier.

  • If you know how to program, you can take your ideas and your thoughts and put them into action on the computer.

  • You can automate everything, you can customize things, you can explore.

  • You can prototype, you can test, and you're not reliant on some application.

  • You don't have to master some application.

  • With an application, if it doesn't work, or if one feature isn't there, you have to contact customer support.

  • And maybe it's not even possible.

  • And then you have to wait for an update.

  • Or maybe something is bugged.

  • With programming, you're so much more reliant on yourself, and you can really just do whatever it is you want to do.

  • And you're not reliant on other people or on the tools that other people have built for you.

  • Rather, you can pretty much just do what you want to do without there being major roadblocks. And then we'll also look at some essential packages in Python.

  • So in programming, you never want to reinvent the wheel.

  • You always want to start off where the last person left off. So you would need to teach yourself to program and to write simple programs, but you wouldn't need to write highly complex mathematical packages or data analysis packages.

  • Those are already out there.

  • All you need to do is be able to download them and implement them in your code, and they're going to work.

  • You know, they've been tested a lot.

  • There are huge communities working on them, improving them and everything.

  • All of this is for the community.

  • And so the whole community kind of works together to improve it.

  • No one's really directly trying to make a lot of money off of it, so they're not gonna charge you all of these service fees and everything.

  • Everyone's just trying to improve their package because if it improves, everyone also benefits from it.

  • And so we'll look at some of the libraries.

  • We'll talk about some libraries that you can use, especially in Python, to help you along your way with data analysis and to become a successful data scientist.

  • In this chapter, we're gonna talk about statistical data types.

  • Now we're going to look at the three different types of data, which are numerical, categorical, and ordinal.

  • These are the types of data that we talked about before, when I said you can't just expect your data to be numerical.

  • And so we'll see numerical data.

  • But we'll also see the two other types of data that you may be encountering in your career as a data scientist.

  • All right, so let's talk about numerical data first.

  • Numerical data is also known as quantitative data, and it's pretty much things that you can measure.

  • It's numerical stuff that you can do math with.

  • You can compare it: saying this plus this makes sense, or saying that A is greater than B.

  • These are all examples of numerical data.

  • Numerical data can be split up into two different segments.

  • One of them is discrete, and discrete means the values only take on distinct numbers.

  • An example of this would be a measurement of IQ, or, if you do a coin toss, the number of times that you toss heads.

  • You can have 15 heads, or you can have 12 heads out of 20 coin tosses.

  • You can have 500 heads out of 1000 coin tosses, or 500 out of 600, or any of these things.

  • But all of these are distinct numbers. They don't have to be whole numbers specifically, but they do have to be distinct.

  • That's the very important part: there is a step size that you're dealing with.

  • And, of course, you can still say that flipping eight heads out of 20 is better than flipping seven heads out of 20 if you want to flip heads, or that flipping eight out of 20 is worse than flipping seven out of 20 if you're going for as many tails as you can.

  • So all of these comparisons make sense.

  • So that's the discrete part of numerical data.

  • Then we have the continuous part. Continuous means that values can take on any number, and they're not limited by decimal places.

  • If a value could be 1.1 and the next possible value would be 1.2, that's not continuous.

  • That's still discrete, because you have this step size of 0.1.

  • Continuous means that literally every number from start to finish can be taken on.

  • And this doesn't mean every possible number in the universe, from negative infinity to plus infinity and all imaginary numbers and everything that comes with it. That's not required for continuous.

  • It could really be that just every number between zero and one can be taken on.

  • For example, let's say you have a bottle of water, and this bottle can hold one liter.

  • If your bottle starts off empty and you fill it all the way up to the top, the amount of water you have needs to take on every single number between zero and one, because you can't fill up water in small increments of, say, 0.2 liters every single time. The water doesn't just teleport from A to B.

  • When you're pouring in water, it's more like the stream we see here, and the water level rises and rises and rises.

  • And so the amount of water that we have in our bottle needs to take on every value between zero and one.

  • So that's an example of continuous data. But you see that we can be limited to the range between zero and one.

  • We don't have to, you know, start at zero and go all the way up to infinity or something.

  • It's just that, within the range we're looking at, every single number can happen.

  • Another good example would be the speed of a car. Say you're standing still at a stoplight.

  • And then you want to accelerate.

  • The speed limit is, say, 50 miles an hour. To get to 50 miles an hour from your starting position, your car has to take on every single speed in between.

  • Of course, you won't see that on your speedometer; it would say something like zero miles an hour, one mile an hour.

  • Maybe it could show something like 0.1, 0.2, 0.3 or something like that.

  • So it may look discrete to you, but that's not how your car is going.

  • Your car doesn't say, "Oh, I'm going to go in these step sizes of speed." It's going to accelerate.

  • It's going to take on every value starting from zero going up to 50 miles an hour, and during this transition you're going to take on every single one of those speed values.

  • So that's what continuous data looks like.

  • And it's important to understand the difference between discrete and continuous.

  • That's because you may want to approach them differently.

  • Now, of course, computers can't deal with infinitely many decimal places. We have to cut it off somewhere, and so continuous data is usually going to be rounded off at some point.

  • But it's still important to know that you're dealing with continuous data rather than discrete, so that you know there can still be other values in between, rather than having specific step sizes where all you see is a bunch of lines at every step.

  • With continuous data you can expect that everything is filled in, and that values can, and may well, fall in between.

  • So that's the important thing to note about discrete versus continuous.
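
To make the rounding point concrete, here is a small Python sketch of a speedometer-style display. The helper `displayed_speed` and the speed values are hypothetical, chosen only to show how continuous values collapse onto discrete steps once they are rounded.

```python
def displayed_speed(true_speed, step=1.0):
    """Round a continuous speed to the nearest multiple of `step`, as a display might."""
    return round(true_speed / step) * step


# Made-up "true" speeds in miles per hour; the underlying quantity is continuous.
true_speeds = [0.0, 0.37, 12.499, 12.5001, 49.83]

# What a display with a step size of 1 mph would show.
shown = [displayed_speed(s) for s in true_speeds]
print(shown)
```

Five different true speeds end up as only four distinct displayed values, which is why rounded continuous data can look discrete even though values in between are possible.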

  • All right, so the next type of data that we'll have is categorical. Categorical data doesn't really have a mathematical meaning, and you may also know it as qualitative data.

  • Categorical data describes characteristics.

  • A good example of this would be gender.

  • So here there is no real mathematical meaning to gender.

  • Of course, if you have the data, you can say male is zero and female is one.

  • But you can't really compare the two numbers even though you assign numbers to them.

  • You may just do this so that you can split the data up later on and so that your computer can understand it, but it doesn't really make any sense to compare.

  • Well, you can say male is not equal to female, but you can't really ask whether one is greater than the other or whether one is approximately equal to the other.

  • Those things don't really make sense because they're not well defined.

  • What does that mean?

  • And you can't really add them up either.

  • You can't say male plus female; that doesn't give you a third category or something.

  • So you can't really apply math to categories.

  • But they're nice ways to kind of split up or group your data, and they provide these nice, qualitative pieces of information that are still important.

  • It's just that you can't really go about plotting them on a line or something like that.

  • So those are important things to note with categorical data.

  • Another example would be ethnicity, or you could also have nationality.

  • All of these things are examples of categorical types of data.

  • So, like we said, you can assign numbers to them, but that's really just for your code so that it's easy to kind of split them up, but you still can't really compare them.

  • How are you gonna compare nationalities?

  • There is really no definition for comparing one category to another.
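
A short Python sketch of assigning numbers to categories purely as labels. The nationalities listed and the alphabetical code assignment are assumptions for illustration; the point is that the codes help with splitting and counting, not with arithmetic.

```python
from collections import Counter

# Made-up categorical observations.
nationalities = ["German", "French", "German", "Spanish", "French", "German"]

# Assign an arbitrary integer code to each category (alphabetical order here).
codes = {name: i for i, name in enumerate(sorted(set(nationalities)))}
encoded = [codes[n] for n in nationalities]

# Equality checks and counting per category are meaningful.
counts = Counter(nationalities)

# codes["German"] > codes["French"] evaluates to True, but it says nothing
# about the categories themselves; the numbers are only labels.
print(codes, encoded, counts)
```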

  • And so the third type of data that you can encounter is something called ordinal data.

  • Ordinal data is a mixture of numerical and categorical data, and a good example of this would be hotel ratings.

  • You have star ratings of zero, one, two, three, four, or five stars, or maybe even six stars.

  • Or whatever hotels go up to these days, but it's still not as straightforward to compare.

  • So I'm sure you've seen two different types of three star hotels.

  • One of them had the bare minimum.

  • The beds were okay, but it wasn't really anything special.

  • And then you had three star hotels that you could have sworn were at least four star.

  • And so star ratings do make sense.

  • We can say a four star hotel is probably better than a three star hotel, because there are standards for these things.

  • They have been checked.

  • If you go to a four star hotel, you know roughly what to expect.

  • But still, it's not completely defined.

  • So, coming back to this three star example, if you just say, "Hey, we're going to a three star hotel,"

  • it's very hard to know exactly what to expect, because there are different kinds of three star hotels.

  • There are three star hotels that have developed on to, say, having a swimming pool or something like that.

  • And there are three star hotels that are really more like hostels, that just made it past the two star mark.

  • And so there it's much harder to define or to know what to expect.

  • Now, if you take averages of these star systems, though, then you do get a much better idea of what's going on.

  • So if you have consumer reviews or something like that, and you say, "From 500 reviews, our hotel has an average rating of 3.8,"

  • then you know that the three star hotel that you're looking at is pretty much a four star hotel.

  • It feels like a four star hotel, even though it may not have all of those qualifying characteristics.

  • That's the kind of feel you get from it.

  • Whereas another three star hotel may have a rating of 2.9 or something.

  • And there you know this hotel is more towards the lower end of three stars.

  • Some people may not even consider it to be three stars.

  • And of course, this rating may be a little bit biased, because the reviewers went to a different three star hotel first.

  • And then they went to this one and they were expecting something completely different from a three star hotel.

  • So they said, "This can't be three stars.

  • This is two stars." But that's because of the way the ranking system is defined underneath.

  • And so when we take averages of these ordinal numbers, they start to make a little bit more sense.
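
The idea of averaging star ratings can be sketched as follows. The review lists are invented so that the averages come out to the 3.8 and 2.9 mentioned above; they are not real data.

```python
from statistics import mean

# Hypothetical guest reviews (in stars) for two hotels that are both officially three star.
hotel_a_reviews = [4, 4, 3, 5, 4, 3, 4, 4, 3, 4]
hotel_b_reviews = [3, 3, 2, 3, 4, 3, 3, 2, 3, 3]

avg_a = mean(hotel_a_reviews)  # feels close to a four star hotel
avg_b = mean(hotel_b_reviews)  # towards the lower end of three stars

print(avg_a, avg_b)
```

A single review is a coarse, subjective rank, but the average over many reviews separates the two hotels more clearly than the official star category does.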

  • All right, so let's go over a small exercise and see if we can identify what type of data we're dealing with.

  • So the first thing we'll look at is a survey response on happiness.

  • You have people filling out a survey, and one of the questions is, "How would you rate your happiness?", with the options bad, neutral, good, or excellent.

  • What type of data would this be?

  • Well, this would be an ordinal type of data, because it's still in the form of categories and you're asking for a subjective opinion.

  • But it does make sense to compare them.

  • You can still compare them.

  • You can say excellent is greater than good.

  • Good is greater than neutral.

  • Neutral is greater than bad.

  • But what exactly does it mean to be good or excellent?

  • Where do different people draw the line for this?

  • There's still a little bit of vagueness involved.

  • But generally it does make sense and you can compare it.

  • And if you have a lot of surveys and you average them, the values you get are probably going to be very representative, or at least pretty representative.
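
Here is a minimal Python sketch of treating the survey answers as ordered categories. The numeric ranks in `RANK` are an assumption: only their order reflects the survey, and the equal spacing between them is a modelling choice.

```python
from statistics import mean

# Assumed rank for each answer; only the order is meaningful.
RANK = {"bad": 0, "neutral": 1, "good": 2, "excellent": 3}

# Made-up survey responses.
responses = ["good", "excellent", "neutral", "good", "bad", "good"]

ranks = [RANK[r] for r in responses]
ordered = sorted(responses, key=RANK.get)  # ordering the answers makes sense
average_rank = mean(ranks)                 # averaging many answers gives an overall picture

print(ordered, average_rank)
```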

  • All right, so if we look at the next thing, which is gonna be the height of a child, what type of data is that?

  • Now, we can say it's probably numerical, and actually it most definitely is numerical.

  • The height of a child is a numerical value.

  • But let's go a little bit deeper and ask: is the height of a child discrete?

  • Or is the height of a child continuous?

  • Well, even though when you measure height you get something like five foot three, or 160 centimeters,

  • it's not a discrete value, because to get to that height, you have to have reached every single height before. When you measure it, you're rounding it off to what your measuring tape can measure.

  • So your measuring tape is limiting the measured height.

  • But if you had a super, super precise measuring instrument, you could measure not just five foot three or something like that.

  • You could really go into detail with the inches and the decimal places and everything.

  • So the height of a child would be a numerical data type, and it would be continuous.

  • All right, now let's talk about the weight of an adult.

  • Do you expect the weight of an adult to be discrete or continuous? We can probably agree that it's numerical, because it's a weight value.

  • It's pretty much defined to be a number.

  • So do you expect it to be discrete or continuous? The right answer here is continuous again, because to reach a certain weight, they would have had to reach every single weight in between before.

  • So again, weight is something that we can consider to be continuous.

  • All right.

  • And so finally, let's look at the number of coins in your wallet.

  • We can already tell by the name.

  • It says number of coins.

  • So we can probably agree that this is a numerical type of data.

  • But the number of coins in your wallet, would that be discrete or continuous?

  • Well, the answer would be discrete, because it doesn't really matter

  • what denomination your coins are.

  • They could be 50 cent pieces, 25 cent pieces, tens or fives or ones or anything, like a two or something like that.

  • But the number of coins that you have is going to sum up to a whole number. You can have one coin.

  • You can have two.

  • You can have three all of these things.

  • But you can't have infinite fractions of a coin.

  • You can't have, say, the square root of two coins.

  • That doesn't really make sense.

  • So you have a defined step size.

  • You have one coin.

  • And then if you add a second coin, you have two, and with a third you have a quantity of three. You're going in step sizes of one.

  • So for the number of coins in your wallet, we'd be having discrete numerical data.

  • In this tutorial, we're gonna talk about the different types of averages.

  • Now we're going to see the three different types of averages, which are the mean, the median and the mode.

  • All right, let's get started.

  • So we'll start off with the mean.

  • Now, the mean is the typical average that you know.

  • Really, what the mean is: you just sum all of the values up

  • and then you divide them by the total number of values that you have.

  • The great pro of the mean is that it's very easy to understand.

  • It makes sense.

  • We just take everything we have, add it all up, and divide by how many values we have, and that should give us a good representation of what the average is.

  • And it also takes into account all of the data.

  • So since we're adding everything up and then dividing by how much data we have, we're taking into consideration every single data point.

  • Now there are some problems with this.

  • One of the problems is that the mean may not always be the best description, and we'll see why when we look at examples for when we should use the median or the mode. The mean is also very heavily affected by outliers.

  • Since we're taking everything into consideration, if we have big outliers, that's really going to change what our mean looks like.

  • If we just have normal values between one and five, and all of a sudden we have 10,000 in there, that's really going to affect our mean. So the mean is heavily influenced by outliers, and the bigger the outlier, the more the mean is influenced by it.

  • All right, so let's see some examples of the mean. We'll go through a worked example first, and we can see our data set here.

  • Just a bunch of numbers.

  • And what we're going to do to calculate the mean is take every single one of these numbers.

  • We're going to add them up, and we can see the total result that we'd get here.

  • And then the next thing we're going to do is take this total result.

  • We're going to count the number of data points that we have, and we're going to divide the total by that count, which then gives us our mean, as we can see here.
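
The same steps written out in Python. The data set shown in the video is not part of this transcript, so the numbers below are stand-ins chosen only for illustration.

```python
data = [3, 7, 8, 5, 12]  # made-up numbers, not the data set from the video

total = sum(data)           # step 1: add every value up
count = len(data)           # step 2: count the data points
mean_value = total / count  # step 3: divide the total by the count

print(total, count, mean_value)
```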

  • So that's an example calculation of the mean.

  • But let's see some example applications of the mean.

  • So when would we use it?

  • Well, a good application would be the time it takes you to walk to the supermarket. Sometimes you walk a little bit faster and maybe it takes 20 minutes to get there.

  • Sometimes you walk a little bit slower.

  • Then it takes you 25, but on average it takes you somewhere around 22, or maybe 22 and a half minutes.

  • So if you say, "I'm going to go to the supermarket," you know it's going to take about this much time to get there.

  • Another good example of the mean would be the exam score for a class.

  • To get a good understanding of how people do in an exam or in a class, you can look at the mean exam score from last year.

  • And since exam scores are in a smaller range, the mean is going to be good to use here, because you can get anything between zero and 100.

  • But realistically speaking, no one's probably going to get a zero.

  • So your range is even smaller.

  • And so you're less affected by outliers, and you know roughly how hard a class is going to be just by being able to compare the means.

  • So if you look at one class and its mean is higher than another's, and both have a large number of students, then you can probably say it's easier to get a good grade there, or maybe the material is simpler, without diving too deep into it.

  • All right.

  • Another good example of the mean would be how much chocolate you require when you get this kind of sweet craving. You're not going to say, "I require one chocolate bar, or two chocolate bars, or three," but you're going to say, "On average, I require maybe three quarters of a chocolate bar, and sometimes I may want a little bit more because I feel like it."

  • When I start eating chocolate, I sometimes crave even more. Other times I have some and the taste doesn't sit right with me right now.

  • And so I have a little bit less.

  • But these are the kinds of amounts involved.

  • So if you have this craving, either you say, "Oh, I'm going to try to be strong," or you say,

  • "I know this feeling.

  • And I know if I eat about three quarters of a bar of chocolate or something, I'm going to feel good."

  • My craving is going to be satisfied, so you kind of know what to expect.

  • So these are some of the examples of how we would deal with a mean or when we would use the mean.

  • All right, so let's look at the next thing, which is gonna be the median.

  • Now, the median represents the middle value in your data set, once the values are sorted in order.

  • Now, if you have an even number of data points, you don't really have a middle value.

  • And so in that case, the median is gonna be the mean of the two middle values.

  • So it's gonna be the two middle values added together and then divided by two.
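
The rule for odd and even numbers of data points can be sketched as a small function. The name `median_by_hand` and the sample values are made up for illustration; Python's standard library already provides `statistics.median`, which is used here only as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(values)      # the median is defined on sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: a single middle value exists
        return ordered[mid]
    # even count: add the two middle values together and divide by two
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = [7, 1, 5]           # sorted: 1, 5, 7
even_example = [7, 1, 5, 3]       # sorted: 1, 3, 5, 7
print(median_by_hand(odd_example), median_by_hand(even_example))
```

For the even case the two middle values are 3 and 5, so the median is their mean.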

  • So the pros of using a median value are that the median can sometimes be more accurate than the mean, and we'll see some examples of this.

  • The median also evenly splits your data.

  • So you're not really affected the way the mean is: if you have an outlier, it drags the mean to the right, and it could be that your outlier drags things so far to the right that all of your data is to the left of the mean and only the outlier is to the right. That would be an extreme case.

  • But that can happen.

  • Whereas the median, you know, is always located directly in the center of your data.

  • The median also doesn't care about outliers.

  • So if you have huge outliers at the beginning and at the end, it doesn't really care, because outliers, by definition, aren't very common.

  • And so, if you have some at the beginning or some at the end, they're gonna be very few in number, which is what makes them outliers,

  • and therefore the median doesn't really care about outliers that much.

  • A con, though, is that the median doesn't really give you much information on the rest of the data.

  • Sure, you know what's at the center, but you don't know how everything around it behaves.

  • You only know where the center of your data is.

  • So let's see some examples.

  • We'll do a worked example first, where we see our data set here, and we can count how many values we have, going from left to right.

  • Then we can say we've got 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 and 13 data points.

  • So we've got an odd number.

  • And so our median value, our center value, is gonna be the seventh data point, because it's six from the beginning and it's also six from the end.

  • So it's equally spaced both from the beginning and from the end.

  • And so that's why we see our median value here is 26.

  • It's located directly in the center.
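
The data set from the slide is not reproduced in this transcript, so the sketch below uses a stand-in list. It is an assumption, built only to match what is stated: 13 sorted values whose seventh value is 26.

```python
# Stand-in data: 13 sorted values (assumed, the lecture's actual values are not shown).
data = [20, 21, 22, 22, 25, 25, 26, 26, 26, 28, 30, 31, 35]

n = len(data)
middle_index = n // 2            # index 6, i.e. the seventh data point
median_value = data[middle_index]

# Six values sit before the middle and six after it.
values_before = middle_index
values_after = n - middle_index - 1
print(n, median_value, values_before, values_after)
```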

  • Now what is the median useful for?

  • Well, the median is often used if you look at, you know, household incomes for a country.

  • Because if you were to use the mean, then the billionaires would give you a false description of what an average household income really is. Normally, with an average value, you want to be able to say, "Oh, the average household income for a family would be, say, $40,000 or something like that," and that would be the median value.

  • But if you were to use the mean instead, then all of the billionaires and all the millionaires in the country would change that household income, and then you would say, "Oh, the average household income per family would be more like 60K."

  • And that's a bad representation, because it doesn't actually give you a realistic look at what the average household has.

  • The average household really is, you know, centered at like 40K. Sure, there are people below and there are people above, but that's what's in the middle.

  • Whereas if you were to use the mean instead for your average, you would kind of get this inflated household income, which wouldn't be representative of the rest of the country.
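
To make the income example concrete, here is a minimal sketch with invented household incomes: eight ordinary households plus one billionaire. The figures are assumptions chosen to show the effect, not real statistics.

```python
from statistics import mean, median

# Hypothetical yearly household incomes in dollars; the last one is the outlier.
incomes = [30_000, 35_000, 38_000, 40_000, 40_000,
           42_000, 45_000, 50_000, 1_000_000_000]

mean_income = mean(incomes)      # dragged far to the right by the billionaire
median_income = median(incomes)  # the middle household, unaffected by the outlier

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
```

In this toy data every ordinary household lies below the mean, which is the extreme case mentioned earlier, while the median still describes a typical household.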

  • Another good example of the median would be the distance that people cover to get to work.

  • So if you look at this in terms of kilometers, then you can say, "Oh, some people walk to work, and it's one kilometer at most,"

  • or something like that, and then you can say what distance to expect people to travel.

  • Most people travel around three kilometers to work, and sure, there are some that travel much further because they want to live outside of the city, and there are some that travel very, very short distances because they have a house right next to the office, or their house is the office or something like that, depending on where you're working.

  • But then you can look at what's in the middle.

  • How far do people travel to work?

  • What time or what distance do they need to cover?

  • And so that would be another good use of the median.

  • Another good median value is what you usually spend when you buy a new item of clothing.

  • And so sure, sometimes you may go to that expensive clothing store and you could get a jacket that costs, I don't know, north of a couple hundred euros or dollars, whatever currency you want to use.

  • And sometimes you can go to a second-hand store and get it for very cheap.

  • But usually if you go into stores, a jacket will maybe cost you, like, $100 or something like that.

  • So if you go out, you can expect to pay about $100,

  • not really taking that much into account what store you're going into.

  • Most of the stores that you're gonna visit are gonna have about that price for the jacket, so that would be another good use for the median.

  • All right, let's look at the third type of average that we can do, which is the mode.

  • Now the mode looks at the most common value in your data, and it's not really defined if there are several most common values.

  • But if there's only one most occurring value, then that's what your mode would be.

  • So we'll see an example of this in a second. The pros of using the mode are that it's not only applicable to numerical data.

  • So if you look at categories, for example, then you can say, "Hey, we've got five people from the U.S., two from Canada and one from France," and you know that the mode is gonna be the U.S. because there are five people from the U.S.

  • So the mode is a great average that's not only applicable to numerical data, in the sense that it can technically also be applied to categories or to cardinal numbers if you wanted, so that you can say the most common country that we have, or the average kind of country that we would expect here, is the U.S.

  • And sure, there are other countries.

  • But the average, or the most common one, is going to be the U.S. in this case. And then, of course, the other pro is that it allows us to see what's most common, what pops up the most.
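
The country example can be written down directly. The list below just encodes the counts from the example (five from the U.S., two from Canada, one from France), and `statistics.mode` from Python's standard library works on non-numerical values like these.

```python
from statistics import mode

# Five people from the US, two from Canada, one from France, as in the example.
countries = ["US"] * 5 + ["Canada"] * 2 + ["France"]

most_common_country = mode(countries)
print(most_common_country)
```

No mean or median exists for this data, since country names cannot be added or ordered in a meaningful way, but the mode is still well defined.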

  • So that's a great use of the mode, in cases where recurring values happen a lot.

  • Which is the case for discrete numbers, for example: with discrete numbers, values recur often, and so it's good to use the mode.

  • A con of the mode is gonna be that, again, it doesn't really give you a good understanding of the rest of the data, similar to what we had for the median, but also it's not always applicable.

  • If you just have a bunch of different values, then there isn't really gonna be a mode.

  • If there's not enough of each value, it's not really good to use the mode.

  • You don't want to have thousands of data points where the most recurring value

  • recurs, like, three times.

  • That's not good.

  • You want to use the mode for situations where data recurs often.

  • So we saw the country example, but let's actually see a worked example, and also some other examples for the mode.

  • So the worked example here would be again.

  • We take our data set, and we can count how many times different numbers appear.

  • And so if we go through, the numbers will see that 26 occurs the most, and so that's gonna be our mode here.

  • So we've got 22 and 25 that both occur twice, but 26 occurs three times.

  • And so 26 is gonna be our mode.

  • It's gonna be our most occurring value.
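
Counting occurrences can be sketched with `collections.Counter`. The slide's data is not reproduced in the transcript, so the list below is a stand-in that only matches the stated counts: 22 and 25 twice each, 26 three times.

```python
from collections import Counter
from statistics import multimode

# Stand-in data (assumed): 22 and 25 occur twice, 26 occurs three times.
data = [20, 21, 22, 22, 25, 25, 26, 26, 26, 28, 30, 31, 35]

counts = Counter(data)                        # value -> number of occurrences
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)

# multimode returns every value tied for most common, which helps in the
# "several most common values" case where a single mode is not defined.
all_modes = multimode(data)
```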

  • Now, the mode is gonna be useful for things like the peak of a histogram.

  • So if you draw this histogram, and if you don't know what a histogram is, don't worry.

  • We'll cover that in a later lecture when we go into data visualization. But the peak of a histogram is going to show you the mode of the data, the most occurring value. Another good use of the mode would be if you look at employee income at a company, because at a company you can again have the boss, who throws off the mean, and you can have higher-level employees too, who kind of shift the median.

  • But if a third of your employees earn minimum wage, that's gonna be the best average.

  • Or, say, 40% of your employees earn minimum wage. Or probably not your employees, because that wouldn't be a very good system to have, but say 40% of the employees at the company that you're looking at earn minimum wage. That's not a really good thing to have.

  • And if you look at the mode, you'll easily see that the average in this case would be to earn minimum wage, because that's the most common salary.

  • And sure, the boss or the CEO or something may shift the mean up heavily, and then, because you have higher-ups, if you look at the median value, it may well be too far

  • to the right, so that you really don't consider these employees who all earn the same amount.

  • But you really want to get that description, which is what you get here from the mode. And then also, the outcome of an election is something you use the mode for.

  • And sure, sometimes you may only have two values.

  • Sometimes you may have three.

  • But if you have different candidates, say five different candidates, then the person with the most votes is gonna win the election, because they have the most.

  • And so there again, you'll use the mode.

  • In this lecture, we're gonna look at the spread of data, and we're gonna start off by looking at the terms range and domain.

  • Then we're gonna move on to understanding what variance and standard deviation mean.

  • And then finally, we'll look at covariance as well as correlation.

  • All right, so let's start off with the range and domain.

  • Now let's start off with the range.

  • So the range is basically the difference between the maximum and the minimum value in our data set.

  • So that's kind of simple to think about.

  • So let's just go through this with a worked example.

  • Let's set up a company in a town. This is the only company in the town, and the owner of the company earns a salary of 200K a year. The employees all have different salaries, but the lowest-paid employees, or maybe the part-time workers,

  • earn something like 15K a year.

  • So we've got data ranging from 15K up to 200K.

  • And so our range is the difference between the maximum and the minimum value.

  • So we take 200K and we subtract 15K from it,

  • and we've got a range of 185K in salary.

  • So that's how much our salary can vary.

  • So if we start at 15K, it could go all the way up to 200K.

  • So that's a 185K range of salaries that people in this company can have.
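
The range calculation is a single subtraction. In the sketch below only the 15K minimum and the 200K maximum come from the example; the salaries in between are invented filler.

```python
# Yearly salaries in dollars. Only the minimum (15K) and maximum (200K) are
# from the example; the middle values are hypothetical.
salaries = [15_000, 28_000, 42_000, 55_000, 90_000, 200_000]

salary_range = max(salaries) - min(salaries)
print(salary_range)
```

Note that the range depends only on the two extreme values, so the invented middle salaries do not affect the result of 185K.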

  • All right.

  • And the domain is going to be the values that our data points can take on, or the region that our data points lie in.

  • So if we look at this example again, our domain is going to start at 15K and go up to 200K.

  • So the domain defines kind of starting and ending points, or defines a section in our data.

  • And so, in this case, the domain would start at 15K and it would end at 200K.

  • And what the domain tells us is that all salaries between 15K and 200K are possible.

  • But within this domain, or within this company, it's not possible to have salaries outside of this domain.

  • So if our domain, again, is 15K to 200K, then we can't have a salary of 14K because that's outside of our domain.

  • And we also can't have a salary of 205K because, again, that's outside of our domain.

  • So pretty much all salaries within 15K to 200K are possible.

  • Anything outside of the domain is not possible because that's no longer in our domain.
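
One way to express the domain idea in code is as a membership check. The function name `in_domain` and the constants are hypothetical names chosen for this sketch, and both endpoints are assumed to be included in the domain.

```python
# Domain of salaries at the example company, in dollars per year.
DOMAIN_LOW = 15_000
DOMAIN_HIGH = 200_000

def in_domain(salary, low=DOMAIN_LOW, high=DOMAIN_HIGH):
    """Return True if the salary lies inside the domain (endpoints included)."""
    return low <= salary <= high

print(in_domain(14_000))    # below the domain
print(in_domain(100_000))   # inside the domain
print(in_domain(205_000))   # above the domain
```

The range from before is then just the width of this domain, `DOMAIN_HIGH - DOMAIN_LOW`.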

  • All right, so let's move on and look at the variance and standard deviation, and we'll talk about the variance first.

  • What the variance tells us is pretty much how much our data differs from the mean value. It looks at each value, and it looks at how different each value is from the mean value.

  • And then it gives us the variance through a calculation, and we don't really need to know the formula.

  • It's more important right now just to understand the concept of variance.

  • And so what the variance really tells us is how much our data can fluctuate.

  • So if we have a high variance, that means a lot of our values differ greatly from the mean value, and that would make our variance bigger.

  • If we have a low variance, that means a lot of our values are very close to the mean value, and so that will make our variance lower.

  • And now, if we turn to the standard deviation, the standard deviation is literally just the square root of the variance.

  • So if you understand one, then you also understand the other.
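
For readers who do want to see the calculation, here is a minimal sketch. It assumes the population convention (divide by the number of values); the sample convention divides by one less. The data values are made up so that the numbers come out round.

```python
from math import sqrt, isclose
from statistics import pvariance, pstdev

# Hypothetical data chosen so the results are whole numbers.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(values) / len(values)

# Variance: the average squared difference between each value and the mean.
variance = sum((v - mean_value) ** 2 for v in values) / len(values)

# Standard deviation: the square root of the variance.
std_dev = sqrt(variance)

print(mean_value, variance, std_dev)

# Cross-check against the standard library's population versions.
matches_library = (isclose(variance, pvariance(values))
                   and isclose(std_dev, pstdev(values)))
```

Because the standard deviation is the square root of the variance, it is expressed in the same units as the data, which is why it is the one usually quoted.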

  • And now we can combine this, if we know the range of our data, to kind of get a better feel for the data.

  • And so let's use an example where we have two different countries, countries A and B, and they have the same mean height for women, which in this case we'll say is 165 centimeters, or five feet four. And we'll say that the range of heights for them could be identical.

  • So let's say they can range...

  • You know, the range, let's say, could be like 30 centimeters or something, going anywhere from, say, 150 all the way up to 180.

  • Or we can even increase that and say anywhere from as low as 140 up to like two meters or something like that.

  • But let's just keep the range for these the same, and they both have the same mean height.

  • Now, if country A has a standard deviation of five centimeters, which is approximately two inches, and country B has a standard deviation of 10 centimeters, which is approximately four inches, then what you can expect, knowing these values, is that if you go into country A, the people that you're going to see are gonna be much more similar in height.

  • So our standard deviation is lower.

  • That means our values differ less from the mean. And so that means a lot of the women that you're going to see are going to be very close to 165 centimeters, or five feet four, plus or minus two inches.

  • So what you can expect

  • when you go to this country is that a lot of the women are gonna be about that height.

  • Whereas if you go to country B, they have a much larger standard deviation.

  • And so you can't really expect everyone to be about five feet four, because it fluctuates a lot more.

  • And so if you go to that country, you can expect to see a lot more women of different heights, both taller and shorter

  • than five feet four. All right.

  • And so that's how we can use the variance and the standard deviation, or just the standard deviation, to give us a little bit more perspective on our data and allow us to infer some stuff about our data.
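
The two-country comparison can be mimicked with a deliberately artificial data set. The heights below are assumptions constructed so that both lists share a mean of 165 cm and the same range, while the population standard deviations are 5 cm and 10 cm. Real height data would of course be far less blocky.

```python
from statistics import mean, pstdev

# Artificial heights in cm (assumed): same mean and range, different spread.
country_a = [145, 185] + [165] * 30
country_b = [145] * 4 + [185] * 4 + [165] * 24

def share_near_mean(heights, tolerance=5):
    """Fraction of heights within `tolerance` cm of the mean (hypothetical helper)."""
    centre = mean(heights)
    close = [h for h in heights if abs(h - centre) <= tolerance]
    return len(close) / len(heights)

for name, heights in (("A", country_a), ("B", country_b)):
    spread = max(heights) - min(heights)
    print(name, mean(heights), spread, pstdev(heights), share_near_mean(heights))
```

In this toy data a larger share of country A sits close to 165 cm than in country B, even though the mean and the range are identical.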

  • All right, so let's talk about covariance and correlation. Covariance already has the name variance in it, but covariance is measured between two different variables. So say you have two variables.

  • Let's say we've got me drinking coffee in the morning and my general tiredness.

  • So I use these two values and get data points: this is how much coffee I drink in the morning, and this is how tired I feel this morning, or something like that.

  • And so what the covariance does is it looks at how much one of these values changes when I change the other one.

  • So what does that mean,

  • for example?

  • Well, if I drink more coffee, what the covariance would look at is how much my tiredness changes.

  • So that's what you do with covariance.

  • You say, "I change one;

  • how much does that affect

  • the other thing that I look at?" And correlation is very similar to covariance.

  • We kind of normalize the covariance by dividing by the standard deviation of each variable.

  • So what that means is we get the covariance for me drinking coffee versus feeling tired, and then we would just divide by the standard deviation of me drinking coffee and the standard deviation of me feeling tired.

  • And so really, what we're doing with the correlation is we're just bringing it down to relative terms that fit our data better.
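
The relationship between covariance and correlation can be sketched as follows. The coffee and tiredness numbers are invented, and the population convention (divide by the number of points) is assumed throughout.

```python
from math import sqrt

# Hypothetical observations: cups of coffee and a tiredness score on four mornings.
coffee = [0, 1, 2, 3]
tiredness = [8, 7, 4, 3]

n = len(coffee)
mean_coffee = sum(coffee) / n
mean_tired = sum(tiredness) / n

# Covariance: average product of the two variables' deviations from their means.
covariance = sum((c - mean_coffee) * (t - mean_tired)
                 for c, t in zip(coffee, tiredness)) / n

# Standard deviation of each variable on its own.
std_coffee = sqrt(sum((c - mean_coffee) ** 2 for c in coffee) / n)
std_tired = sqrt(sum((t - mean_tired) ** 2 for t in tiredness) / n)

# Correlation: covariance divided by both standard deviations.
correlation = covariance / (std_coffee * std_tired)

print(covariance, round(correlation, 3))
```

The covariance is negative here because tiredness drops as coffee goes up. Its size depends on the units used, whereas the correlation always lands between negative one and one.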

  • So that's kind of the abstract idea.

  • And the important thing to just keep in mind is that we're looking at one, we're seeing how much that changes,

  • and we're seeing how much that change affects the other one.

  • All right, so there are different types of correlation values that we can have, and they can range anywhere between negative one and one.

  • So their domain is between negative one and one, and a correlation of one means a perfect positive correlation.

  • So that means when one variable goes up, the other goes up.

  • So for my coffee example, that would be if I have coffee in the morning, then I also feel happier.

  • So the more coffee I have, the happier I feel.

  • And of course, there's gonna be a limit.

  • But let's say I only drink up to two cups of coffee or something like that, and I can drink anything in between, and the more I have, the happier I am about it.

  • So that would be a positive correlation.

  • The more I have of coffee, the more I have of happiness, and so they would kind of go up together. And then we get closer to zero.

  • Zero is gonna mean no correlation to us.

  • So anything between zero and one is going to be a kind of slightly positive correlation.

  • It's not gonna be super strong, and we'll actually see some examples on the next slide. But yeah, anything between zero and one is gonna be a kind of slight positive correlation, not super strong.

  • And then the closer you get to zero, the more it means no correlation.

  • So an example for the zero case would be that it doesn't matter how much coffee I drink in the morning.

  • It's not gonna affect the weather.

  • They're unrelated.

  • One does not affect the other.

  • So I could drink, You know, one cup of coffee during a sunny day and one cup of coffee during the rainy day.

  • And it's not going to change the weather.
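
The two cases described so far, a correlation of one and a correlation of zero, can be checked with a small helper. The function name `pearson` and all the numbers are assumptions for illustration. The rainfall values were picked so that the linear relationship cancels out exactly.

```python
from math import sqrt

def pearson(xs, ys):
    """Population correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)

cups = [0.0, 0.5, 1.0, 1.5, 2.0]         # up to two cups of coffee
happiness = [3.0, 4.0, 5.0, 6.0, 7.0]    # rises in lockstep with coffee
rain_mm = [4.0, 1.0, 0.0, 1.0, 4.0]      # hypothetical weather readings

coffee_vs_happiness = pearson(cups, happiness)
coffee_vs_weather = pearson(cups, rain_mm)
print(round(coffee_vs_happiness, 6), round(coffee_vs_weather, 6))
```

The first pair comes out at one (up to floating-point rounding) because happiness rises by a fixed amount per cup. The second comes out at zero, matching the coffee-and-weather case.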


Intro to Data Science - Crash Course for Beginners

Posted by 林宜悉 on January 14, 2021