Well, we're looking at chocolate datasets today, so I thought I'd bring some research in. Yeah, good, and definitely relevant.

We've been looking at techniques like data visualization to try and explore our data and start to draw some initial conclusions or hypotheses. Now we're going to start to move towards modeling our data and actually trying to extract proper knowledge from it, because remember, just because we've got data doesn't mean we've got knowledge. This is going to be an iterative process. We're going to need to clean up our data to make it as useful as possible, we need to transform it so that we can combine datasets together and statistically measure them, and sometimes we're going to need to reduce it, if our dataset is too big and unmanageable. This combination of cleaning data, transforming data and reducing data is a kind of cycle that we iterate until our dataset is in the smallest, most useful form it can be. If we've got redundant variables, which are basically the same as others, or we've got duplicates, these are all problems we need to sort out, because if we're analyzing data with these kinds of issues we're just making our life slightly more difficult. It's computationally inefficient, and in the worst case we could draw the wrong conclusions.

You might be surprised and disappointed when you get your first dataset that it's not quite as nice as you were hoping, and it's going to need some cleaning up. Maybe there are missing values. Maybe there are outliers that need to be dealt with, because they're warping your distributions and your medians and means. And perhaps you've also got noise in your dataset. These few things we can start to address with cleaning. So cleaning data is essentially the idea of trying to correct or fill in any missing values, or remove those bits completely. You might be surprised that there's missing data at all. I mean, what, were we not paying attention?
Like, we've got one job, and that was to collect the data, and we seem to have got missing data. But actually it's quite common, because, for example, if you're trying to track patient records over time, maybe a patient didn't show up to an appointment, or maybe they were in hospital but weren't there when they needed to have their temperature taken, and then the trend line of their temperature over time is going to be missing some values. Maybe you've got data going back hundreds of years, and they didn't have certain techniques and certain measurement systems back then, so they only have other kinds of data. So missing data is very common, and we're going to have to be able to deal with it.

The dataset we're looking at today is a set of ratings for chocolate bars. This is why I ate my chocolate, or at least that's what I'm telling myself. So we'll read the data in. We've got ten different variables and nearly 1,700 observations, and let's have a quick look using the summary. We've got things like the company that produced the chocolate, the name of the chocolate, reviews, the cocoa percentage, the type of bean, this kind of information. You can imagine what you might do if you were trying to produce better chocolate: look at a huge amount of this kind of data and work out what it is that customers like and what it is they don't like. This is going to be quite common in market research.

So the first thing we're going to do, right, we've received this data, we know now what the columns are, but we don't really know anything else. So we're going to have to start looking through and seeing, first of all, is there any missing data? We're going to use the sapply function for this. The sapply function will apply a function over our dataset, so for each column, or each attribute, of our data. The function we're going to apply we're writing ourselves: it's going to be the sum of any time where an item is either blank or NA. Now, blank means an empty string, and NA means literally "not applicable", which is something that comes up in data from time to time. In any case, both of these are missing values, and so we're going to treat them both the same.

If we apply this to our chocolate dataset, we're going to see that, for example, there are eight missing names, there are seven missing review dates and there are four missing cocoa percentages. So across the rows in our data, there are four rows where the cocoa percentage is missing. Now, that's not too bad; I mean, this is a dataset of nearly 1,700 items, so four is not too bad, that's quite expected. You might imagine that if you're pooling this data from lots of different sources, people are going to forget to add data in, or they weren't able to record data on that day. There's a huge number of reasons why you might have missing data. It starts to become a little bit of a problem when we look at things like bean type, because bean type has got 1,200 missing values. That's a huge proportion of the dataset, and in that case we might have to do something about it.

The only issue we've got is that 1,200 is not relative; it's just a number of rows, not relative to the size of the dataset. So we're going to use the exact same function to count empty entries, but this time we're going to divide by the total number of rows, so we get a percentage for how much missing data we've got. We can see, for example, that company name has zero missing data, whereas bean type has 74% missing data.
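In R, that whole first pass might look something like this. This is a minimal sketch: the file name "chocolate.csv" and the exact column layout are assumptions, not necessarily what the video uses.

```r
# Read the ratings in ("chocolate.csv" is an assumed file name).
choco <- read.csv("chocolate.csv", stringsAsFactors = FALSE)
summary(choco)  # quick look at the ~1,700 observations and 10 variables

# Count missing entries per attribute, treating empty strings and NA alike.
count.missing <- function(x) sum(x == "" | is.na(x), na.rm = TRUE)
missing.counts <- sapply(choco, count.missing)
missing.counts

# The same counts, relative to the size of the dataset.
missing.pct <- missing.counts / nrow(choco)
missing.pct
```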
So that's a huge problem. A general rule of thumb is that if over half your data is missing, it's going to be quite hard to estimate or guess what that data should be; that's when you might want to start thinking about removing it. So what we want to do is extract the names of any of our attributes that have, let's say, over 60% missing. We're going to start by calculating all those percentages and saving them in a variable, and then we're going to select only those attributes where the value is over 0.6, right, 60%. And that is just bean type, at 0.74, or 74%.

We can now delete bean type ourselves: we could say something like choco, all the rows, bean type, is NULL, and by setting it to NULL that's just going to delete the column. We can also do it automatically, by passing in those attributes we just calculated as a parameter. So that would be this line here: something like choco, the names of any attributes where the percentage missing is greater than 0.6, and that's going to delete bean type. There's not a lot we can do about bean type; we've only got about 25% of that attribute's values, and that's not enough to start guessing what the bean types are going to be in the other chocolate bars.

Let's have a look now at our rows of data. Each instance is going to have a number of attributes; there are nine left, and we really want to keep the instances that have the majority of their data. So we're going to use apply, and this is going to be row-wise, dimension 1, that's the rows. We're going to count anything that's blank or NA for each row of our dataset, and we're going to put this into a variable called missing. What that's going to do is return a list of values, one for every single row, that tells you how many missing items there are in that row. We can now look at the first few: we're going to order them, largest first, and show just the first few, and you can see that some of them are missing seven and six attributes. That's quite a serious situation, because there are only nine attributes, right? They've only got a couple of entries in their fields.

Now let's do this again as a percentage of the number of attributes. This is exactly the same thing, but this time we're dividing by the number of columns, which is nine, and we're going to have a look at the top of these. You can see that some of these instances are missing 77% of their attributes; that's a real problem. Missing is the same length as the number of rows we've got, so we can look up any rows where there's a greater percentage of missing values than we want, and just remove them from the dataset. The way we do that is a bit like this: choco is choco, anywhere where missing is less than 0.7, and then all the columns. What that's going to do is select only the rows we want, where they've got a decent amount of data.

So my choco dataset is going to be a little bit smaller now, but much more useful to us. We don't really want to be trying to do things like machine learning or statistics when 70% of some of the data is missing; that isn't going to be a good idea. So it's quite easy just to delete data, right?
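Sketched in R, the column deletion and the row filtering could look like this, reusing the choco data frame, the count.missing helper and the missing.pct vector from above; the 0.6 and 0.7 thresholds are the ones from the video.

```r
# Drop any attribute with more than 60% missing -- here that's just bean type.
too.sparse <- names(missing.pct[missing.pct > 0.6])
choco[too.sparse] <- NULL

# Count missing entries per row this time: MARGIN = 1 applies row-wise.
missing <- apply(choco, 1, count.missing)
head(missing[order(missing, decreasing = TRUE)])  # worst rows first

# As a fraction of the nine remaining attributes...
missing <- missing / ncol(choco)

# ...and keep only the rows that are less than 70% missing.
choco <- choco[missing < 0.7, ]
```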
I mean, in some sense, it's just more convenient to do that. In general, the rule is that if you've got more than 50 or 60% missing data, it's a good idea to delete it: delete either the instances or the attributes, depending on how much data you've got missing and where. If you're missing a huge amount of data, then you're not going to be able to recreate it by, let's say, using an average; with so little data, an average isn't going to be reliable. If we have got sufficient data that we could maybe start to infer what these missing values might be, we can start to try and replace them instead of deleting them.

So what we might do is, for example, set them all to zero. Maybe if an attribute is missing we can say, well, okay, if it's missing we'll just not use it and we'll say it's zero. Whether you do that is going to depend on what the attribute is; sometimes zero is not a useful property, and we'll look at that in the chocolate dataset in a moment. What we might also do is start to put the dataset mean into those attributes. Maybe we don't know what the rating for this chocolate bar is, but we can guess that it's going to be around the average rating for any chocolate bar. Again, this is going to depend on your situation: you're still making up data in some sense, so you've got to be very careful about what you do here.

So we've deleted as much of our chocolate data as we feel comfortable doing. Now let's see if we can fill in some of the missing values with appropriate replacements. Let's have a look at our attributes. We've got the company name, the reference, things like this; bean type has been removed, but we've still got things like the bean origin and the ratings, and there are a few of these missing from our dataset. Can we estimate these rather than completely removing them from the dataset? Obviously, the less data you use, the less useful things like machine learning are going to be. So let's look at an attribute and see what we can do.

If we look at bar price, that's the price of each chocolate bar, we can see that there are a few missing values, somewhere around 3%. That's something we want to deal with, but we've got enough data, you know, 97%; maybe we can start to guess what the prices of these chocolate bars might be. Now, this is a good instance of a time when you wouldn't want to just populate with zeros: no chocolate bar is free, I wish. So what we need to do is produce a reliable value to represent an unknown price, rather than just setting them all to zero. What we could do here is something like this: we could set every missing bar price to be the average of all the chocolate bar prices, and that way at least we're not warping our distribution up or down; we're keeping it exactly the same. We're going to say: for the chocolate dataset, for any row where bar price is NA, we're going to set the bar price to be the mean of all the bar prices, and we're obviously going to remove any NAs from that calculation, otherwise it's not going to work. And that's worked: if we recalculate our missing values, you'll see that bar price now has zero missing values. So we've fixed that problem, great.
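As a sketch, the mean imputation is one line; the column name Bar.Price is an assumption.

```r
# Replace every missing bar price with the mean of the known prices.
# na.rm = TRUE drops the NAs from the calculation; otherwise mean() returns NA.
choco[is.na(choco$Bar.Price), "Bar.Price"] <-
  mean(choco$Bar.Price, na.rm = TRUE)

sum(is.na(choco$Bar.Price))  # now 0
```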
So that was quite an easy one. Bar price seems to me to be quite an intuitive case where you would just calculate an average and put it in. Well, actually, maybe not, because bar price might depend on where in the world we're selling it, or on what company is producing the chocolate bar. So could we do the same thing for rating? If we take the sum of all the NA values in rating, it's eight; so there are eight chocolate bars for which there is no rating.

What we can do is something called a stratified replacement. We could say: well, let's group our chocolate bars by country or by company, calculate those averages, and then we can specifically fill in a company's missing ratings based on what that company actually shows in the market, rather than just an average over everything. So we're going to calculate an aggregate function over the ratings, by company, and we're going to calculate a median. The median is a little bit more robust to outliers: maybe a company makes a very, very expensive or very, very cheap line; the median will still give us the middle value. So this is going to be per company, and we can set the column names to be a little bit more helpful using colnames. Now per company, if we look at it, is each company and the median rating of its chocolate bars, from, I think, one to five in this dataset.

So now we know that data per company, we can actually fill those values in. You could automate this process, but we don't have much missing data, so let's just show an example of doing it by hand. This is the line of code we're going to use, and I'll talk through it. We're going to say: for the chocolate dataset, for any value where the chocolate rating is NA, so missing, and the company is Vai Kuna, we want to set the rating to be equal to the Vai Kuna entry in our new per-company median table, and that's going to fill in that value there. We do this for all the companies with missing ratings, and what we're going to find is that we've replaced all our missing values with appropriate medians for those ratings, per company.
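A sketch of that stratified replacement, assuming columns called Company and Rating; "Vai Kuna" here just stands in for whichever company name appears in your data.

```r
# Median rating per company -- robust to the odd unusually good
# or bad line a company might make.
per.company <- aggregate(Rating ~ Company, data = choco, FUN = median)
colnames(per.company) <- c("Company", "MedianRating")

# Fill in one company's missing ratings by hand, as in the video.
choco[is.na(choco$Rating) & choco$Company == "Vai Kuna", "Rating"] <-
  per.company[per.company$Company == "Vai Kuna", "MedianRating"]

# Or automate it: fill every missing rating from its company's median.
idx <- is.na(choco$Rating)
choco$Rating[idx] <-
  per.company$MedianRating[match(choco$Company[idx], per.company$Company)]
```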
So the last thing we might find in our data is outliers. Let's imagine we do a box plot of cocoa percentage. Maybe our assumption is that cocoa percentage in some way informs what the rating is going to be, because maybe a higher cocoa percentage tastes nicer; I don't really know about chocolate. If we look at this box plot, what we'll see is that we've actually got quite a tight distribution of cocoa percentage, between about 50% and just above 80%. But you can see there are three outliers. When R produces a box plot, what it will show as outliers is anything that is more than three standard deviations away from the median. What we do with these outliers is going to be a judgment call; it's going to depend on the situation.

For example, we have an outlier here which is above a hundred percent. Now, that makes no sense: we can't have a chocolate bar with more than a hundred percent cocoa. So that is obviously a mistake. We would delete that item, and probably delete the whole row, or we'd estimate that value based on a stratified average or a different average. For these lower ones, it's a judgment call: one is just above 20% and one is up closer to 30%. I don't know whether those are outliers or not. Is it possible to make a viable chocolate bar with 20% cocoa? I mean, it may be; you're going to have to know a little something about the situation your data was collected in, and whether that's reasonable. So you might, for example, delete the bottom one as a mistake but keep the top one, because that's just a low amount of cocoa.

So this is what cleaning data is about. We're going to have missing data, we're going to have outliers, we might have noise, and you're going to have to look through your data and try to condense it and remove all these problems. We do this so that we can more effectively transform our data later, and also reduce our data if we need to, and then eventually your dataset is going to be really nice, so that we can do things like modeling or machine learning on it.

My fuel economy is measured in miles per gallon, but of course, I don't pump fuel in gallons, I pump it in liters. And when I run anywhere, short distances, I run in kilometers, and I run in kilometers per hour. So I'm using two different systems there.