Data Analysis 2: Data Visualisation - Computerphile

Let's talk about data visualization so that we can avoid problems like this, where we've got some kind of graph and who knows what it means?
Loads and loads of lines, none of them labeled. I think the thick one is more important. That's what I've learned from this.
Data visualization is another method we can use, along with statistics, to have a look at our data, explore our data and try and work out what's going on.
It's a way of trying to understand our data better so that we can then perform more rigorous statistical tests, or actually start to draw conclusions or model our data.
It's a very important tool, but you've got to use it properly.
You can't just plot anything and everything.
Every chart you use has got to support your hypothesis, or it's got to try and show the story you're trying to tell, right? You don't just plot something because it could be plotted. There's got to be a point to it.
There are a lot of problems with using inappropriate graphs and only picking subsets of your data. That's a huge problem, right?
That is not just a problem for data visualization. That's a problem for your statistical tests as well.
If you're only using some of your data, is that okay?
It's going to depend on the situation, right? But I think there's a strong argument for saying you've got to be really, really careful, and you've got to be really structured and regimented and document everything you do. The core problem with visualization is that people just plot stuff and they do it badly.
Maybe they use an inappropriate plot type or they don't scale the axes properly, and that leads to huge misunderstandings and can actually be quite misleading, right?
This happens a lot in the media.
So, for example, you might get a sort of political message through your door that says these are the different parties.
So this is party one, this is party two, this is party three, and maybe party one's got this many votes and party two's got this many votes and party three, too, right down here. And party two are trying to make the case that just a few more votes and they're going to win in this area.
But actually, written down here, this is twenty thousand and this is ten thousand and this is, you know, eight thousand, and that's just in the small labeling they've got here.
They've completely skewed the axis, right? Ten thousand is half of twenty thousand, yet here we are, up here. If you misuse plots, it's actually misleading. When it's on your own data, you're going to draw the wrong conclusions and then spend quite a while researching into an area that doesn't make sense and ends up in failure. Or, if it's something you're presenting to someone else, you can mislead that person, whether intentionally or by accident.
And that's never a good thing. I'm back in R, and I just wanted to show a couple of plots where it's not misleading necessarily, but you can easily infer the wrong kind of information, right?
So there are these websites online you can go to to look at the ratings for different TV shows. Now, one of my favorite TV shows is Frasier, right?
I think it's amazing, and if you go on to these sites and you plot the ratings for all these Frasier episodes, it's all over the place.
Sometimes it's very highly regarded and sometimes it's not. So I'm just going to plot this using the ggplot tool, and we can see if we look at the graph that it's absolutely everywhere, right? You've got good episodes, you've got bad episodes, and it seems to maybe be going slightly downhill towards the end.
But it's difficult to say, right, because it's all over the place.
Now what's actually happened is I've just plotted using a default function and it's auto-scaled my rating axis, right?
So my y-axis is the rating of the episodes, and it's going between seven and about nine and a half. Now that isn't representative, because it's spreading out my data. If I plot the exact same data, but this time from naught to ten like an actual rating system, you can see that most episodes get almost the exact same rating, somewhere between around seven and a half and eight.
Which I think's pretty good.
I would rate them a 10, but, you know, that's just me. You can see that if you're not careful, even if you do it by accident, auto-scaling of axes and things like this can cause a real problem. Another classic example you'll see in the news is when they show something like a currency exchange rate.
So if we look here, I've downloaded some sample data of the Japanese yen versus the US dollar, and I've simplified this by extracting just a period of about 60 days from the middle of some time period.
I can't remember exactly what it is.
If we plot this, you can see that actually there's a big sort of cliff edge.
Something terrible has happened around day 30 and the value of the Japanese yen is just plummeting.
And of course, this is absolute nonsense, right? Because this scale goes between 108 and 114.
And so if we plot it with proper axes on, you can see that actually it's almost completely flat.
If your business relies on the exchange rate of the Japanese yen to the US dollar, obviously these small changes might be important, right? But if you're presenting this in the news, it's very easy to claim that something terrible's happened when in fact maybe this is just a normal blip up and down, right?
So you can misuse plots to serve your purpose, or you can do it accidentally and waste a huge amount of time.
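To put a rough number on the axis-scaling effect described above, here is a minimal Python sketch (the video itself works in R and shows no code here, so this is only an illustration). The exchange-rate readings are made up to sit between 108 and 114, and the helper names `visible_fraction` and `apparent_ratio` are hypothetical.

```python
def visible_fraction(values, axis_min, axis_max):
    """Fraction of the axis height that the data's own range occupies."""
    spread = max(values) - min(values)
    return spread / (axis_max - axis_min)


def apparent_ratio(a, b, baseline=0.0):
    """Ratio of two drawn heights when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)


# Made-up readings in the 108 to 114 range mentioned in the transcript.
rates = [113.8, 113.5, 113.9, 112.0, 109.1, 108.4, 108.9, 108.2]

# Auto-scaled axis: limits are taken from the data itself.
auto = visible_fraction(rates, min(rates), max(rates))
# Axis that starts at zero (the upper limit of 120 is an arbitrary choice).
full = visible_fraction(rates, 0, 120)

print(f"auto-scaled axis: data fills {auto:.0%} of the height")
print(f"zero-based axis:  data fills {full:.0%} of the height")

# A value of 114 next to 108 is about 6% larger, but drawn from a
# baseline of 107 it appears seven times as tall.
print(apparent_ratio(114, 108))
print(apparent_ratio(114, 108, baseline=107))
```

On the auto-scaled axis the drop fills the whole plot, while on the zero-based axis the same drop takes up only a few percent of the height, which is why the "cliff edge" flattens out.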
Let's have a look at the standard plots you might see and could use on a very basic level, and see what they're appropriate for, right? Because one of the most important things is that you use these plots and these charts appropriately. All right, so perhaps the most common one that everyone sees is going to be a bar chart.
You've got two axes.
You've got some kind of attributes or labels down here, and then you've got some quantity or amount of some attribute here.
And then you're going to have different bars like this.
Now, this is a very nice graph to use. It's simple but it's effective, because you can very easily see what the differences between these different levels are, right? So it's often going to be your go-to graph for lots of things.
Right. Now, some people try and replace this graph with a pie chart, right? This is a bad idea in general.
I mean, I like pie as much as the next person, but if you've got different things like this and one of them is big, I mean, you can see that this one's bigger than this one, but how much bigger is it?
I don't know.
You can't see the relative sizes quite so easily. This all gets worse if you combine this into a doughnut plot, and then you've got multiple pies embedded in each other, none of them align and nothing makes any sense anymore, right?
So if in doubt, don't use a pie chart. It's a bad idea. I mean, they look very nice for presentations.
That's about all I can say for them. If we're going to be measuring some kind of quantity, then a bar chart's going to be what we want, right? But what we might also do is replace quantity with the frequency or the amount of something.
So this is going to be frequency. Again we've got our labels on the bottom here, and these are going to be bins for some single attribute.
So this is maybe naught to 10, this is maybe 10 to 20 of whatever the thing is.
And this is a frequency, the amount that fall into that range. What this allows us to do is work out very easily what the distribution is. Is it normally distributed? Is it bimodally distributed, with two peaks?
Is it skewed to the left or skewed to the right?
We can see very easily the shape of our data, and it can be really helpful.
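The binning described above (naught to 10, 10 to 20, and so on, with a count per bin) can be sketched in a few lines of Python. This is an illustration only: the values are made up, the function name `histogram` is hypothetical, and each bin is assumed to include its lower edge and exclude its upper edge.

```python
def histogram(values, low, high, bin_width):
    """Count how many values fall into each [start, start + bin_width) bin."""
    n_bins = int((high - low) / bin_width)
    counts = [0] * n_bins
    for v in values:
        # Values outside [low, high) are ignored in this sketch.
        if low <= v < high:
            counts[int((v - low) // bin_width)] += 1
    return counts


# Made-up measurements of a single attribute.
values = [3, 7, 12, 15, 18, 21, 22, 25, 27, 34, 38, 45]
counts = histogram(values, low=0, high=50, bin_width=10)

# Crude text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    start = i * 10
    print(f"{start:>2}-{start + 10:<2} {'#' * count}")
```

Even in this text form the shape is visible: a single peak in the 20 to 30 bin with a tail out to the right.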
Another way of looking at the shape, or the range of our data in particular, is a box plot, right?
Now, you'll see box plots come up from time to time in scientific documents, but they're very easy to produce in tools like R and they can be quite useful.
So here we're going to have a single attribute, so some label again or some attribute here, and this is going to be the quantity of this attribute.
And what a box plot does is label the range of that data.
So we're going to have a box here, and it's going to look a little bit like this.
I'll use a different color pen.
This line in the center is our median, typically. Then this is going to be the third quartile here, and this is going to be the first quartile, and then these are the max and the min. In this one plot, we've got the absolute range of our data.
We've got where 50% of our data is, this interquartile range here, and we know where the midpoint of our data is.
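The five numbers a box plot draws (minimum, first quartile, median, third quartile, maximum) can be computed with Python's standard library, as in the sketch below. The data is made up and `five_number_summary` is a hypothetical helper. Note that tools differ in exactly how they interpolate quartiles, so a plotting library may place the box edges slightly differently from the "inclusive" method assumed here.

```python
import statistics


def five_number_summary(values):
    """Return the min, quartiles and max that a box plot is drawn from."""
    # quantiles(..., n=4) returns the three cut points between quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]  # the box itself: the middle 50%

print(summary)
print("interquartile range:", iqr)
```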
So we can very easily see whether we've got outliers, and we can plot this next to a different attribute. We can have two box plots next to each other and see very quickly a comparison between these two things, so that can be really useful. Now the final ones, right?
We're going to be talking about scatter plots and trend lines. All right, so a scatter plot is very simple. We've got two attributes, this is attribute one and this is attribute two, and we want to see how they vary with respect to each other.
So when one goes up, does the other one go up or does it go down? Are they even related at all?
So you'll see something like this, and it'll be all over the place often.
But you can see maybe there's a kind of trend where, as attribute one increases, attribute two increases, right?
Now, this is a correlation being shown here, not a causation. So you can't say they're definitely causally related, but you can say that, generally speaking, when one is big so is the other. That's sometimes useful.
A trend line is going to be where we're plotting something over time.
So this has to be a continuous variable, or at least a variable we believe can be inferred between our points, because it's unlikely that you're going to have all the points.
So what you might have is a plot where you've got time down here, so maybe time in months, for example.
And we've got some amount of something, and we're just going to plot it like this, and we can sort of have a trend line going like this. If it's a situation where we can infer the amount between two time points, then this is okay.
Right, because we can say: well look, we've got a reading here, we've got a reading here.
It's reasonable to assume that between these two points, this is the amount.
All right? Nothing too funny's gone on between these two points, right?
If you can't assume that, then you shouldn't really be using a trend line and you probably want to be using a bar graph.
Does that depend on the kind of data, then? Yes, it'll depend on it.
This is a judgment call based on the kind of data.
So, I mean, time is a good example. We don't tend to measure in infinitely small increments.
We're going to be measuring daily or hourly or something like this.
But we can kind of make an assumption a lot of the time about our readings. Temperature over time, for example: if you're at 20 and then the next hour you're at 25, halfway between those two times we're probably about halfway between those two readings, right?
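The assumption behind drawing a line between two readings is linear interpolation. A minimal Python sketch of the temperature example follows; the function name `interpolate` is hypothetical, and time is measured in hours from the first reading.

```python
def interpolate(t, t0, y0, t1, y1):
    """Linearly interpolate the value at time t between two readings."""
    if not t0 <= t <= t1:
        raise ValueError("t must lie between the two readings")
    fraction = (t - t0) / (t1 - t0)  # how far along the interval we are
    return y0 + fraction * (y1 - y0)


# A reading of 20 at hour 0 and 25 at hour 1, as in the transcript.
print(interpolate(0.5, 0, 20.0, 1, 25.0))  # 22.5
```

This is only meaningful when the horizontal axis is something you can be "partway along", which is the judgment call being described.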
It's going to depend on your data.
I mean, a good example would be if you were plotting something like operating system usage per student.
So we've got OS X here, Linux here and Windows here. This many people use OS X, this many people use Linux, this many people use Windows.
Well, these are discrete data points. You can't fit a trend line to these. There is no operating system that's 50% between Linux and Windows that I know of, and we can't infer how many students are going to be using it. That makes no sense. That should be a bar chart.
So let's look at an actual data set and see how we can use some of this visualization in practice.
I've got here a chicken data set, and this data set is about weighing chickens on different diets over a period of weeks and also measuring how many eggs they produced.
I'm not a farmer, but let's imagine that what we wanted to do was see if one of these diets produces a better weight gain and maybe more eggs per week. Let's have a look.
So I'm going to load the chicken data set. This is stored in a CSV.
Just like before, let's have a quick look at just the first few rows of this data to see what they look like.
So that's going to be the head function, and you can see we've got six attributes.
We've got the week that the measurement was taken; the chicken, in this case chicken number one, but there'll obviously be other chickens; the diet they're on, diet A, diet B or diet C; the age of the chicken in months; the weight of the chicken in kilograms; and the number of eggs they produced that week. All right, so there are going to be lots of combinations of weeks and chickens in this data set.
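For readers following along outside R, here is a rough Python equivalent of loading a CSV and peeking at the first few rows. Everything in it is an assumption for illustration: the column names, the sample rows and the helper `head` are invented, and the data is read from an in-memory string rather than the actual file used in the video.

```python
import csv
import io
from itertools import islice

# Invented stand-in for the chicken CSV; names and values are hypothetical.
SAMPLE = """week,chicken,diet,age,weight,eggs
1,1,A,61,3.1,5
1,2,B,212,3.4,3
1,3,C,305,3.3,2
2,1,A,62,3.2,5
"""


def head(text, n=6):
    """Return the column names and the first n rows of CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(islice(reader, n))
    return reader.fieldnames, rows


columns, rows = head(SAMPLE, n=2)
print(columns)
for row in rows:
    print(row)
```

Note that `csv.DictReader` leaves every value as a string, so numeric columns would need converting before any arithmetic.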
Now what we want to try and do is see if there's any kind of relationship between the diet they're on and the number of eggs they're producing, or the weight of the chicken, or anything like this.
So the first thing we could do is have a look at the aggregate function.
I'm going to paste this down here and we'll talk through it. What the aggregate function does is let us produce, let's say, a summary, or calculate some means or medians over a data set, but this time grouping by a certain attribute.
So in this case, what we're going to do is aggregate the weight of the chickens, grouped by their diet.
So all the As, all the Bs and all the Cs, and then for each of those we're going to calculate a summary.
So let's run that, and you can see that we've got our groups down here. For A we've got the minimum, the maximum, the median and the mean, and we can see some slight differences, perhaps, between these groups.
I mean, the mean for group A, for example, is 3.8, whereas the mean for group C is 3.4.
So maybe there's a slight difference in these things. Okay, so let's try a different aggregate function.
So this time we're going to aggregate the number of eggs produced, grouped again by the diet.
So this is going to be all the As, all the Bs and all the Cs, and then we're going to produce a summary.
We can see that the median number of eggs produced for group A is 4 per week, and for group B and group C it's 3 per week. So maybe, again, there's a slight difference.
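The group-then-summarise idea behind these two aggregate calls can be sketched in plain Python as below. This is not the R call from the video, just the same idea written out; the egg counts are invented and `summarise_by_group` is a hypothetical name.

```python
import statistics
from collections import defaultdict


def summarise_by_group(pairs):
    """Group (key, value) pairs by key and summarise each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {
        key: {
            "min": min(vals),
            "median": statistics.median(vals),
            "mean": statistics.mean(vals),
            "max": max(vals),
        }
        for key, vals in sorted(groups.items())
    }


# Invented (diet, eggs per week) observations.
observations = [
    ("A", 4), ("A", 5), ("A", 4),
    ("B", 3), ("B", 3), ("B", 4),
    ("C", 3), ("C", 2), ("C", 3),
]

summary = summarise_by_group(observations)
for diet, stats in summary.items():
    print(diet, stats)
```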
We're starting to learn a little bit about our data. So let's start with a histogram, right?
So what we're going to do is use this histogram function.
Most of this is just labels; the hist function in R produces a histogram.
And we're going to produce a histogram of the ages of the chickens. So what's the distribution of the ages?
Are they old? Are they young?
And we're going to use 15 breaks.
That means we're going to take the whole range and break it into 15 columns, 15 bands, right?
Now actually, R will do a few checks behind the scenes to make sure 15 is an appropriate number, and might adjust it up or down slightly.
So we can see from this histogram that, broadly speaking, our chickens are evenly distributed among the different ages.
We've got some young ones that are sort of 60 or 70 weeks old, older ones that are 350 weeks old, and then for some reason we've got a peak around 250.
I don't know why that is, but maybe we've got a batch of chickens of a certain age in.
And finally, let's look at the box plot.
So, as we talked about, the box plot will tell us the minimum and the maximum for an attribute, and also the median and the range, right? So this is really helpful.
We're just going to have a look at age, just for all chickens.
So you can see that the median is around 220, something like that, and then the majority of the chickens, so 50% of the chickens, fall between about 150 weeks old and 300 weeks old. But you can see there are some very young ones and some very old ones. This kind of plot will let us really size up where our data sits before we start to make any assumptions.
So let's imagine now that we want to try and drill down into this data a bit and work out whether the diet actually had any effect on the number of eggs or the weight of a chicken, right?
So what we're going to do is use the aggregate function again to calculate the means of all the weights per week. I'm going to copy that down here.
So we're going to say: aggregate the weight of the chickens by both the week and the diet, so combinations: week one, diet A; week two, diet A; and so on. And I want it to calculate the mean for all chickens.
So run that, and that produces some statistics on the average weight of chickens on the different diets over time.
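Grouping by two attributes at once works the same way as grouping by one, except the group key is a (week, diet) pair. The Python sketch below illustrates this with invented records; the field names and the helper `mean_by_week_and_diet` are hypothetical, and the output column is given a descriptive name in the same spirit as the renaming step in the video.

```python
from collections import defaultdict
from statistics import mean


def mean_by_week_and_diet(records, field):
    """Average `field` over all chickens for each (week, diet) combination."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["week"], record["diet"])].append(record[field])
    return [
        {"week": week, "diet": diet, f"mean_{field}": mean(values)}
        for (week, diet), values in sorted(groups.items())
    ]


# Invented weekly weigh-ins, in kilograms.
records = [
    {"week": 1, "diet": "A", "weight": 3.0},
    {"week": 1, "diet": "A", "weight": 3.5},
    {"week": 1, "diet": "B", "weight": 3.0},
    {"week": 2, "diet": "A", "weight": 3.5},
    {"week": 2, "diet": "A", "weight": 4.0},
    {"week": 2, "diet": "B", "weight": 3.0},
]

table = mean_by_week_and_diet(records, "weight")
for row in table:
    print(row)
```

Each row of the result is one point on the per-diet line plot: week on the horizontal axis, mean weight on the vertical axis, one line per diet.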
I'm going to rename the columns so that they're a little bit more informative, so I'll run that line there.
And then finally, we're going to plot this.
Now, we're going to use ggplot for this. Whether you use the inbuilt R plot functions or another library like ggplot will kind of depend on what plot you want to do. In general, you can get quite nice plots with ggplot, but they're a little bit more involved. All right, so I'm going to run this line here.
Looking at this data, we can kind of see that maybe diet A is having a positive effect, right?
So down at the bottom, where no weeks have passed, at the beginning of our experiment, they were roughly the same weight, and then the average weight on diet A actually does seem to increase.
So I guess that's something interesting about our data, right?
Now let's look at number of eggs, right? We're going to do the same thing. This time we're going to aggregate the number of eggs by week and by diet, so I'm going to copy that, and I'm going to give it some helpful labels as well, and then we're going to plot the data. Let's see, over time, whether or not any of the diets have any effect on the eggs. And it's looking pretty good.
All right, so this is the frequency, the number of eggs we're producing, the weeks are the twelve weeks of our experiment, and you can see that diet B and diet C produce roughly the same number of eggs per week.
This is averaged over all the chickens, but diet A produces at least an egg more per week on average.
You know, that's a 20% increase, roughly speaking. If you're a farmer, that's a great thing.
But the problem we've got is that this might be a little bit too good to be true.
What we're seeing here is perhaps an issue of correlation versus causation.
We can see here that there is a correlation between the diet that's being used and the number of eggs, but we don't know that it's the diet specifically that causes it. We're going to look in more detail at the ages of the chickens specifically, because I'm interested to know whether or not older chickens produce more or fewer eggs.
Right, because that could be relevant to our experiment.
Okay, so we're going to group the chickens up by diet and then work out what their average age is, so mean age.
So I'm going to calculate this here, and then I'm going to look at it.
And we can see that the mean age for group A, the chickens on diet A, is only 156 weeks, but the age for, let's say, group C is 248 weeks, so significantly older. All right, so we need to just check that this isn't going to be an issue for the number of eggs laid.
So let's plot the number of eggs versus the age of the chickens, right? Here we're going to generate a scatter plot of age versus the number of eggs.
But we're also going to color by diet, so we can see roughly where the different diets sit. Let's run this.
Okay, so what we can see is that, actually, as chickens get older we do see quite a serious decrease in the number of eggs produced per week, from about four and a half on average down to about two and a half or two on average, right?
And also we can see that diet A is predominantly sitting up here, which means that the chickens are younger.
So this could be a problem.
What we're saying is that it could be that we happen to have put a load of young chickens on diet A, and yes, they're producing more eggs, but that isn't because of diet A. That's because they're younger, right?
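One way to put a number on the downward trend seen in the scatter plot is the Pearson correlation coefficient. The video does not compute it, so the Python sketch below is an optional extra, written out by hand with invented ages and egg counts.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented data: age in weeks and eggs laid per week.
ages = [60, 100, 150, 200, 250, 300, 350]
eggs = [5, 5, 4, 4, 3, 2, 2]

r = pearson(ages, eggs)
print(f"correlation between age and eggs: {r:.2f}")
```

A value close to -1 says older chickens tend to lay fewer eggs. Taken together with the younger average age on diet A, that is exactly why the diet effect cannot be separated from the age effect in this data.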
So let's have a look at a box plot of the age of chickens per diet, and you can see that they're significantly younger on diet A.
So I think the conclusion we can draw is that it's theoretically possible that there's a link between the diet and the number of eggs produced, but we can't really say it from this data. We're going to need a lot more data, maybe, you know, some more chickens, to try and work this out. We've seen a number of different visualizations, and the important thing is that we use visualizations appropriately and we don't make assumptions about our data.
So we're going to start to look at cleaning of data, and then maybe using our data in clustering and classification.
But visualization is a really good way to start off exploring your data and generating some initial hypotheses.
Well, we're looking at chocolate datasets today, so I thought I'd bring some research.
Yeah, good, and definitely relevant.