字幕列表 影片播放
Let's talk about data visualization so that we can avoid problems like this which is where we've got some kind of graph
Who knows what it means?
Loads and loads of lines none of them labeled. I think the thick one is more important. That's that's what I've learned from this
Data visualization is another method we can use along with
Statistics to have a look at our data Explorer our data and try and work out what's going on
It's a way of trying to understand our data better so that we can then perform
You know more rigorous statistical tests or actually start to draw conclusions or model our data
It's a very important tool but you've got to use it properly
You can't just plot anything and everything
Every chart you use has got to support your hypothesis or it's got to try and show the story
You're trying to tell right? You don't just plot something because it could be plotted. There's got to be a point to it
There's a lot of problems with using inappropriate grass and only picking subsets of your data. That's a huge problem, right?
That is not just a problem for data visualization. That's a problem for your statistical test as well
If you're only using some of your data, it's that okay
It's going to depend on the situation right my um, you know
but I think there's a strong argument for saying you've got to be really really careful and you've got to be really
structured and regimented and
Document everything you do. The core problem with visualization is that people just plot stuff and they do it badly
maybe they use the inappropriate plot type or they
Don't scale of axes properly and that leads to huge misunderstandings and actually can be quite misleading, right?
This happens a lot in the media
So, for example, you might get a sort of political message for your door, but says these are different parties
So this is party one
This is party to this is party three and maybe you know party one's got this many votes and party twos got
This many votes and party three two
Right down here and party two are trying to make the case that just a few more votes and they're gonna win in this area
why but actually written down here this is twenty thousand and this is ten thousand and this is you know,
Eight thousand and just in the small labeling they've got here
They've completely skewed the axis right ten thousand is half of twenty thousand yet. Here. We are up here if you misuse plots
It's actually misleading when it's on your own data
You're going to draw the wrong conclusions and then spend
quite a while researching into an area but doesn't make sense or and ends up in failure or if it's if
It's something you're presented to someone else. You can mislead that person whether intentionally or by accident
And that's never a good thing. I'm back in our and I just wanted to show a couple of plots that you know
It's not misleading necessarily, but you can easily infer the wrong kind of information, right so
There's this websites online
You can go to to look at the ratings for different TV shows right now. One of my favorite TV shows is Fraser, right?
I think it's amazing and
If you go on to these sites and you plot the
Ratings for all these Fraser episodes. It's all over the place
Sometimes it's very highly regarded and sometimes it's not so I'm just going to plot this
using the GG plot tool and we can see if we look at the graph that
It's absolutely everywhere. Right? You've got good episodes. You've got bad episodes and it seems to maybe be going slightly downhill towards the end
But it's difficult to say right because it's all over the place
Now what's actually happened is I've just plotted using a default function and it's Auto scaled my rating axis, right?
so my y-axis is the rating of the episodes and it's going between seven and
About nine and a half now that isn't representative because it's spreading out my data if I plot the exact same data
But this time from naught to ten like an actual rating system
You can see that most episodes get almost the exact same rating somewhere between around seven and a half to eight
Which I think's pretty good
I would rate them a 10, but you know
It's just me. You can see that even if you're not careful
If you do it by accident, even auto-scaling a maxi's and things like this can cause a real problem another classic example, you'll see
In the news is when they show something like a currency exchange rate
So if we look at here
we've got our I've downloaded some sample data of the Japanese yen versus the US dollar and I've simplified this by
Extracting just a period of about 60 days in the middle of some time
I can't remember exactly what it is
If we plot this you can see that actually there's a big sort of cliff edge
Something terrible has happened around day 30 and the value of the Japanese yen is just plummeting
And of course, this is absolute nonsense, right? Because this scale goes between 108 and a hundred and fourteen
And so if we plot it with a proper axes on you can see that actually it's almost completely flat
If your business relies on the exchange rate of a Japanese yen to the US dollar
Obviously these small changes might be important right but if you're presenting this in the news
It's very easy to claim that something terrible's happened when in fact actually, maybe this is just normal blip up and down, right so
You can misuse
Plots to serve your purpose right or and you can do it accidentally and waste a huge amount of time
Let's have a look at the standard plots
You might see right and you could use on a very basic level and see you know
What are they appropriate for right because one of the most important things is that you use these plots and these charts
Appropriately, alright, so, you know, perhaps the most common one that everyone sees is going to be a bar chart
You've got two axes
You've got some kind of attributes or labels down here and then you've got some quantity or amount of some attribute here
And then you're going to have different bars like this now
This is a very nice graph to use it's simple but it's effective because you can very easily see what the difference between these different
Levels are right so that you know, it's often going to be your go to graph for lots of things
Right, some people now some people try and replace this graph of a pie chart, right? This is a bad idea in general
I mean
I like pie as much as the next person but if you've got different things
Like this and one of them is big
I mean you can see that this one's bigger than this one, but how much bigger it is?
I don't know
You can't see the relative sizes quite so easily this all gets worse if you combine this into a doughnut plot
And then you've got multiple pies embedded in each other none of them align and nothing makes any sense anymore, right?
So if in doubt don't use a pie chart, it's a bad idea. I mean they look very nice for presentations
That's about what I can say for it if we're going to be measuring some call of quantity then a bar charts going to be
What we want right but what we might also do is replace quantity with the with the frequency or the amount of something
So this is gonna be frequency. This is also our labels again on the bottom here
We've got our labels and this is going to be bins for some single attribute
So this is maybe so naught to 10 that misses maybe 10 to 20 of whatever the thing is
And this is a frequency the amount that fall into that range and what this allows us to do is work out very easily
What the distribution is is it normally distributed, but I'm only distributed with two peaks, you know
Is it suitable left skewed to the right?
We can see very easily the shape of our data and it can be really helpful
Another way of looking at this sort of the shape or the range of our data in particular is a box plot right now
You'll see box plots come up from time to time with scientific
Documents but they're very easy to produce in tools like are and they can be quite useful
So here we're gonna have a single attribute
So some label again or some attribute here and this is going to be the quantity of this attribute
And what a boxplot does is label the range of that data
So we're going to have a box here like this and it's going to look a little bit like this
So I'll use a different color pen
This line in the center is our median typically and then this is going to be the third quartile here
Third quartile and this is going to be the first quartile and then these are the max and the min in this one plot
We've got the absolute range of our data
We've got where 50% of our data is sort of this interquartile range here and we know where the midpoint of our data is
So we can very easily see whether we've got
outliers and we can plot this next to a different attribute and we can have two box plots next to each other and we can
See very quickly, you know a comparison between these two things so that can be really useful now the final ones right?
We're going to be talking about scatter plots and trend lines. All right, so it's got to pop very simple. We've got two
Attributes, this is attribute one and this is attribute two, and we want to see how they bury with respect to each other
So when one goes up does the other one go up or does it go down are they even related to?
So you'll see something like this and it'd be all over the place often
But you can see maybe there's a kind of trend where as attribute one increases attribute two increases right now
This is a correlation being shown here. Not a causation. So you can't say they're definitely related, but you can say that
generally speaking when one is big so is the other that's but sometimes useful a
Trendline is going to be where we're going to be plotting something over time
My so this has to be a continuous variable or at least a variable we believe
Can be inferred between our points like it's unlikely, but you're gonna have all the points
So you what you might have is you might have a plot where you've got time
Down here. So maybe time in mumps, for example
And we've got some amount of something and we're just going to plot it like this and we can sort of have a trendline going
Like this if it's a situation where we can infer the amount between two time points then this is okay
Right because we can say well look we've got a reading here. We've got a reading here
It's reasonable to assume that between these two points. This is the amount
All right. Nothing to funny's gone on between these two points, right?
If you can't assume that then you shouldn't really be using a trendline and you probably want to be using a bar graph
Does that depend on the kind of day to them? Yes, it'll depend on it
This is a judgment call based on the kind of data
So if a data I mean time is a good good example. We don't tend to measure sort of in infinitely small increments
We're going to be measuring daily or hourly or something like this
but we can kind of make an assumption a lot of the time that our readings like temperature for example over time if
You're at 20 and then the next hour you're at 25. We're probably halfway between there to between those two times, right?
It's going to depend on your data
I mean a good example would be if you were plotting something like operating system usage per student
so we've got OS X here, but Linux here and we've got
Windows these many people use OS X this many people uses Linux this many people use Windows
Well bees have discrete data points. You can't fit a trend line to these. There is no operating system
That's 50% between Linux and Windows that I know of and we can't infer
How many students are going to be using it that makes no sense? That should be a bar chart?
So let's look at an actual data set and see how we can use some of this visualization in practice
So I've got here a chicken data set and this data set is about
Weighing chickens on different diets over a period of weeks and also measuring how many eggs they produced
I'm not a farmer, but let's imagine that what we wanted to do was see if one of these
Diets produces a better weight gain and maybe more eggs per week. Let's have a look
So I'm going to load the chicken data set. This is at stored in a CSV
Just like before let's have a quick look at just the first few rows of this data to see what they look like
So that's going to be the head function and we you can see we've got six attributes
So we've got the week but the measurement was taken the chicken in this case of chicken number one, but they'll obviously be other chickens
diet, they're on a diet B or diet see the age of the chicken in mumps the weight of a chicken in kilograms and the
Number of eggs they produce that week. All right, so there's going to be lots of combinations of weeks and chickens in this data set