  • Let's talk about data visualization so that we can avoid problems like this which is where we've got some kind of graph

  • Who knows what it means?

  • Loads and loads of lines, none of them labeled. I think the thick one is more important. That's what I've learned from this

  • Data visualization is another method we can use along with

  • statistics to have a look at our data, explore our data and try and work out what's going on

  • It's a way of trying to understand our data better so that we can then perform

  • You know more rigorous statistical tests or actually start to draw conclusions or model our data

  • It's a very important tool but you've got to use it properly

  • You can't just plot anything and everything

  • Every chart you use has got to support your hypothesis or it's got to try and show the story

  • You're trying to tell right? You don't just plot something because it could be plotted. There's got to be a point to it

  • There are a lot of problems with using inappropriate graphs and only picking subsets of your data. That's a huge problem, right?

  • That is not just a problem for data visualization. That's a problem for your statistical test as well

  • If you're only using some of your data, is that okay?

  • It's going to depend on the situation, right,

  • but I think there's a strong argument for saying you've got to be really really careful and you've got to be really

  • structured and regimented and

  • Document everything you do. The core problem with visualization is that people just plot stuff and they do it badly

  • maybe they use the inappropriate plot type or they

  • don't scale the axes properly, and that leads to huge misunderstandings and actually can be quite misleading, right?

  • This happens a lot in the media

  • So, for example, you might get a sort of political message through your door that says these are the different parties

  • So this is party one

  • This is party two, this is party three, and maybe, you know, party one's got this many votes and party two's got

  • this many votes and party three's

  • right down here, and party two are trying to make the case that just a few more votes and they're gonna win in this area,

  • right? But actually, written down here, this is twenty thousand and this is ten thousand and this is, you know,

  • eight thousand, and that's just in the small labeling they've got here

  • They've completely skewed the axis, right? Ten thousand is half of twenty thousand, yet here we are up here. If you misuse plots,

  • it's actually misleading. When it's on your own data,

  • You're going to draw the wrong conclusions and then spend

  • quite a while researching into an area that doesn't make sense and ends up in failure. Or, if

  • it's something you're presenting to someone else, you can mislead that person, whether intentionally or by accident
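
One way to check a chart like the leaflet is to compare how big the bars look with how big the numbers are. The sketch below is in Python rather than the R used in the video. The vote counts are the ones quoted above; the drawn bar heights are invented for illustration, since the transcript does not give them.

```python
# Vote counts quoted in the leaflet example.
votes = {"party one": 20_000, "party two": 10_000, "party three": 8_000}

# ASSUMPTION: bar heights (arbitrary units) as they might be drawn on the
# leaflet. These values are made up for illustration only.
drawn = {"party one": 100, "party two": 90, "party three": 40}


def exaggeration(reference, other):
    """Ratio of the visual size to the true size of `other` relative to
    `reference`. A value of 1.0 means the bar is drawn to scale."""
    shown = drawn[other] / drawn[reference]
    actual = votes[other] / votes[reference]
    return shown / actual


for party in votes:
    print(party, round(exaggeration("party one", party), 2))
```

With these assumed heights, party two's bar looks almost as tall as party one's even though it has half the votes, so its ratio comes out well above 1.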

  • And that's never a good thing. I'm back in R and I just wanted to show a couple of plots that, you know,

  • aren't misleading necessarily, but you can easily infer the wrong kind of information, right? So

  • there are websites online

  • you can go to to look at the ratings for different TV shows. Now, one of my favorite TV shows is Frasier, right?

  • I think it's amazing and

  • If you go on to these sites and you plot the

  • ratings for all these Frasier episodes, it's all over the place

  • Sometimes it's very highly regarded and sometimes it's not. So I'm just going to plot this

  • using the ggplot tool, and we can see, if we look at the graph, that

  • It's absolutely everywhere. Right? You've got good episodes. You've got bad episodes and it seems to maybe be going slightly downhill towards the end

  • But it's difficult to say right because it's all over the place

  • Now what's actually happened is I've just plotted using a default function and it's auto-scaled my rating axis, right?

  • So my y-axis is the rating of the episodes and it's going between seven and

  • about nine and a half. Now that isn't representative because it's spreading out my data. If I plot the exact same data

  • But this time from naught to ten like an actual rating system

  • You can see that most episodes get almost the exact same rating somewhere between around seven and a half to eight

  • Which I think's pretty good
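
The effect of auto-scaling can be put into numbers by asking how much of the axis the data spread covers. This is a small Python sketch (not the R code used in the video); the ratings are invented, and the two axis ranges are the ones mentioned above. In ggplot2 the fixed range would typically be set with something like `ylim(0, 10)`.

```python
def fraction_of_axis(values, axis_min, axis_max):
    """Share of the axis length covered by the spread of the data."""
    return (max(values) - min(values)) / (axis_max - axis_min)


# ASSUMPTION: made-up episode ratings, for illustration only.
ratings = [7.6, 7.9, 8.1, 7.4, 8.8, 7.7, 7.2, 8.0]

auto_scaled = fraction_of_axis(ratings, 7.0, 9.5)   # axis chosen by default
full_scale = fraction_of_axis(ratings, 0.0, 10.0)   # the real rating scale

print(f"auto-scaled axis: {auto_scaled:.0%} of the height")
print(f"0 to 10 axis:     {full_scale:.0%} of the height")
```

The same spread fills most of the auto-scaled plot but only a thin band of the 0 to 10 plot, which is why the second version looks so much calmer.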

  • I would rate them a 10, but you know

  • It's just me. You can see that if you're not careful,

  • if you do it by accident, even auto-scaling the axes and things like this can cause a real problem. Another classic example you'll see

  • In the news is when they show something like a currency exchange rate

  • So if we look here,

  • I've downloaded some sample data of the Japanese yen versus the US dollar and I've simplified this by

  • extracting just a period of about 60 days in the middle of some time

  • I can't remember exactly what it is

  • If we plot this you can see that actually there's a big sort of cliff edge

  • Something terrible has happened around day 30 and the value of the Japanese yen is just plummeting

  • And of course, this is absolute nonsense, right? Because this scale goes between 108 and 114

  • And so if we plot it with proper axes, you can see that actually it's almost completely flat
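
A quick sanity check for a "cliff edge" like this is the relative change rather than the shape of the line. The Python sketch below uses the two ends of the scale mentioned above (114 and 108) as a worst case; the actual data values are not given in the transcript.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    return (end - start) / start * 100


# Worst case on the plotted scale: falling from the top (114) to the bottom (108).
drop = percent_change(114, 108)
print(f"{drop:.1f}%")
```

A move of roughly five percent fills the whole auto-scaled plot, yet is barely visible on an axis that starts at zero.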

  • If your business relies on the exchange rate of the Japanese yen to the US dollar,

  • obviously these small changes might be important, right? But if you're presenting this in the news,

  • it's very easy to claim that something terrible's happened when in fact, maybe, this is just a normal blip up and down, right? So

  • you can misuse

  • plots to serve your purpose, or you can do it accidentally and waste a huge amount of time

  • Let's have a look at the standard plots

  • you might see, and that you could use on a very basic level, and see, you know,

  • what they're appropriate for, because one of the most important things is that you use these plots and these charts

  • appropriately. All right, so, you know, perhaps the most common one that everyone sees is going to be a bar chart

  • You've got two axes

  • You've got some kind of attributes or labels down here and then you've got some quantity or amount of some attribute here

  • And then you're going to have different bars like this now

  • This is a very nice graph to use. It's simple but it's effective because you can very easily see what the differences between these different

  • levels are, right? So, you know, it's often going to be your go-to graph for lots of things

  • Now some people try and replace this graph with a pie chart, right? This is a bad idea in general

  • I mean

  • I like pie as much as the next person but if you've got different things

  • Like this and one of them is big

  • I mean, you can see that this one's bigger than this one, but how much bigger is it?

  • I don't know

  • You can't see the relative sizes quite so easily. This all gets worse if you combine this into a doughnut plot

  • And then you've got multiple pies embedded in each other none of them align and nothing makes any sense anymore, right?

  • So if in doubt don't use a pie chart, it's a bad idea. I mean they look very nice for presentations

  • That's about all I can say for them. If we're going to be measuring some kind of quantity then a bar chart's going to be

  • what we want, right? But what we might also do is replace quantity with the frequency or the amount of something

  • So this is gonna be frequency. Again, on the bottom here

  • we've got our labels, and these are going to be bins for some single attribute

  • So this is maybe naught to 10, this is maybe 10 to 20 of whatever the thing is

  • And this is the frequency, the amount that fall into that range, and what this allows us to do is work out very easily

  • what the distribution is. Is it normally distributed? Is it bimodally distributed, with two peaks? You know,

  • is it skewed to the left, skewed to the right?

  • We can see very easily the shape of our data and it can be really helpful
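
The binning described above (naught to 10, 10 to 20, and so on) can be sketched in a few lines of Python, again standing in for the R used in the video. The values are invented for illustration.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each [start, start + bin_width) bin.
    Returns a dict mapping the start of each bin to its frequency."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# ASSUMPTION: made-up measurements of a single attribute.
values = [3, 7, 12, 15, 18, 21, 24, 25, 27, 33, 38, 45]

for start, count in histogram(values, 10).items():
    # One '#' per observation gives a rough text version of the plot.
    print(f"{start:>3} to {start + 10:<3} {'#' * count}")
```

Reading down the rows of `#` marks gives the shape of the distribution: one peak, two peaks, or a tail off to one side.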

  • Another way of looking at the shape or the range of our data in particular is a box plot. Now,

  • you'll see box plots come up from time to time in scientific

  • documents, but they're very easy to produce in tools like R and they can be quite useful

  • So here we're gonna have a single attribute

  • So some label again or some attribute here and this is going to be the quantity of this attribute

  • And what a boxplot does is label the range of that data

  • So we're going to have a box here like this and it's going to look a little bit like this

  • So I'll use a different color pen

  • This line in the center is our median, typically, and then this is going to be the third quartile here,

  • and this is going to be the first quartile, and then these are the max and the min. In this one plot

  • we've got the absolute range of our data

  • We've got where 50% of our data is sort of this interquartile range here and we know where the midpoint of our data is
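
The five values a box plot draws, as described above, can be computed directly. This Python sketch uses `statistics.quantiles` with the "inclusive" method, which is one of several common quartile conventions; the data are invented. Note that some plotting tools draw the whiskers differently and mark outliers as separate points, so a real plot may not match these numbers exactly.

```python
import statistics


def five_number_summary(values):
    """Min, first quartile, median, third quartile and max of the data."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# ASSUMPTION: made-up values for a single attribute.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
print("interquartile range:", summary["q3"] - summary["q1"])
```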

  • So we can very easily see whether we've got

  • outliers and we can plot this next to a different attribute and we can have two box plots next to each other and we can

  • see very quickly, you know, a comparison between these two things, so that can be really useful. Now the final ones, right?

  • We're going to be talking about scatter plots and trend lines. All right, so a scatter plot, very simple. We've got two

  • attributes, this is attribute one and this is attribute two, and we want to see how they vary with respect to each other

  • So when one goes up does the other one go up or does it go down? Are they even related at all?

  • So you'll see something like this, and it'll be all over the place often

  • But you can see maybe there's a kind of trend where, as attribute one increases, attribute two increases, right? Now,

  • this is a correlation being shown here, not a causation. So you can't say they're definitely related, but you can say that

  • generally speaking, when one is big, so is the other. That's sometimes useful.
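
The "when one is big, so is the other" idea from the scatter plot can be summarised as a single number. The sketch below computes the Pearson correlation coefficient by hand in Python; the speaker does not name a specific measure, so this choice is an assumption, and the two attributes are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences.
    +1 means they rise together, -1 means one falls as the other rises."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# ASSUMPTION: made-up values for attribute one and attribute two.
attribute_one = [1, 2, 3, 4, 5]
attribute_two = [2, 1, 4, 3, 5]
print(round(pearson(attribute_one, attribute_two), 2))
```

A value near 1 only says the two attributes tend to move together; as noted above, it says nothing about one causing the other.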

  • A trend line is going to be where we're going to be plotting something over time

  • So this has to be a continuous variable, or at least a variable we believe

  • can be inferred between our points, because it's unlikely that you're gonna have all the points

  • So what you might have is a plot where you've got time

  • down here. So maybe time in months, for example

  • And we've got some amount of something and we're just going to plot it like this and we can sort of have a trendline going

  • Like this if it's a situation where we can infer the amount between two time points then this is okay

  • Right because we can say well look we've got a reading here. We've got a reading here

  • It's reasonable to assume that between these two points. This is the amount

  • All right, nothing too funny's gone on between these two points, right?

  • If you can't assume that then you shouldn't really be using a trend line and you probably want to be using a bar graph

  • Does that depend on the kind of data, then? Yes, it'll depend on it

  • This is a judgment call based on the kind of data

  • I mean, time is a good example. We don't tend to measure in infinitely small increments

  • We're going to be measuring daily or hourly or something like this,

  • but we can kind of make an assumption a lot of the time about our readings. Take temperature over time, for example: if

  • you're at 20 and then the next hour you're at 25, you were probably halfway between the two midway between those two times, right?
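
Drawing a straight line between two readings is the same as assuming linear interpolation between them. A minimal Python sketch of that assumption, using the temperature example above (20, then 25 an hour later):

```python
def interpolate(t, t0, y0, t1, y1):
    """Linearly estimate the value at time t, given readings y0 at t0
    and y1 at t1. Only valid if nothing unusual happens in between."""
    if not t0 <= t <= t1:
        raise ValueError("t must lie between the two readings")
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)


# Half an hour after a reading of 20, with 25 measured at the full hour.
print(interpolate(0.5, 0, 20, 1, 25))  # 22.5
```

If an estimate like this would be meaningless for the data, that is a sign the points should not be joined by a line.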

  • It's going to depend on your data

  • I mean a good example would be if you were plotting something like operating system usage per student

  • So we've got OS X here, Linux here, and we've got

  • Windows. This many people use OS X, this many people use Linux, this many people use Windows

  • Well, these are discrete data points. You can't fit a trend line to these. There is no operating system

  • that's 50% between Linux and Windows that I know of, and we can't infer

  • how many students are going to be using it. That makes no sense. That should be a bar chart.
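
For discrete categories like these, each label gets its own bar and nothing is drawn in between. The Python sketch below prints a rough text bar chart; the student counts are invented, since the transcript gives none.

```python
def text_bar_chart(counts):
    """Return one text line per category: label, a bar of '#', and the count."""
    width = max(len(label) for label in counts)
    return [f"{label:<{width}} | {'#' * n} {n}" for label, n in counts.items()]


# ASSUMPTION: made-up numbers of students per operating system.
usage = {"OS X": 12, "Linux": 7, "Windows": 21}

for line in text_bar_chart(usage):
    print(line)
```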

  • So let's look at an actual data set and see how we can use some of this visualization in practice

  • So I've got here a chicken data set and this data set is about

  • Weighing chickens on different diets over a period of weeks and also measuring how many eggs they produced

  • I'm not a farmer, but let's imagine that what we wanted to do was see if one of these

  • Diets produces a better weight gain and maybe more eggs per week. Let's have a look

  • So I'm going to load the chicken data set. This is stored in a CSV

  • Just like before, let's have a quick look at just the first few rows of this data to see what they look like

  • So that's going to be the head function, and you can see we've got six attributes

  • So we've got the week that the measurement was taken; the chicken, in this case chicken number one, but there'll obviously be other chickens;

  • the diet they're on, diet A, diet B or diet C; the age of the chicken in months; the weight of the chicken in kilograms; and the

  • number of eggs they produced that week. All right, so there's going to be lots of combinations of weeks and chickens in this data set
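
Loading a CSV and peeking at the first rows, as done above with R's head function, might look like this in Python. The column names and every row below are hypothetical stand-ins: the transcript names the six attributes but not the file's actual headers or values.

```python
import csv
import io
from itertools import islice

# ASSUMPTION: invented header names and rows that mimic the six attributes
# described above (week, chicken, diet, age, weight, eggs).
SAMPLE_CSV = """week,chicken,diet,age_months,weight_kg,eggs
1,1,B,12,1.8,5
2,1,B,12,1.9,6
1,2,C,14,2.1,4
"""


def head(csv_text, n=6):
    """Return the first n rows of CSV text as a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(islice(reader, n))


for row in head(SAMPLE_CSV, 2):
    print(row)
```

With a real file, `io.StringIO(csv_text)` would be replaced by an opened file handle; the values come back as strings and would need converting before plotting.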