Data Analysis 6: Principal Component Analysis (PCA) - Computerphile

  • Principal component analysis is perhaps the most widely used data reduction technique on the planet

  • Everyone uses it but here's the thing. It doesn't actually do data reduction

  • Principal component analysis is the idea of trying to find a different view for our data in which we can separate it better

  • And I'll show an example on this piece of paper

  • And the idea is that what we want to try and do is reframe our data

  • Maybe move it around so that we can better separate things out, better cluster things, perhaps make it better for machine learning

  • Now as a side effect of this

  • in PCA, we also order our

  • Axes by the most to least useful in some sense. So then we can perform a separate data reduction

  • Technique later by taking the slightly less useful axes away, or in this case

  • Dimensions or attributes of our data. PCA is commonly pitched as a data reduction technique; actually, it's a data transformation technique

  • It just makes our data more amenable to reduction later

  • So let's imagine we have some attributes and we know that some are correlated and some are not correlated

  • The problem is that maybe we don't want to just delete some of the attributes. Maybe

  • a 0.65 correlation is, I mean,

  • it's good

  • But it doesn't mean we definitely want to delete attribute two and keep only attribute one. On the other hand, maybe

  • We do need to reduce the number of dimensions we've got, or maybe we just want to try and make our data

  • More amenable to things like clustering. So let's look at a quick example

  • Typically PCA is done over many dimensions when we've got lots and lots of attributes

  • I'm just going to show two because obviously it starts to break down when I try and draw that many on the page

  • So if we have two attributes

  • And what we want to try and do is work out what the contribution of each of these is to our data set

  • Which of these is useful which of these is not useful and now obviously if we had many dimensions, you know

  • Seven hundred ten thousand we can still apply the same technique. So maybe we have some data that's like this

  • We have some data points over here and perhaps we have a little gap and maybe some data over here and in general

  • Our data is kind of increasing like this. So this means that attribute one and attribute two are positively correlated to some extent

  • but maybe the correlation is not so strong that we just want to delete attribute two. What we want to try and do is

  • Transform our data in a way where these are more useful. Imagine that you've got some data that looks a bit like this

  • But if we rotate our data

  • We take a different view we can see there's actually two

  • objects and then we can separate them out, and maybe if you were to rotate again you could see there were four objects and

  • so on. This is the idea: what PCA is going to do is find new axes for this data

  • That separate it better. For PCA to work, what we would start by doing is standardizing our data

  • So all of our dimensions, attribute one, attribute two, all of the attributes, are going to be centered around zero

  • And they're going to have a standard deviation of one

  • PCA will not work really at all if you have widely different scales for your data

  • So what we want to try and do is find a direction or an axis through these two attributes

  • That separates out our data better than the individual attributes

  • do. Let's see how this data looks just from attribute one

  • If we trace down this way

  • you can see that it's sort of got this amount of spread in attribute one and they're kind of

  • dotted around like this, and they sort of go all the way along like this, so you can't really see anything in here

  • Meaningful about these two groups, right?

  • And of course the more dimensions you have the more this could be a problem

  • Similarly for attribute two: if we trace along here it goes from this range to this range

  • This is the variance of attribute two, like the range, and we can see that roughly speaking

  • The data is as spread out in attribute one as it is in attribute two; that spread is about the same, and both of them

  • are kind of useful for looking at the data, but not really, because again,

  • We have an equal distribution of points all the way along here. So that's not hugely useful

  • All right. So if we look at just attribute one, that's not hugely helpful

  • If we look at just attribute two, that's not hugely helpful either. So what can we do?

  • well

  • what we want to try and do is find a new axis like some new attribute that fits through this data like this and can

  • Really separate everything out, because the spread of this data is actually diagonal in some sense, not this way or this way

  • So what principal component analysis is going to do is find this principal component, this axis through our data like this

  • Such that when we look at the spread of a data, it's maximized, right?

  • So the data is as spread out as we can find it

  • And this is going to happen over any number of attributes

  • So attribute one here,

  • attribute two, attribute three, attribute four, all the way to attribute n, when we've got maybe 700 or 800 or

  • a thousand. So at the moment we're just fitting one principal component; this is one line through our two-dimensional data

  • There's going to be more principal components later, right?

  • But what we want to do is we want to pick the direction through this data

  • However many attributes it has, that has the most spread. So how do we measure this?

  • There's really two goals, which are exactly the same: one is to maximize the variance

  • So we find a direction for this line such that these points at the very edge are farthest apart

  • The other one is that we minimize the error

  • so we take this error from here, this distance, this distance from all these points to our new axis, and we minimize it, so

  • You can imagine if we do this for all our points we can get the sum of the squared

  • distances from these points to this line and then as we move this line around

  • Sometimes it's going to be better

  • Sometimes it's not. If we have a line that goes like this

  • Some of these lines are going to be very large like this and that's going to be a higher amount of error

  • So what we'll find is that if we do this our first principal component will sit through whichever direction in the data

  • minimizes these distances and by definition

  • Maximizes this spread which makes this axis super useful if we use this axis now as our new X and we rotate this whole page

  • All our data is lovely and separated. And actually we have two distinct clusters in this data set, right?

  • So that's what we're going to do
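
A rough sketch of that idea in R (illustrative only, not code from the video): two groups that overlap when you look at either attribute on its own, but come apart cleanly along the diagonal direction that prcomp() finds as PC1.

```r
# Toy illustration (not the video's data): two groups strung out along a diagonal.
set.seed(1)
position <- c(runif(100, -3.0, -0.4),   # group A, positions along the diagonal
              runif(100,  0.4,  3.0))   # group B, with a gap between the groups
noise    <- rnorm(200, sd = 0.3)        # small spread perpendicular to the diagonal
toy <- cbind(attribute1 = position + noise,
             attribute2 = position - noise)

toy.scaled <- scale(toy)        # standardize: each attribute to mean 0, sd 1
toy.pca    <- prcomp(toy.scaled)

summary(toy.pca)                # PC1 carries most of the variance
par(mfrow = c(1, 2))
hist(toy.scaled[, 1], breaks = 30, main = "Attribute 1 alone")   # groups blur together
hist(toy.pca$x[, 1],  breaks = 30, main = "Scores on PC1")       # groups separate
```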

  • Now as I mentioned PCA doesn't typically reduce the number of attributes from two to one just like that

  • We're going to have another principal component which represents the second most variance, orthogonally, so at ninety degrees

  • So that's going to be this one here

  • We find the first principal component which maximizes variance, and then we find the next one along that maximizes variance in the next direction

  • Now if there were multiple dimensions we'd keep applying this process

  • We keep finding new axes for our data that systematically show more and more of a spread of our data

  • But crucially we're ordering this by the amount of variance that they represent

  • So this is PC one or principal component one. This is principal component two and

  • Principal component one is always going to have the most

  • varied data in it, principal component two the next most, three the next most, all the way to the end with the least

  • so a natural

  • Side-effect of this process is that we're going to have new axes through our data

  • Which, and we're going to have the same number of axes as there are original dimensions in our data

  • But they're going to get less and less useful in terms of the variance of our data as we go forward

  • So PC one is going to be the most important

  • most of our data is spread out across

  • PC1, PC2 a little bit less spread out, PC3 a little bit less still, all the way down to PC n, all the way

  • Down here if you wanted to perform dimensionality reduction because you felt you had too many dimensions to your data

  • you could just for example keep the first 10 principal components project your data into that space and

  • Still retain most of the information

  • we won't go into the mathematics of how to calculate these principal components because you can find out very easily online, and R has a

  • Lovely function to do it for us

  • I wanted to focus on intuitively what PCA does. But how we will actually project these points onto these new

  • axes and rotate the whole thing is

  • Each of these principal components is going to be a weighted sum of all the attributes

  • So for example PC one is going to be some amount of attribute one

  • Added to some amount of attribute two. Now in this case, because it sort of goes off at sort of 45 degrees

  • It's going to be about the same but you could imagine if your data was like this

  • It'll be mostly attribute one and a little bit of attribute two if it was like this

  • It'll be mostly attribute two and a little bit of attribute one

  • All right. Now, of course, with n-dimensional data, where we have many more dimensions than I can draw on the page

  • The principle is exactly the same: some amount of attribute one, attribute two, attribute three, and so on all the way to the end

  • Right and that's going to project our points straight onto this line through that data
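
Continuing the toy sketch above (again illustrative, not from the video), the projection really is just that weighted sum, and doing it for every point at once is a matrix multiplication:

```r
# Each principal component is a weighted sum of the standardized attributes.
weights.pc1 <- toy.pca$rotation[, 1]        # PC1's weight for each attribute

point <- toy.scaled[1, ]                    # one standardized data point
score.by.hand <- sum(point * weights.pc1)   # so much of attribute 1 + so much of attribute 2

# prcomp has already computed the same score for that point.
all.equal(score.by.hand, toy.pca$x[1, 1])   # TRUE (up to floating point)

# The whole data set at once: one matrix multiplication.
scores.all <- toy.scaled %*% toy.pca$rotation
```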

  • So when we talk about minimizing the error you can imagine

  • Rotating this about the center of these points here like this

  • And as you do this these red lines are going to change in length

  • And it's going to settle on the very center line where these lengths are minimized

  • Right and as it happens that also maximizes the variance of these points here because of the fact that this mathematics is based around

  • eigenvectors and eigenvalues

  • PC2 is always going to come out orthogonal, or in this case at 90 degrees, to PC1. Now

  • This is true of however many dimensions you've got. Every single new axis that appears, or new vector,

  • A new principal component, is going to come out orthogonal to the ones before

  • Until you run out of dimensions and you can't do it anymore

  • We've already reached the most we can fit in on this two-dimensional plane

  • We've got one here and we've got another one orthogonal to it

  • There are no other lines I can draw for that to be true

  • Right, but obviously if we had more attributes, that would be the case

  • so the reason that it's so important to scale your data appropriately is that you're trying to find the direction for your data that

  • Maximizes the variance now, if one of your dimensions is much much bigger than the other of course

  • That one is the one that's going to maximize the variance

  • if you've got salary that's between naught and

  • 10,000 and all your others are between naught and 1

  • Then your first principal component is going to be predominantly salary, because that's the most important thing as far as it

  • knows. This is why it's so important to standardize your data first
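
A small sketch of why the scaling matters (made-up columns; the names are just placeholders): without standardizing, the first principal component is essentially the large-scale column on its own.

```r
# Made-up columns: salary ranges to 10,000 while the others sit between 0 and 1.
set.seed(2)
mixed <- cbind(salary       = runif(200, 0, 10000),
               danceability = runif(200, 0, 1),
               loudness     = runif(200, 0, 1))

# Unscaled: PC1's weights are essentially all on salary.
round(prcomp(mixed, scale. = FALSE)$rotation[, 1], 3)

# Standardized first: no attribute dominates purely because of its units.
round(prcomp(mixed, scale. = TRUE)$rotation[, 1], 3)
```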

  • We're going to continue to use our music data set for this video. Now, for those of you who've

  • forgotten, this data set is a set of music files that are freely available online

  • Where we've got the metadata of genres or titles for different tracks, and then for those tracks

  • We've also calculated some features about the actual audio for example

  • Temporal features, how loud they are,

  • how fast the music is, how upbeat it is, whether you could dance to it, this kind of thing

  • Apparently dance ability is a measurable trait

  • Apparently these features have been generated by two different libraries, one's called librosa

  • which is freely available online, and the other was Echo Nest

  • which are the features at the core of Spotify and how it does its music recommender system and its playlists

  • So let's load the data set

  • So I'm going to read it in. It takes quite a long time to load

  • It'd probably be faster if it wasn't in a CSV. You've got to remember, if your files are in CSV

  • you've got to actually parse them all and work out whether they're numerical or text, you know, for every cell. Okay, so we've got

  • 13,000 instances or rows in our data and we've got

  • 751 attributes or dimensions to our data. So these are going to include features from both librosa

  • and Echo Nest and the other metadata of these tracks
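
The loading step might look something like this in R; the file name here is a placeholder, not the actual path used in the video.

```r
# "music_metadata.csv" is a placeholder file name -- use the real path.
music <- read.csv("music_metadata.csv", header = TRUE)

dim(music)   # roughly 13,000 rows (tracks) by 751 columns (metadata + audio features)
```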

  • So we're going to select just the echo nest features for this part

  • Just because it's a little bit easier to have fewer dimensions to look at. This would work just as well

  • On all the other features as long as they're numeric

  • So we're gonna select echonest equal to the music data frame, all of the rows and just the Echo Nest columns

  • Which are 528 to the end and then we're going to standardize all this data now

  • So we're going to Center it around 0 a mean of 0 and a standard deviation of 1 using the scale function

  • That'll take a minute to finish, and then we can just check to make sure that our

  • variance and our mean are what we expected. So we're going to apply over dimension two

  • so that's over all the columns the

  • Variance function and find out what the variances are and you can see they're all one, which is exactly what we want

  • So let's have a look at the mean. The mean should be centered about 0

  • It won't be exactly 0, just because of, you know, floating point errors and so on. So there we go: 1.5

  • times 10 to the minus 17, very very small, right, close enough to 0, perfectly fine
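
In R, the selection and standardization step could look like the following sketch (the column range 528 onwards is as described in the video; the variable names are my own):

```r
# Every row, and only the Echo Nest feature columns (528 through to the last one).
echonest <- music[, 528:ncol(music)]

# Standardize so every column has mean 0 and standard deviation 1.
echonest.scaled <- scale(echonest)

# Sanity checks.
apply(echonest.scaled, 2, var)   # variances: all 1
colMeans(echonest.scaled)        # means: tiny values of order 1e-17, i.e. effectively 0
```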

  • So the function we're going to use is the prcomp function in R

  • This is going to perform principal component analysis for those of you who are interested in learning more

  • What it's going to do is create a covariance matrix, and then it's going to use singular value decomposition

  • To find the eigenvectors and the eigenvalues and those are the things that actually we want from a PCA

  • So we're gonna run that now, it doesn't take too long. But this is still quite a large data set

  • This will slow down quite a lot if you had a very very large data set, but it still might be worthwhile
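
The call itself is a one-liner; pca here is just my name for the result object.

```r
# Principal component analysis of the standardized Echo Nest features.
# Under the hood prcomp uses singular value decomposition to get the
# eigenvectors (pca$rotation) and the spread along each of them (pca$sdev).
pca <- prcomp(echonest.scaled)
```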

  • What it's done is it's found the directions through our data that maximize the variance and it's projected our data into that

  • Space, or transformed our data into that space. At the moment, the dimensionality of our data is exactly the same and completely unchanged. No

  • Dimensionality reduction has happened yet. So let's perform a quick summary

  • There'll be a lot of stuff on the screen, but I'll point towards what's important

  • So what it's doing is it's showing us the list of all the components

  • their standard deviations, so the spread in that direction

  • and

  • also how much of the variance it accounts for

  • You can imagine that

  • Let's imagine the spread of your data in all the different dimensions is this much, but in one direction it's just this much

  • What is the percentage of the spread or the variance that that principal component accounts for, right?

  • This is very easily quantified so you can see in here

  • We've got the proportion of variance for PC one is naught point one one six nine

  • which is about eleven point six percent so out of all the

  • 224 echo nest features

  • This weighted sum in principal component one, or this direction through our data

  • represents eleven percent of the spread, which is not too bad. Actually, I think that's pretty good

  • With principal component two, that's another eight percent

  • So the cumulative proportion of these two principal components is going to be twenty percent, and at principal component three twenty-five percent, and so on

  • So what we're saying is by the time we get to principal component three if we represent our data

  • in this three-dimensional space around these axes PC1, PC2, PC3

  • We're getting 25 percent of the spread of the data

  • That we had before but that's three dimensions instead of two hundred and twenty four dimensions. So that's not too bad

  • Now one important thing to look out for is where our spread starts to get towards a hundred percent

  • Where is it in our data set that we can say, you know,

  • that these later dimensions, these later principal components, are not really adding anything to our

  • Data set so we scroll down and we'll find here at 95 percent

  • scroll down a bit further 98 percent 98 percent and

  • Here we go. Principal component

  • 133 the cumulative proportion of variance explained by all of those ones from 1 all the way to

  • 133 is 99 percent if you're going to perform dimensionality reduction

  • Stopping at 99 percent of the variance is very common
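
Rather than scrolling through the printed summary, the same numbers can be pulled out directly (a sketch, assuming the pca object from the prcomp call above):

```r
summary(pca)   # standard deviation, proportion of variance, cumulative proportion

# The same quantities, computed directly so we can query them.
prop.var <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
cum.var  <- cumsum(prop.var)               # running total

round(head(prop.var), 3)            # PC1 ~0.117, PC2 ~0.08, ...
cutoff <- which(cum.var >= 0.99)[1]
cutoff                              # around 133 for this data set
```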

  • What we're saying is we can delete any of our data from principal component

  • 134 all the way to the end and we're still getting 99 percent of the spread or

  • Information from our dataset if you want to use PCA for data reduction

  • Then what you're going to have to do is decide what your cutoff is going to be. Now, 99 percent is a good

  • number to use. What does that actually mean?

  • What it means is, if we plotted the different principal components going this way, and the amount of variance

  • that they explain, like the amount of the spread of the data that they're responsible for, they're going to decrease like this

  • I mean this is going to be a bar chart actually

  • Right so like this so principal component one is always going to be the most variance explained because that's how the mathematics works

  • These are ordered in that way principal component 2 is less three is less four is less and so on

  • we're going to keep going down until

  • 99% of the variance has been explained by some bar, and we can remove everything else. That's what we're going to do

  • So 99% is one option ninety-five percent something like this
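
That bar chart of variance explained per component, usually called a scree plot, can be drawn from the quantities computed above (sketch only):

```r
# Scree plot: variance explained per component, from most to least.
bar.pos <- barplot(prop.var,
                   xlab = "Principal component",
                   ylab = "Proportion of variance explained")

# Mark the component where the running total first reaches 99%.
abline(v = bar.pos[cutoff], col = "red", lty = 2)
```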

  • Any number of principal components that you remove is going to delete some of your data, equivalent to removing dimensions

  • But because they're ordered in this way from the most useful to the least useful

  • It just makes that job a little bit easier. Instead of saying it was temporal feature five that we didn't want

  • actually

  • We're saying, in these axes, principal component one and principal component two, it's

  • principal component one hundred and thirty-four that is not that useful to us

  • Let's have a look at one of our principal components and see what it is. So we're going to type PCA

  • followed by rotation, and we're going to select just the first one because otherwise it's going to be too much information

  • so this is going to be how much of each of our

  • 224 dimensions does pc1 need to create this weighted sum and project our data so you can see for example

  • It's minus naught point naught one of temporal feature two-two, minus naught point naught two of temporal feature two-three

  • one thing to remember about these is these are now

  • Arbitrary axes through some massively high-dimensional space, very difficult to know exactly what this means, right?

  • You can start to look into based on these weights, which of these features is more useful, but that's kind of a second

  • second step you can use. So for example, temporal feature naught is naught point naught two eight, so we're going to take

  • naught point naught two eight times temporal feature naught

  • Whatever that value is times like this much of the next one times by this much of the next one

  • I'm gonna add them all up and that is a projection of our data point into this new space

  • So we can do this for our entire data set as it happens

  • R calculates this for us, but you could calculate this using a matrix multiplication if you wanted. So these are all our points

  • Transformed into this new space. So hopefully we can see them better
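
A sketch of inspecting the loadings and checking that the projection is just a matrix multiplication, using the pca object and the standardized matrix from earlier:

```r
# The weights (loadings) that PC1 applies to each standardized feature.
head(pca$rotation[, 1])

# Project one track by hand: a weighted sum of its standardized features...
sum(echonest.scaled[1, ] * pca$rotation[, 1])

# ...and the whole data set at once. prcomp already stores this in pca$x.
projected <- echonest.scaled %*% pca$rotation
all.equal(projected, pca$x, check.attributes = FALSE)   # TRUE (up to floating point)
```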

  • Then we're going to start plotting

  • Different genres of music in these principal components to see if the separation is any better than it was before

  • So let's have a quick look. So this is a scatter plot of principal component 1 vs

  • principal component 2 and every single song in our data set and you can see it's a bit of a higgledy-piggledy mess and it would

  • Be because there's some 13,000 songs here

  • But you can see that maybe some of these songs are grouped over here and some of these over here

  • Maybe let's just look at a few genres to sort of narrow it down and make our figure a little bit clearer

  • So I'm going to select just the rock, electronic and classical genres. I don't know, they seem like they'd be slightly different

  • So let's run that so we're going to take just those genres

  • We're gonna plot them in the same scatter plot and see where they are in this space
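
Something like the following would produce that plot; note that the genre column name and the exact genre labels are assumptions about the data set, not taken from the video.

```r
# Scores of every track on the first two principal components.
scores <- as.data.frame(pca$x[, 1:2])

# Assumed: the original data frame has a genre column with these labels.
scores$genre <- music$genre
keep <- scores$genre %in% c("Rock", "Electronic", "Classical")
genre.kept <- factor(scores$genre[keep])

plot(scores$PC1[keep], scores$PC2[keep],
     col = genre.kept, pch = 20,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(genre.kept),
       col = seq_along(levels(genre.kept)), pch = 20)
```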

  • So how did this data get into this form? What happened was for every individual track

  • We had a number of features in our 224 dimensional space each of these

  • principal components is a weighted sum. So for example, for, let's say, track 516 we'd have taken temporal feature 1

  • multiplied it by part of principal component 1 the loadings

  • Added that to the next bit to the next bit to the next bit and worked out where it sits in terms of principal component

  • 1 this new axis, we'd have done the same for principal component 2 and that puts them down here

  • Now there's quite a lot of overlap

  • But you can start to see we're teasing apart the electronic music from the rock music the rock music's sitting over here on the right

  • The electronic music's sitting on the lower left, and the classical music is up the top here

  • Now these axes don't mean you know

  • That music's faster or slower or more or less upbeat, because without looking into the weightings and loadings for these

  • principal components, it's impossible to say for sure

  • but what we can say is that they're starting to come apart and there are some differences in our data set the fact that they

  • Still overlap means that probably two dimensions is not enough to satisfactorily

  • Separate out all these things. If you wanted to pass this projected and transformed data into a machine learning algorithm

  • You'd probably need to pass in more than two dimensions

  • And in this case, given that 90% or 99% of the variance is explained after principal component

  • 133 those

  • 133 dimensions are probably what you'd use
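
If you did settle on that cutoff, the reduced data set is simply the first columns of the score matrix (a sketch, reusing the cutoff computed earlier):

```r
# Keep only the components needed for ~99% of the variance (cutoff was ~133 here).
reduced <- pca$x[, 1:cutoff]
dim(reduced)   # same number of tracks, far fewer columns than the original 224
```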

  • You can actually use the entire output of PCA, the same number of dimensions you had before, to just show a better

  • Rotated version of your data to a machine learning algorithm

  • You don't have to remove any dimensions if you don't want to, but because the dimensions are ordered

  • from most variance to least, you can kind of get a good gauge for where you should cut off

  • And remove data that way. This kind of data reduction, along with the ones we looked at before, is going to form part of this

  • data

  • cleaning, data

  • transformation and data reduction approach that we're going to iterate through until our data is as small as we can get it and

  • we can extract as much knowledge as possible in the easiest way

  • once we're done with this, our data will be ready for clustering, for machine learning, for

  • classification, for regression, for anything else that we want to do

  • Today we're going to talk about clustering

  • Do you ever find when you're on YouTube you'll watch a video on something and then suddenly you're being recommended a load of other videos

  • That you hadn't even heard of that are actually kind of similar. This happens to me. I watch some video
