StatQuest: Principal Component Analysis (PCA), Step-by-Step

  • StatQuest breaks it down into bite-sized pieces, hooray!

  • Hello, I'm Josh Starmer, and welcome to StatQuest. In this StatQuest,

  • we're going to go through principal component analysis (PCA) one step at a time, using singular value decomposition

  • (SVD).

  • You'll learn about what PCA does

  • How it does it and how to use it to get deeper insight into your data

  • Let's start with a simple dataset

  • We've measured the transcription of two genes, gene 1 and gene 2, in 6 different mice.

  • Note if you're not into mice and genes

  • think of the mice as individual samples and

  • The genes as variables that we measure for each sample

  • For example the samples could be students in high school and the variables could be test scores in math and reading

  • Or the samples could be businesses, and the variables could be market capitalization and the number of employees

  • Okay, now we're back to mice and genes, because I'm a geneticist and I work in a genetics department. If

  • we only measure one gene, we can plot the data on a number line.

  • Mice 1, 2 & 3 have relatively high values, and

  • mice 4, 5 & 6 have relatively low values.

  • Even though it's a simple graph it shows us that mice 1 2 & 3 are more similar to each other than they are to

  • mice 4 5 & 6

  • If we measure two genes, then we can plot the data on a two-dimensional XY graph.

  • Gene 1 is the x-axis and spans one of the two dimensions in this graph

  • Gene 2 is the y-axis and spans the other dimension.

  • we can see that mice 1 2 & 3 cluster on the right side and

  • mice 4 5 & 6 cluster on the lower left-hand side

  • If we measured three genes, we would add another axis to the graph and make it look 3D, i.e.

  • three-dimensional.

  • The smaller dots have larger values for gene three and are further away

  • The larger dots have smaller values for gene three and are closer

  • If we measured four genes, however, we could no longer plot the data;

  • four genes require four dimensions.

  • Wah wah!

  • So we're going to talk about how PCA can take four or more gene measurements, and thus

  • four or more dimensions of data, and make a two-dimensional PCA plot.

  • This plot will show us that similar mice cluster together

  • We'll also talk about how PCA can tell us which gene (or variable) is the most valuable for clustering the data.

  • For example PCA might tell us that gene 3 is responsible for separating samples along the x axis

  • Lastly, we'll talk about how PCA can tell us how accurate the 2D graph is.

  • To understand what PCA does and how it works let's go back to the dataset that only had two genes

  • We'll start by plotting the data

  • Then we'll calculate the average measurement for gene 1 and

  • the average measurement for gene 2

  • With the average values we can calculate the center of the data

  • From this point on, we'll focus on what happens in the graph; we no longer need the original data.

  • Now we'll shift the data so that the center is on top of the origin in the graph

  • Note shifting the data did not change how the data points are positioned relative to each other

  • this point is still the highest one and

  • This is still the rightmost point

  • Etc
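
To make the centering step concrete, here is a minimal Python/NumPy sketch; the gene measurements are hypothetical values, not the ones from the video.

```python
import numpy as np

# Hypothetical measurements: rows are genes, columns are the 6 mice.
data = np.array([[10.0, 11.0,  8.0, 3.0, 2.0, 1.0],   # gene 1
                 [ 6.0,  4.0,  5.0, 3.0, 2.8, 1.0]])  # gene 2

# The average measurement for each gene gives the center of the data.
center = data.mean(axis=1, keepdims=True)

# Shift the data so the center sits on top of the origin.
# The points' positions relative to each other are unchanged.
centered = data - center
print(centered.mean(axis=1))  # ~[0, 0]
```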

  • Now that the data are centered on the origin

  • we can try to fit a line to it. To do this,

  • we start by drawing a random line that goes through the origin.

  • Then we rotate the line until it fits the data as well as it can given that it has to go through the origin

  • Ultimately this line fits best

  • But I'm getting ahead of myself. First we need to talk about how PCA decides whether a fit is good or not.

  • So let's go back to the original random line that goes through the origin

  • To quantify how well this line fits the data, PCA projects the data onto it,

  • and then it can either measure the distances from the data to the line and try to find the line that minimizes those distances,

  • or it can try to find the line that maximizes the distances from the projected points to the origin.

  • If those options don't seem equivalent to you

  • We can build intuition by looking at how these distances shrink when the line fits better

  • While these distances get larger when the line fits better

  • Now to understand what is going on in a mathematical way, let's just consider one data point

  • This point is fixed, and so is its distance from the origin. In

  • other words, the distance from the point to the origin

  • doesn't change when the red dotted line rotates.

  • When we project the point onto the line

  • We get a right angle between the black dotted line and the red dotted line

  • That means that if we label the sides like this: a,

  • b and c,

  • then we can use the Pythagorean theorem to show how b and c are inversely related.

  • Since a and thus a squared doesn't change

  • if b gets bigger,

  • then c must get smaller.

  • Likewise, if c gets bigger, then b must get smaller.

  • Thus PCA can either minimize the distance to the line or

  • Maximize the distance from the projected point to the origin

  • The reason I'm making such a fuss about this is that

  • intuitively, it makes sense to minimize b, the distance from the point to the line,

  • but it's actually easier to calculate c, the distance from the projected point to the origin.

  • So PCA finds the best-fitting line by

  • maximizing the sum of the squared distances from the projected points to the origin.
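
In symbols, with a the fixed distance from the point to the origin, b the distance from the point to the line, and c the distance from the projected point to the origin, the argument is just:

```latex
a^2 = b^2 + c^2 \quad\Longrightarrow\quad c^2 = a^2 - b^2
```

Since a^2 doesn't change as the line rotates, minimizing b^2 and maximizing c^2 pick out the same line.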

  • So for this line

  • PCA projects the data onto it and

  • Then measures the distance from this point to the origin

  • Let's call it d1.

  • Note: I'm going to keep track of the distances as we measure them up here, and

  • then PCA measures the distance from this point to the origin. We'll call that d2.

  • Then it measures d3

  • d4

  • d5 and d6

  • Here are all six distances that we measured

  • The next thing we do is square all of them.

  • The distances are squared so that negative values don't cancel out positive values

  • Then we sum up all these squared distances, and that equals the sum of the squared distances.

  • For short, we'll call this SS(distances), the sum of squared distances.

  • Now we rotate the line

  • project the data onto the line and

  • Then sum up the squared distances from the projected points to the origin

  • And we repeat until we end up with the line with the largest sum of squared

  • distances between the projected points and the origin.

  • Ultimately, we end up with this line; it has the largest sum of squared distances.

  • This line is called principal component 1, or PC1 for short.
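
Here is a brute-force sketch of that search, continuing the hypothetical `centered` data from above: rotate a candidate direction through the origin, project the points onto it, and keep the direction with the largest sum of squared distances. (SVD finds this direction in closed form; the loop is only to illustrate the idea.)

```python
import numpy as np

best_ss, best_dir = -np.inf, None
for theta in np.linspace(0.0, np.pi, 1800, endpoint=False):
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector through the origin
    # The distance from each projected point to the origin is the
    # dot product of that data point with the unit vector.
    projected = direction @ centered     # centered is the 2 x 6 array from above
    ss = np.sum(projected ** 2)          # SS(distances) for this candidate line
    if ss > best_ss:
        best_ss, best_dir = ss, direction

print("PC1 direction:", best_dir)
print("SS(distances) for PC1 (the eigenvalue):", best_ss)
```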

  • PC1 has a slope of

  • 0.25. In

  • other words, for every 4 units that we go out along the gene 1 axis,

  • we go up 1 unit along the gene 2 axis.

  • That means that the data are mostly spread out along the gene 1 axis, and

  • only a little bit spread out along the gene 2 axis.

  • One way to think about PC one is in terms of a cocktail recipe

  • to make PC1,

  • mix 4 parts gene 1

  • with 1 part gene 2,

  • Pour over ice and serve

  • The ratio of gene 1 to gene 2

  • Tells you that gene 1 is more important when it comes to describing how the data are spread out

  • Oh no, terminology alert!

  • Mathematicians call this cocktail recipe a linear combination of genes 1 & 2. I

  • mention this because when someone says PC 1 is a linear combination of variables

  • This is what they're talking about

  • It's no big deal

  • The recipe for PC1, going over 4 and up 1, gets us to this point.

  • We can solve for the length of the red line using the Pythagorean theorem, the old a squared

  • equals b squared

  • plus c squared.

  • Plugging in the numbers gives us a = 4.12.

  • So the length of the red line is 4.12.

  • When you do PCA with SVD, the recipe for PC1 is scaled so that this length equals 1.

  • All we have to do to scale the triangle so that the red line is 1 unit long is to divide each side by

  • 4.12.

  • For those of you keeping score

  • Here's the math worked out that shows that all we need to do is divide all three sides by 4.12.

  • Here are the scaled values

  • The new values change our recipe,

  • but the ratio is the same: we still use four times as much gene 1 as gene 2.
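
The scaling is one line of arithmetic; a quick sketch:

```python
import numpy as np

recipe = np.array([4.0, 1.0])      # 4 parts gene 1, 1 part gene 2
length = np.linalg.norm(recipe)    # sqrt(4**2 + 1**2) ~= 4.12
unit = recipe / length             # ~[0.97, 0.242]: same ratio, but 1 unit long
print(length, unit)
```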

  • So now we are back to looking at the data

  • the best-fitting line, and the unit vector that we just calculated. Oh

  • no, another terminology alert! This one-unit-long vector,

  • consisting of

  • 0.97 parts gene 1 and

  • 0.242 parts gene 2, is called the singular vector, or the eigenvector, for PC1, and

  • the proportions of each gene are called loading scores

  • Also while I'm at it

  • PCA calls the sum of the squared distances

  • for the best-fitting line the eigenvalue for PC1,

  • and the square root of the eigenvalue for PC1 is called the singular value for PC1.

  • BAM! That's a lot of terminology.
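
With NumPy, all of these quantities fall out of a single SVD call on the centered data. A sketch, again using the hypothetical `centered` array from above (transposed so that rows are samples and columns are genes, the usual layout):

```python
import numpy as np

X = centered.T                                   # 6 mice x 2 genes, centered
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print("singular vector (eigenvector) for PC1:", Vt[0])  # the loading scores
print("singular value for PC1:", s[0])
print("eigenvalue for PC1:", s[0] ** 2)          # SS(distances) for the best-fitting line
```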

  • Now that we've got PC1 all figured out, let's work on PC2.

  • Because this is only a two-dimensional graph,

  • PC2 is simply the line through the origin that is perpendicular to PC1, without any further

  • optimization that has to be done.

  • And this means that the recipe for PC2 is -1 parts gene 1 to 4 parts gene 2.

  • If we scale everything so that we get a unit vector, the recipe is

  • -0.242 parts gene 1 and 0.97 parts gene 2;

  • this is the singular vector for PC2, or the eigenvector for PC2.

  • These are the loading scores for PC2;

  • they tell us that, in terms of how the values are projected onto PC2,

  • gene 2 is four times as important as gene 1.

  • Lastly, the eigenvalue for PC2 is the sum of the squared distances between the projected points and the origin.

  • Hooray, we've worked out PC1 & PC2!

  • To draw the final PCA plot we simply rotate everything so that PC one is horizontal

  • Then we use the projected points to find where the samples go in the PCA plot

  • For example, these projected points correspond to sample six,

  • So sample six goes here

  • sample two goes here and

  • Sample one goes here etc

  • Double BAM!! That's how PCA is done using singular value decomposition.

  • Okay one last thing before we dive into a slightly more complicated example

  • Remember the eigenvalues?

  • We got those by projecting the data onto the principal components

  • Measuring the distances to the origin then squaring and adding them together

  • We can convert them into variation around the origin by dividing by the sample size minus one

  • for the sake of this example

  • imagine that the variation for PC1 equals 15 and the variation for PC2 equals 3.

  • That means that the total variation around both PCs is 15 + 3 = 18, and

  • that means PC1 accounts for 15/18

  • = 0.83, or 83%, of the total variation around the PCs.

  • PC2 accounts for 3/18 = 17% of the total variation around the PCs. Oh

  • no, another terminology alert! A scree plot is a graphical representation of the percentages of

  • variation that each PC accounts for.

  • We'll talk more about scree plots later.

  • BAM
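
Here is the same arithmetic in code, using the example numbers from the video; the eigenvalues are hypothetical, chosen so that dividing by the sample size minus 1 gives variations of 15 and 3.

```python
import numpy as np

n = 6                                        # sample size (6 mice)
eigenvalues = np.array([75.0, 15.0])         # hypothetical SS(distances) for PC1 and PC2

variation = eigenvalues / (n - 1)            # [15, 3]: variation around the origin
percent = 100 * variation / variation.sum()  # [83.3, 16.7]: heights of the scree plot bars
print(variation, percent)
```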

  • Okay now let's quickly go through a slightly more complicated example

  • PCA with three variables (in this case, that means three genes) is pretty much the same as with two variables.

  • You center the data,

  • You then find the best fitting line that goes through the origin

  • Just like before, the best-fitting line is PC1.

  • But the recipe for PC1 now has three ingredients. In

  • this case, gene 3 is the most important ingredient for PC1.

  • You then find PC2, the next best-fitting line,

  • given that it goes through the origin and is perpendicular to PC1.

  • Here's the recipe for PC2.

  • In this case, gene 1 is the most important ingredient for PC2.

  • Lastly we find

  • PC3, the best-fitting line that goes through the origin and is perpendicular to PC1 & PC2.

  • If we had more genes, we'd just keep on finding more and more principal components by adding

  • perpendicular lines and rotating them.

  • In

  • theory, there is one PC per gene (or variable), but in practice, the number of PCs is either the number of variables or

  • the number of samples, whichever is smaller.

  • If this is confusing don't sweat it

  • It's not super important, and I'm going to make a separate video on this topic in the next week

  • Once you have all the principal components figured out, you can use the eigenvalues (i.e. the sums of the squared distances)

  • to determine the proportion of variation that each PC accounts for. In

  • this case, PC1 accounts for 79% of the variation,

  • PC2 accounts for 15% of the variation, and

  • PC3 accounts for 6% of the variation.

  • Here's the scree plot

  • PC1 & PC2 account for the vast majority of the variation.

  • That means that a 2D graph, using just PC1 & PC2,

  • would be a good approximation of this 3D graph, since it would account for

  • 94% of the variation in the data.

  • To convert the 3D graph into a two-dimensional PCA graph,

  • we just strip away everything but the data and PC1 & PC2,

  • then project the samples onto PC1 &

  • PC2.

  • Then we rotate so that PC1 is horizontal and PC2 is vertical; this just makes it easier to look at.

  • Since these projected points correspond to sample 4,

  • This is where sample four goes on our new PCA plot

  • Etc etc etc

  • Double BAM!!
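
Putting the whole three-gene pipeline together: a sketch with made-up data that centers the measurements, runs SVD, checks the scree percentages, and projects the samples onto PC1 & PC2 to get the coordinates for the final 2-D PCA plot.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                       # hypothetical: 6 mice x 3 genes
X = X - X.mean(axis=0)                            # center each gene

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt are PC1, PC2, PC3

eigenvalues = s ** 2                              # SS(distances) per PC
percent = 100 * eigenvalues / eigenvalues.sum()   # scree plot percentages
print("percent variation per PC:", percent)

# Each mouse's position in the final 2-D PCA plot:
pca_coords = X @ Vt[:2].T                         # project onto PC1 & PC2
print(pca_coords)
```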

  • To review: we started with an awkward 3D graph that was kind of hard to read.

  • Then we calculated the principal components

  • Then, with the eigenvalues for PC1 & PC2,

  • we determined that a 2D graph would still be very informative.

  • Lastly, we used PC1 & PC2 to draw a two-dimensional graph with the data.

  • If we measured four genes per mouse, we would not be able to draw a four-dimensional graph of the data.

  • Wah wah!

  • But that doesn't stop us from doing the PCA math,

  • which doesn't care whether we can draw a picture of it or not, and looking at the scree plot. In

  • this case,

  • PC1 & PC2 account for 90% of the variation, so we can just use those to draw a two-dimensional PCA graph.

  • So we project the samples onto the first two PCs.

  • These two projected points correspond to sample two

  • So sample two goes here

  • BAM

  • Note: if the scree plot looked like this, where PC3 and PC4 account for a substantial

  • amount of variation, then just using the first two PCs would not create a very accurate

  • representation of the data.

  • Wah wah!

  • However even a noisy PCA plot like this can be used to identify clusters of data

  • These samples are still more similar to each other than they are to the other samples

  • Little bam

  • Hooray, we've made it to the end of another exciting StatQuest! If you liked this StatQuest and want to see more, please subscribe.

  • And if you want to support StatQuest, please consider buying one or two of my original songs.

  • The link to my Bandcamp page is in the lower right corner and in the description below

  • Alright, until next time, quest on!
