
• Hi, I'm Adriene Hill and welcome back to Crash Course Statistics.

• There's something to be said for flexibility.

• It allows you to adapt to new circumstances.

• Like a Transformer is a truck, but it can also be an awesome fighting robot.

• Today we'll introduce you to one of the most flexible statistical tools--the General

• Linear Model, or GLM.

• The GLM will allow us to create many different models to help describe the world.

• The first we'll talk about is The Regression Model.

• INTRO

• General Linear Models say that your data can be explained by two things: your model, and

• some error:

• First, the model.

• It usually takes the form y = mx + b--or rather, y = b + mx, as it's usually written in statistics.

• Say I want to predict the number of trick-or-treaters I'll get this Halloween by using enrollment

• numbers from the local middle school.

• I have to make sure I have enough candy on hand.

• I expect a baseline of 25 trick-or-treaters.

• And then for every middle school student, I'll increase the number of trick-or-treaters

• I expect by 0.01.

• So this would be my model:

• There were about 1,000 middle school students nearby last year, so based on my model, I

• predicted that I'd get 35 trick-or-treaters.

• But reality doesn't always match predictions.

• When Halloween came around, I got 42, which means that the error in this case was 7.
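As a quick sketch of this model-plus-error idea (using the baseline of 25 and the rate of 0.01 from the example; the `predict` helper is just an illustrative name):

```python
# Hypothetical trick-or-treater model from the example:
# predicted = baseline + rate * middle_school_enrollment
baseline = 25   # expected trick-or-treaters with no students nearby
rate = 0.01     # extra trick-or-treaters per middle school student

def predict(enrollment):
    return baseline + rate * enrollment

predicted = predict(1000)      # 25 + 0.01 * 1000 = 35.0
observed = 42                  # what actually showed up
error = observed - predicted   # the part the model didn't explain
print(predicted, error)        # prints 35.0 7.0
```

The error term here is just "observed minus predicted"--the same residual idea the rest of the episode builds on.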

• Now, error doesn't mean that something's WRONG, per se.

• We call it error because it's a deviation from our model.

• So the data isn't wrong, the model is.

• And these errors can come from many sources: like variables we didn't account for in

• our model--including the candy-crazed kindergartners from the elementary school--or just random variation.

• Models allow us to make inferences --whether it's the number of kids on my doorstep at

• Halloween, or the number of credit card frauds committed in a year.

• General Linear Models take the information that data give us and portion it out into

• two major parts: information that can be accounted for by our model, and information that can't be.

• There are many types of GLMs; one is linear regression,

• which can also provide a prediction for our data.

• But instead of predicting our data using a categorical variable like we do in a t-test,

• we use a continuous one.

• For example, we can predict the number of likes a trending YouTube video gets based

• on the number of comments that it has.

• Here, the number of comments would be our input variable and the number of likes our

• output variable.

• Our model will look something like this:

• The first thing we want to do is plot our data from 100 videos:

• This allows us to check whether we think that the data is best fit by a straight line, and

• look for outliers--those are points that are really extreme compared to the rest of our data.

• These two points look pretty far away from our data.

• So we need to decide how to handle them.

• We covered outliers in a previous episode, and the same rules apply here.

• We're trying to catch data that doesn't belong.

• Since we can't always tell when that happened, we set a criterion for what an outlier is,

• and stick to it.

• One reason that we're concerned with outliers in regression is that values that are really

• far away from the rest of our data can have an undue influence on the regression line.

• Without this extreme point, our line would look like this.

• But with it, like this.

• That's a lot of difference for one little point!

• There are a lot of different ways to decide, but in this case we're gonna leave them in.

• One of the assumptions that we make when using linear regression, is that the relationship

• is linear.

• So if there's some other shape our data takes, we may want to look into some other models.

• This plot looks linear, so we'll go ahead and fit our regression model.

• Usually a computer is going to do this part for us, but we want to show you how this line fits.

• A regression line is the straight line that's as close as possible to all the data points

• at once.

• That means that it's the one straight line that minimizes the sum of the squared distance

• of each point to the line.

• The blue line is our regression line.

• Its equation looks like this:

• This number--the y-intercept--tells us how many likes we'd expect a trending video

• with zero comments to have.

• Often, the intercept might not make much sense.

• In this model, it's possible that you could have a video with 0 comments, but a video

• with 0 comments and 9104 likes does seem to conflict with our experience on YouTube.

• The slope--aka the coefficient--tells us how much our likes are determined by the number

• of comments.

• Our coefficient here is about 6.5, which means that on average, an increase in 1 comment

• is associated with an increase of about 6.5 likes.
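The fitting step itself can be sketched in a few lines using the textbook least-squares formulas. This is a minimal illustration--the comment and like counts below are made up, not the episode's actual trending-video sample:

```python
# Least-squares fit: slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
def fit_line(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Made-up comment and like counts for illustration
comments = [10, 50, 120, 300, 800]
likes = [9000, 9400, 9900, 11000, 14300]
intercept, slope = fit_line(comments, likes)
```

This is exactly the line that minimizes the sum of squared vertical distances from the points; software packages do the same computation (with more numerical care) behind the scenes.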

• But there's another part of the General Linear Model: the error.

• Before we go any further, let's take a look at these errors--also called residuals.

• The residual plot looks like this:

• And we can tell a lot by looking at its shape.

• We want a pretty evenly spaced cloud of residuals.

• Ideally, we don't want them to be extreme in some areas and close to 0 in others.

• It's especially concerning if you can see a weird pattern in your residuals like this:

• Which would indicate that the error of your predictions is dependent on how big your predictor

• variable value is.

• That would be like if our YouTube model was pretty accurate at predicting the number of

• likes for videos with very few comments, but was wildly inaccurate on videos with a lot

• of comments.

• So, now that we've looked at this error, this is where statistical tests come in.

• There are actually two common ways to do a Null Hypothesis Significance test on a regression coefficient.

• Today we'll cover the F-test.

• The F-test, like the t-test, helps us quantify how well we think our data fit a distribution,

• like the null distribution.

• Remember, the general form of many test statistics is this:

• But I'm going to make one small tweak to the wording of our general formula to help

• us understand F-tests a little better.

• The null hypothesis here is that there's NO relationship between the number of comments

• on a trending YouTube video and the number of likes.

• IF that were true, we'd expect a kind of blob-y, amorphous-cloud-looking scatter plot

• and a regression line with a slope of 0.

• It would mean that the number of comments wouldn't help us predict the number of likes.

• We'd just predict the mean number of likes no matter how many comments there were.

• Back to our actual data.

• This blue line is our observed model.

• And the red is the model we'd expect if the null hypothesis were true.

• Let's add some notation so it's easier to read our formulas.

• Y-hat looks like this, and it represents the predicted value for our outcome variable--here

• it's the predicted number of likes.

• Y-bar looks like this, and it represents the mean value of likes in this sample.

• Taking the squared difference between each data point and the mean line tells us the

• total variation in our data set.

• This might look similar to how we calculated variance, because it is.

• Variance is just this sum of squared deviations--called the Sum of Squares Total--divided by N.

• And we want to know how much of that total Variation is accounted for by our regression

• model, and how much is just error.

• That would allow us to follow the General Linear Model framework and explain our data

• with two things: the model's prediction, and error.

• We can look at the difference between our observed slope coefficient--6.468--and the

• one we'd expect if there were no relationship--0, for each point.

• And we'll start here with this point:

• The green line represents the difference between our observed model--which is the blue line--and

• the model that would occur if the null were true--which is the red line.

• And we can do this for EVERY point in the data set.

• We want negative differences and positive differences to count equally, so we square

• each difference so that they're all positive.

• Then we add them all up to get part of the numerator of our F-statistic:

• The numerator has a special name in statistics.

• It's called the Sums of Squares for Regression, or SSR for short.

• Like the name suggests, this is the sum of the squared distances between our regression

• model and the null model.

• Now we just need a measure of average variation.

• We already found a measure of the total variation in our sample data, the Total Sums of Squares.

• And we calculated the variation that's explained by our model.

• The other portion of the variation should then represent the error, the variation of

• data points around our model.

• Shown here in Orange.

• The sum of these squared distances is called the Sums of Squares for Error (SSE).

• If data points are close to the regression line, then our model is pretty good at predicting

• outcome values like likes on trending YouTube Videos.

• And so our SSE will be small.

• If the data are far from the regression line, then our model isn't too good at predicting

• outcome values.

• And our SSE is going to be big.

• Alright, so now we have all the pieces of our puzzle.

• Total Sums of Squares, Sums of Squares for Regression, and Sums of Squares for Error:

• Total Sums of Squares represents ALL the information that we have from our Data on YouTube likes.

• Sums of Squares for Regression represents the proportion of that information that we

• can explain using the model we created.

• And Sums of Squares for Error represents the leftover information--the portion of Total

• Sums of Squares that the model can't explain.

• So the Total Sums of Squares is the sum of SSR and SSE.
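That identity can be checked numerically. Here's a minimal sketch (the four data points are made up): for any least-squares line, the total variation splits exactly into the part the model explains and the leftover error.

```python
def sums_of_squares(x, y):
    # Fit by least squares, then split total variation into model + error parts.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    y_hat = [intercept + slope * xi for xi in x]           # model predictions
    sst = sum((yi - my) ** 2 for yi in y)                  # total variation
    ssr = sum((yh - my) ** 2 for yh in y_hat)              # explained by the model
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # leftover error
    return sst, ssr, sse

sst, ssr, sse = sums_of_squares([1, 2, 3, 4], [2, 1, 4, 3])
# For any least-squares fit, sst == ssr + sse (up to rounding)
```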

• Now we've followed the General Linear Model framework and taken our data and portioned

• it into two categories: Regression Model, and Error.

• And now that we have the SSE, our measurement of error, we can finally start to fill in

• the Bottom of our F-statistic.

• But we're not quite done yet.

• The last and final step to getting our F-statistic is to divide each Sums of Squares by their

• respective Degrees of freedom.

• Remember degrees of freedom represent the amount of independent information that we have.

• The Sums of Squares for Error has n--the sample size--minus 2 degrees of freedom.

• We had 100 pieces of independent information from our data, and we used 1 to calculate

• the y-intercept and 1 to calculate the regression coefficient.

• So the Sums of Squares for Error has 98 degrees of freedom.

• The Sums of Squares for Regression has one degree of freedom, because we're using one

• piece of independent information to estimate our coefficient--our slope.

• We have to divide each sums of squares by its degrees of freedom because we want to

• weight each one appropriately.

• More degrees of freedom mean more information.

• It's like how you wouldn't be surprised that Katie Mack, who has a PhD in astrophysics,

• can explain more about the planets than someone taking a high school Physics class.

• Of course she can--she has way more information.

• Similarly, we want to make sure to scale the Sums of Squares based on the amount of independent

• information each has.

• So we're finally left with this:

• And using an F-distribution, we can find our p-value: the probability that we'd get an

• F-statistic as big or bigger than 59.613.

• Our p-value is super tiny.

• It's about 0.00000000000099.

• With an alpha level of 0.05, we reject the null that there is NO relationship between

• likes and YouTube comments on trending videos.

• So we reject that true coefficient for the relationship between likes and comments on

• YouTube is 0.

• The F-statistic allows us to directly compare the amount of variation that our model can

• and cannot explain.

• When our model explains a lot of variation, we consider it statistically significant.

• And it turns out, if we did a t-test on this coefficient, we'd get the exact same p-value.

• That's because these two methods of hypothesis testing are equivalent, in fact if you square

• our t-statistic, you'll get our F-statistic!
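Here's a sketch of both statistics on a tiny made-up data set (the helper name `f_and_t` and the numbers are ours, not from the episode). Squaring the slope's t-statistic reproduces the F-statistic exactly:

```python
import math

def f_and_t(x, y):
    # Fit the regression, then compute the F-statistic and the slope's t-statistic.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    y_hat = [intercept + slope * xi for xi in x]
    ssr = sum((yh - my) ** 2 for yh in y_hat)              # 1 degree of freedom
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # n - 2 degrees of freedom
    mse = sse / (n - 2)                                    # average leftover variation
    f_stat = (ssr / 1) / mse
    t_stat = slope / math.sqrt(mse / sxx)   # slope divided by its standard error
    return f_stat, t_stat

f_stat, t_stat = f_and_t([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])
# t_stat ** 2 equals f_stat (up to rounding)
```

To turn the F-statistic into a p-value you'd look it up in an F-distribution with (1, n − 2) degrees of freedom, which is what statistical software does for you.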

• And we're going to talk more about why F-tests are important later.

• Regression is a really useful tool to understand.

• Scientists, economists, and political scientists use it to make discoveries and communicate

• those discoveries to the public.

• Regression can be used to model the relationship between increased taxes on cigarettes and

• the average number of cigarettes people buy.

• Or to show the relationship between peak-heart-rate-during-exercise and blood pressure.

• Note that we're not able to use regression alone to determine whether one thing causes changes in another.

• But more abstractly, we learned today about the General Linear Model framework.

• What happens in life can be explained by two things: what we know about how the world works,

• and error--or deviations--from that model.

• Like say you budgeted \$30 for gas and only ended up needing \$28 last week.

• The reality deviated from your guess and now you get to go to The Blend Den again!

• Or just how angry your roommate is that you left dishes in the sink can be explained by

• how many days you left them out with a little wiggle room for error depending on how your

• roommate's day was.

• Alright, thanks for watching, I'll see you next time.


# Regression: Crash Course Statistics #32

Subtitles published by 林宜悉 on January 14, 2021