Regression: Crash Course Statistics #32

  • Hi, I'm Adriene Hill and welcome back to Crash Course Statistics.

  • There's something to be said for flexibility.

  • It allows you to adapt to new circumstances.

  • Like a Transformer is a truck, but it can also be an awesome fighting robot.

  • Today we'll introduce you to one of the most flexible statistical tools--the General

  • Linear Model, or GLM.

  • The GLM will allow us to create many different models to help describe the world.

  • The first we'll talk about is The Regression Model.

  • INTRO

  • General Linear Models say that your data can be explained by two things: your model, and

  • some error:

  • First, the model.

  • It usually takes the form Y = mx + b, though in statistics it's more often written as Y = b + mx.

  • Say I want to predict the number of trick-or-treaters I'll get this Halloween by using enrollment

  • numbers from the local middle school.

  • I have to make sure I have enough candy on hand.

  • I expect a baseline of 25 trick-or-treaters.

  • And then for every middle school student, I'll increase the number of trick-or-treaters

  • I expect by 0.01.

  • So this would be my model: expected trick-or-treaters = 25 + 0.01 × (number of middle school students).

  • There were about 1,000 middle school students nearby last year, so based on my model, I

  • predicted that I'd get 35 trick-or-treaters.

  • But reality doesn't always match predictions.

  • When Halloween came around, I got 42, which means that the error in this case was 7.
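As a minimal sketch of this model-plus-error idea in code (the numbers are the ones from the example; the function name is just illustrative):

```python
# Sketch of the trick-or-treater model: data = model + error.
def predicted_trick_or_treaters(middle_schoolers):
    baseline = 25       # expected trick-or-treaters regardless of enrollment
    per_student = 0.01  # extra trick-or-treaters per middle school student
    return baseline + per_student * middle_schoolers

prediction = predicted_trick_or_treaters(1000)  # 25 + 0.01 * 1000 = 35.0
observed = 42                                   # who actually showed up
error = observed - prediction                   # 7.0, the deviation from the model
print(prediction, error)
```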

  • Now, error doesn't mean that something's WRONG, per se.

  • We call it error because it's a deviation from our model.

  • So the data isn't wrong, the model is.

  • And these errors can come from many sources: like variables we didn't account for in

  • our model--including the candy-crazed kindergartners from the elementary school--or just random variation.

  • Models allow us to make inferences--whether it's the number of kids on my doorstep at

  • Halloween, or the number of credit card frauds committed in a year.

  • General Linear Models take the information that data give us and portion it out into

  • two major parts: information that can be accounted for by our model, and information that can't be.

  • There are many types of GLMs; one is linear regression.

  • It can also provide a prediction for our data.

  • But instead of predicting our data using a categorical variable like we do in a t-test,

  • we use a continuous one.

  • For example, we can predict the number of likes a trending YouTube video gets based

  • on the number of comments that it has.

  • Here, the number of comments would be our input variable and the number of likes our

  • output variable.

  • Our model will look something like this: likes = b + m × (number of comments).

  • The first thing we want to do is plot our data from 100 videos:

  • This allows us to check whether we think that the data is best fit by a straight line, and

  • look for outliers--those are points that are really extreme compared to the rest of our data.

  • These two points look pretty far away from our data.

  • So we need to decide how to handle them.

  • We covered outliers in a previous episode, and the same rules apply here.

  • We're trying to catch data that doesn't belong.

  • Since we can't always tell when that's happened, we set a criterion for what an outlier is,

  • and stick to it.

  • One reason that we're concerned with outliers in regression is that values that are really

  • far away from the rest of our data can have an undue influence on the regression line.

  • Without this extreme point, our line would look like this.

  • But with it, like this.

  • That's a lot of difference for one little point!
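Here's a small sketch of that influence on made-up data (none of these numbers come from the video): a single extreme point visibly shifts the fitted slope.

```python
import numpy as np

# Made-up, roughly linear data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 50)
y = 10 + 6.5 * x + rng.normal(0, 30, 50)

slope_without, _ = np.polyfit(x, y, 1)  # fit without the extreme point

# Add one extreme point far from the rest of the data and refit.
x_out = np.append(x, 300)
y_out = np.append(y, 0)
slope_with, _ = np.polyfit(x_out, y_out, 1)

print(slope_without, slope_with)  # the single point drags the slope down
```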

  • There are a lot of different ways to decide, but in this case we're gonna leave them in.

  • One of the assumptions that we make when using linear regression is that the relationship

  • is linear.

  • So if there's some other shape our data takes, we may want to look into some other models.

  • This plot looks linear, so we'll go ahead and fit our regression model.

  • Usually a computer is going to do this part for us, but we want to show you how this line fits.

  • A regression line is the straight line that's as close as possible to all the data points

  • at once.

  • That means that it's the one straight line that minimizes the sum of the squared distance

  • of each point to the line.

  • The blue line is our regression line.

  • Its equation looks like this: likes = 9104 + 6.468 × comments.

  • This number--the y-intercept--tells us how many likes we'd expect a trending video

  • with zero comments to have.

  • Often, the intercept might not make much sense.

  • In this model, it's possible that you could have a video with 0 comments, but a video

  • with 0 comments and 9104 likes does seem to conflict with our experience on YouTube.

  • The slope--aka the coefficient--tells us how much our likes are determined by the number

  • of comments.

  • Our coefficient here is about 6.5, which means that on average, an increase of 1 comment

  • is associated with an increase of about 6.5 likes.
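Since the transcript doesn't include the actual 100-video data set, here's a hedged sketch that fits the same kind of line to simulated stand-in data; `np.polyfit` performs the least-squares fit described above.

```python
import numpy as np

# Stand-in data for the 100 trending videos (simulated, not the real data set).
rng = np.random.default_rng(42)
comments = rng.uniform(0, 5000, 100)
likes = 9104 + 6.5 * comments + rng.normal(0, 3000, 100)

# Least squares: the line minimizing the sum of squared vertical
# distances from each point to the line.
slope, intercept = np.polyfit(comments, likes, 1)
print(f"likes = {intercept:.0f} + {slope:.2f} * comments")
```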

  • But there's another part of the General Linear Model: the error.

  • Before we go any further, let's take a look at these errors--also called residuals.

  • The residual plot looks like this:

  • And we can tell a lot by looking at its shape.

  • We want a pretty evenly spaced cloud of residuals.

  • Ideally, we don't want them to be extreme in some areas and close to 0 in others.

  • It's especially concerning if you can see a weird pattern in your residuals like this:

  • Which would indicate that the error of your predictions is dependent on how big your predictor

  • variable value is.

  • That would be like if our YouTube model was pretty accurate at predicting the number of

  • likes for videos with very few comments, but was wildly inaccurate on videos with a lot

  • of comments.
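A sketch of that residual check on the same simulated stand-in data; comparing residual spread at low and high comment counts is one rough way to spot the pattern described above.

```python
import numpy as np

rng = np.random.default_rng(42)
comments = rng.uniform(0, 5000, 100)
likes = 9104 + 6.5 * comments + rng.normal(0, 3000, 100)

slope, intercept = np.polyfit(comments, likes, 1)
residuals = likes - (intercept + slope * comments)  # observed minus predicted

# For a healthy fit, residual spread shouldn't depend on the predictor:
# compare the low-comment half of the data with the high-comment half.
median = np.median(comments)
print(residuals[comments < median].std(), residuals[comments >= median].std())
```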

  • So, now that we've looked at this error, this is where statistical tests come in.

  • There are actually two common ways to do a Null Hypothesis Significance test on a regression coefficient.

  • Today we'll cover the F-test.

  • The F-test, like the t-test, helps us quantify how well we think our data fit a distribution,

  • like the null distribution.

  • Remember, the general form of many test statistics is this: the difference between what we observed and what we'd expect under the null, divided by the average variation.

  • But I'm going to make one small tweak to the wording of our general formula to help

  • us understand F-tests a little better.

  • The null hypothesis here is that there's NO relationship between the number of comments

  • on a trending YouTube video and the number of likes.

  • IF that were true, we'd expect a kind of blob-y, amorphous-cloud-looking scatter plot

  • and a regression line with a slope of 0.

  • It would mean that the number of comments wouldn't help us predict the number of likes.

  • We'd just predict the mean number of likes no matter how many comments there were.

  • Back to our actual data.

  • This blue line is our observed model.

  • And the red is the model we'd expect if the null hypothesis were true.

  • Let's add some notation so it's easier to read our formulas.

  • Y-hat, written ŷ, represents the predicted value for our outcome variable--here

  • it's the predicted number of likes.

  • Y-bar, written ȳ, represents the mean value of likes in this sample.

  • Taking the squared difference between each data point and the mean line tells us the

  • total variation in our data set.

  • This might look similar to how we calculated variance, because it is.

  • Variance is just this sum of squared deviations--called the Sum of Squares Total--divided by N.

  • And we want to know how much of that total Variation is accounted for by our regression

  • model, and how much is just error.

  • That would allow us to follow the General Linear Model framework and explain our data

  • with two things: the model's prediction, and error.

  • For each point, we can look at the difference between the prediction of our observed

  • model--with its slope of 6.468--and the prediction of the null model--with its slope of 0.

  • And we'll start here with this point:

  • The green line represents the difference between our observed model--which is the blue line--and

  • the model that would occur if the null were true--which is the red line.

  • And we can do this for EVERY point in the data set.

  • We want negative differences and positive differences to count equally, so we square

  • each difference so that they're all positive.

  • Then we add them all up to get part of the numerator of our F-statistic:

  • The numerator has a special name in statistics.

  • It's called the Sums of Squares for Regression, or SSR for short.

  • Like the name suggests, this is the sum of the squared distances between our regression

  • model and the null model.

  • Now we just need a measure of average variation.

  • We already found a measure of the total variation in our sample data, the Total Sums of Squares.

  • And we calculated the variation that's explained by our model.

  • The other portion of the variation should then represent the error, the variation of

  • data points around our model.

  • Shown here in orange.

  • The sum of these squared distances is called the Sums of Squares for Error (SSE).

  • If data points are close to the regression line, then our model is pretty good at predicting

  • outcome values like likes on trending YouTube Videos.

  • And so our SSE will be small.

  • If the data are far from the regression line, then our model isn't too good at predicting

  • outcome values.

  • And our SSE is going to be big.

  • Alright, so now we have all the pieces of our puzzle.

  • Total Sums of Squares, Sums of Squares for Regression, and Sums of Squares for Error:

  • Total Sums of Squares represents ALL the information that we have from our Data on YouTube likes.

  • Sums of Squares for Regression represents the proportion of that information that we

  • can explain using the model we created.

  • And Sums of Squares for Error represents the leftover information--the portion of Total

  • Sums of Squares that the model can't explain.

  • So the Total Sums of Squares is the sum of SSR and SSE: SST = SSR + SSE.
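Here's a sketch of that decomposition on the simulated stand-in data, checking that SST = SSR + SSE (which holds exactly for a least-squares fit):

```python
import numpy as np

rng = np.random.default_rng(42)
comments = rng.uniform(0, 5000, 100)
likes = 9104 + 6.5 * comments + rng.normal(0, 3000, 100)

slope, intercept = np.polyfit(comments, likes, 1)
y_hat = intercept + slope * comments  # model predictions
y_bar = likes.mean()                  # null model prediction: the mean

sst = ((likes - y_bar) ** 2).sum()    # total variation
ssr = ((y_hat - y_bar) ** 2).sum()    # variation explained by the model
sse = ((likes - y_hat) ** 2).sum()    # leftover variation: error

print(np.isclose(sst, ssr + sse))     # True
```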

  • Now we've followed the General Linear Model framework and taken our data and portioned

  • it into two categories: Regression Model, and Error.

  • And now that we have the SSE, our measurement of error, we can finally start to fill in

  • the bottom of our F-statistic.

  • But we're not quite done yet.

  • The last step to getting our F-statistic is to divide each Sums of Squares by its

  • respective degrees of freedom.

  • Remember degrees of freedom represent the amount of independent information that we have.

  • The Sums of Squares for Error has n--the sample size--minus 2 degrees of freedom.

  • We had 100 pieces of independent information from our data, and we used 1 to calculate

  • the y-intercept and 1 to calculate the regression coefficient.

  • So the Sums of Squares for Error has 98 degrees of freedom.

  • The Sums of Squares for Regression has one degree of freedom, because we're using one

  • piece of independent information to estimate our coefficient, the slope.

  • We have to divide each sums of squares by its degrees of freedom because we want to

  • weight each one appropriately.

  • More degrees of freedom mean more information.

  • It's like how you wouldn't be surprised that Katie Mack, who has a PhD in astrophysics,

  • can explain more about the planets than someone taking a high school Physics class.

  • Of course she can; she has way more information.

  • Similarly, we want to make sure to scale the Sums of Squares based on the amount of independent

  • information each has.

  • So we're finally left with this: F = (SSR / 1) / (SSE / 98).

  • And using an F-distribution, we can find our p-value: the probability that we'd get an

  • F-statistic as big as or bigger than 59.613.
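On the simulated stand-in data, the same calculation looks like this (the F and p values it prints come from the simulated data, not the video's 59.613):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
comments = rng.uniform(0, 5000, 100)
likes = 9104 + 6.5 * comments + rng.normal(0, 3000, 100)

slope, intercept = np.polyfit(comments, likes, 1)
y_hat = intercept + slope * comments
ssr = ((y_hat - likes.mean()) ** 2).sum()
sse = ((likes - y_hat) ** 2).sum()

df_regression = 1          # one slope estimated
df_error = len(likes) - 2  # intercept and slope each used one piece of info

f_stat = (ssr / df_regression) / (sse / df_error)
p_value = stats.f.sf(f_stat, df_regression, df_error)  # P(F >= f_stat) under the null
print(f_stat, p_value)
```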

  • Our p-value is super tiny.

  • It's about 0.00000000000099, or 9.9 × 10⁻¹³.

  • With an alpha level of 0.05, we reject the null that there is NO relationship between

  • likes and YouTube comments on trending videos.

  • So we reject the claim that the true coefficient for the relationship between likes and comments on

  • YouTube is 0.

  • The F-statistic allows us to directly compare the amount of variation that our model can

  • and cannot explain.

  • When our model explains a lot of variation, we consider it statistically significant.

  • And it turns out, if we did a t-test on this coefficient, we'd get the exact same p-value.

  • That's because these two methods of hypothesis testing are equivalent; in fact, if you square

  • our t-statistic, you'll get our F-statistic!
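A quick sketch of that equivalence, again on the simulated stand-in data; `scipy.stats.linregress` gives the slope and its standard error, from which the t-statistic comes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
comments = rng.uniform(0, 5000, 100)
likes = 9104 + 6.5 * comments + rng.normal(0, 3000, 100)

fit = stats.linregress(comments, likes)
t_stat = fit.slope / fit.stderr  # t-test on the slope coefficient

# F-statistic from the sums of squares, as before.
y_hat = fit.intercept + fit.slope * comments
ssr = ((y_hat - likes.mean()) ** 2).sum()
sse = ((likes - y_hat) ** 2).sum()
f_stat = (ssr / 1) / (sse / (len(likes) - 2))

print(np.isclose(t_stat ** 2, f_stat))  # True: t squared equals F
```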

  • And we're going to talk more about why F-tests are important later.

  • Regression is a really useful tool to understand.

  • Scientists, economists, and political scientists use it to make discoveries and communicate

  • those discoveries to the public.

  • Regression can be used to model the relationship between increased taxes on cigarettes and

  • the average number of cigarettes people buy.

  • Or to show the relationship between peak-heart-rate-during-exercise and blood pressure.

  • Though we're not able to use regression alone to determine whether one thing causes changes in the other.

  • But more abstractly, we learned today about the General Linear Model framework.

  • What happens in life can be explained by two things: what we know about how the world works,

  • and error--or deviations--from that model.

  • Like say you budgeted $30 for gas and only ended up needing $28 last week.

  • The reality deviated from your guess, and now you get to go to The Blend Den again!

  • Or just how angry your roommate is that you left dishes in the sink can be explained by

  • how many days you left them out, with a little wiggle room for error depending on how your

  • roommate's day was.

  • Alright, thanks for watching, I'll see you next time.
