Name: ANOVA: Crash Course Statistics #33
Uploaded: 2021-01-14T07:55:56.000Z
Duration: 13 min 17 s

Hi, I'm Adriene Hill, and welcome back to
Crash Course Statistics.

In many of our episodes we've looked at
t-tests, which, among other things, are good

for testing the difference between two groups.

Families below the poverty line...and families
above it.

Petri dishes of cells that are treated with
a chemical and those that aren't.

We often want to compare measurements of MORE
than two groups.

Things like ethnicity, medical diagnosis,
country of origin, or job title.

So today, we're going to apply the General
Linear Model Framework we learned in the last episode

to test the difference between multiple groups using a new model called the ANOVA.

The GLM Framework takes all the information
that our data contain, and partitions it into

two piles: information that can be explained
by a model that represents the way we think

things work, and error, which is the amount
of information that our model fails to explain.

So let's apply that to a new model: the
ANOVA.

ANOVA is an acronym for ANalysis Of VAriance.

It's actually very similar to Regression,
except we're using a categorical variable
to predict a continuous one.

Like using a soccer player's position to
predict the number of yards he runs in a game.

Or using highest completed degree to predict
a person's salary. Note that this alone

isn't evidence that getting a degree causes
a higher salary, just that knowing someone's

degree might help estimate how much they get
paid.

Like Regression, the ANOVA builds a model
of how the world works.

For example, my model for how many bunnies
I'll see on my walk into work might be that

if it's raining I'll see 1 bunny, and
if it's sunny, I'll see 5.

1 and 5 are my predictions for how many bunnies I'll see, based on whether or not it's raining.

And we can represent this model as a sort
of Regression where there are ONLY two possible

values that the Variable Weather can have.

In this case, the expected number of bunnies on
a rainy day is 1, and beta is the difference
between the sunny-day and rainy-day predictions, 4.

Which means our ANOVA model looks like this:
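
Written out with the numbers from the bunny example, and with rainy days coded 0 and sunny days coded 1, the model might look like this (the symbol names are assumptions, not the episode's notation):

```latex
\[
\widehat{\text{bunnies}} = 1 + \beta \cdot \text{weather},
\qquad \beta = 5 - 1 = 4,
\qquad \text{weather} \in \{0\ (\text{rainy}),\ 1\ (\text{sunny})\}
\]
```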

In a Regression we did a statistical test
of the slope and that's what this simple
ANOVA does, too.

Since we assigned rainy days to be coded as
0, and sunny days as 1, the change in the
X direction is 1.

So the slope of this line is the difference
between mean bunny count on sunny days, five,

minus mean bunny count on rainy days, one.

This difference of 4 is the change in the
Y direction.

We test this difference in the same way that
we tested the regression slope.

And this slope tells us the difference between
the means of the two groups.
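
To see that the regression slope really is the difference between the two group means, here is a minimal sketch. The bunny counts are invented for illustration (they are not data from the episode), chosen so the rainy-day mean is 1 and the sunny-day mean is 5.

```python
# Hypothetical bunny counts, invented for illustration (not data from the episode).
rainy = [0, 1, 2, 1, 1]   # weather coded 0, mean 1
sunny = [4, 6, 5, 5, 5]   # weather coded 1, mean 5

x = [0] * len(rainy) + [1] * len(sunny)   # the 0/1 weather code
y = rainy + sunny                         # bunnies seen


def mean(values):
    return sum(values) / len(values)


x_bar, y_bar = mean(x), mean(y)

# Ordinary least-squares slope and intercept for a single predictor
slope = (
    sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    / sum((xi - x_bar) ** 2 for xi in x)
)
intercept = y_bar - slope * x_bar

mean_difference = mean(sunny) - mean(rainy)
print(slope, intercept, mean_difference)  # prints 4.0 1.0 4.0
```

The intercept comes out as the rainy-day mean and the slope as the gap between the two means, which is why testing the slope and testing the difference in means amount to the same thing here.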

Usually we like to think of this slope
as the difference between two group means.

But knowing that our model treats it like
a slope helps us understand how ANOVAs relate
to regression.

In a regression the slope tells you how much
an increase of one unit in X affects Y.

Like, for example, how much an increase of
1 year increases shoe size in kids.

Our ANOVA does the same thing: it looks at how much
an increase from 0 (rainy days) to 1 (non-rainy days)
affects the number of bunnies we see.

Let's look at the ratings of various chocolate
bars based on the type of cocoa bean used.

We'll use a dataset you can find at Kaggle.com
courtesy of Brady Brelinski.

Our three groups are chocolate bars made with Criollo beans, Forastero beans, or Trinitario beans.

Chocolate making is complex, so we took a
small sample of bars that only contained 1
of these types of bean.

And the chocolate taster used a scale--with
5 as the highest score --transcending beyond
the ordinary limits--and 1 as the lowest, “mostly unpalatable.”

But is there really “mostly unpalatable” chocolate out there?

We want to know if the type of bean affects
our taster's ratings.

Like Regression, we can calculate a Sums of
Squares Total by adding up the squared differences

between each chocolate rating, and the overall
mean chocolate rating.

This gives us our Sums of Squares Total, or
SST.

If that sounds like how we calculated variance,
that's because it is!

This Sum represents the total amount of variation,
or information, in the data.
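
As a rough sketch of that calculation: the ratings below are made up for illustration (they are not the Kaggle data), and the variable names are assumptions. Dividing the total by n - 1 gives the usual sample variance, which is the link to variance mentioned above.

```python
import statistics

# Hypothetical ratings, invented for illustration (not the Kaggle data).
ratings = {
    "Criollo":    [2.75, 3.0, 3.25, 3.0],
    "Forastero":  [3.0, 3.25, 3.5, 3.25],
    "Trinitario": [3.25, 3.5, 3.75, 3.5],
}

all_ratings = [r for group in ratings.values() for r in group]
grand_mean = statistics.fmean(all_ratings)

# Sums of Squares Total: squared distance of every rating from the overall mean
sst = sum((r - grand_mean) ** 2 for r in all_ratings)

# Dividing by n - 1 turns the sum into the sample variance
sample_variance = sst / (len(all_ratings) - 1)
print(sst, sample_variance, statistics.variance(all_ratings))
```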

Now, we need to partition this variation.

When we previously used a simple linear regression
model, we partitioned this variation into

two parts: Sums of Squares for Regression,
and Sums of Squares for Error.

The first step is to figure out how much of
the variation is explained by our model.

In an ANOVA--what we're using here--our
best guess of a chocolate bar's rating is
the mean rating of its group.

For bars made with Criollo beans, that's 3.1; Forastero
beans, 3.25; and Trinitario beans, 3.27.

So we sum up the squared distances between
each point's group mean and the overall mean.

This is called our Model Sums of Squares (or SSM) because it's the variation our model explains.
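
A minimal sketch of the Model Sums of Squares: every bar is predicted to score its group's mean rating, and we add up how far those predictions sit from the overall mean. The ratings are made up for illustration (not the Kaggle data), and the names are assumptions.

```python
import statistics

# Hypothetical ratings, invented for illustration (not the Kaggle data).
ratings = {
    "Criollo":    [2.75, 3.0, 3.25, 3.0],
    "Forastero":  [3.0, 3.25, 3.5, 3.25],
    "Trinitario": [3.25, 3.5, 3.75, 3.5],
}

all_ratings = [r for group in ratings.values() for r in group]
grand_mean = statistics.fmean(all_ratings)

# Each bar contributes (its group mean - overall mean) squared,
# so a group of n bars contributes n times that squared distance.
ssm = sum(
    len(group) * (statistics.fmean(group) - grand_mean) ** 2
    for group in ratings.values()
)
print(ssm)  # 0.5 for these made-up ratings
```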

So now we have the amount of variation
explained by the model.

In other words, how much variation is accounted
for if we just assumed each rating value were
its group mean.

We're also going to need the amount of variation
that it DOESN'T explain.

In other words, how much ratings vary within
each group of Cacao beans.

So, we can sum up the squared differences
between each data point and its group mean

to get our Sums of Squares for Error: the
amount of information that our model doesn't explain.
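
Here is a small sketch of the error pile, along with a check that the two piles add back up to the total. The ratings are again made up for illustration (not the Kaggle data), and the names are assumptions.

```python
import math
import statistics

# Hypothetical ratings, invented for illustration (not the Kaggle data).
ratings = {
    "Criollo":    [2.75, 3.0, 3.25, 3.0],
    "Forastero":  [3.0, 3.25, 3.5, 3.25],
    "Trinitario": [3.25, 3.5, 3.75, 3.5],
}

all_ratings = [r for group in ratings.values() for r in group]
grand_mean = statistics.fmean(all_ratings)

# Total variation: every rating against the overall mean
sst = sum((r - grand_mean) ** 2 for r in all_ratings)

# Explained variation: every group mean against the overall mean
ssm = sum(
    len(group) * (statistics.fmean(group) - grand_mean) ** 2
    for group in ratings.values()
)

# Sums of Squares for Error: every rating against its own group mean
sse = sum(
    (r - statistics.fmean(group)) ** 2
    for group in ratings.values()
    for r in group
)

# The explained and unexplained piles should sum to the total
partition_holds = math.isclose(sst, ssm + sse)
print(sst, ssm, sse, partition_holds)
```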

Now that we have that information, we can
calculate our F-statistic, just like we did
for regression.

The F-statistic compares how much variation
our model accounts for vs. how much it can't
account for.

The larger that F is, the more information
our model is able to give us about our chocolate
bar ratings.

Again, SSM is the variation our model explains
and SSE is the variation it doesn't explain.

But we also need to account for the amount
of independent information that each one uses.

So, we divide each Sums of Squares by its
degrees of freedom.

Our ANOVA model has 2 degrees of freedom.
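
Putting the pieces together, a sketch of the F-statistic might look like this. The ratings are made up for illustration (not the Kaggle data). The model's degrees of freedom are the number of groups minus 1, which gives the 2 mentioned above. The error degrees of freedom, the number of bars minus the number of groups, are the standard choice for a one-way ANOVA but are an assumption here, since this part of the transcript doesn't state them.

```python
import statistics

# Hypothetical ratings, invented for illustration (not the Kaggle data).
ratings = {
    "Criollo":    [2.75, 3.0, 3.25, 3.0],
    "Forastero":  [3.0, 3.25, 3.5, 3.25],
    "Trinitario": [3.25, 3.5, 3.75, 3.5],
}

all_ratings = [r for group in ratings.values() for r in group]
grand_mean = statistics.fmean(all_ratings)

# Variation the model explains (group means vs. overall mean)
ssm = sum(
    len(group) * (statistics.fmean(group) - grand_mean) ** 2
    for group in ratings.values()
)
# Variation the model doesn't explain (ratings vs. their own group mean)
sse = sum(
    (r - statistics.fmean(group)) ** 2
    for group in ratings.values()
    for r in group
)

df_model = len(ratings) - 1                  # 3 groups - 1 = 2
df_error = len(all_ratings) - len(ratings)   # 12 bars - 3 groups = 9 (assumed standard formula)

# Each Sums of Squares is divided by its degrees of freedom before comparing
f_statistic = (ssm / df_model) / (sse / df_error)
print(df_model, df_error, f_statistic)
```

With these invented ratings the ratio works out to about 6, meaning the variation between bean types is roughly six times the variation within them, per degree of freedom.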
