
  • Hi, I'm Adriene Hill, and Welcome back to Crash Course Statistics.

  • Last week I ordered a pair of gold lamé pants with DFTBAQ embroidered on them.

  • The delivery guy said they could come by the next day at exactly 11am on the dot!

  • Just kidding. That never happens.

  • Instead of an exact time, the pants guy gave me a range of times...he said they'd be

  • there sometime between 8am and 2pm.

  • A lot of anticipation

  • We've focused a lot on point estimates, like the mean, which are our best guesses,

  • but we can give ourselves a little more wiggle room.

  • Let's talk about Confidence Intervals.

  • INTRO

  • It's useful to give pregnant mothers a “due date” when their children will most likely be born.

  • But it might be more accurate to say that doctors expect the baby to come around the

  • due date, not exactly on it.

  • And...pollsters might claim that a candidate will get around 30% of the vote, plus or minus 2%.

  • We can represent the “around” part with a confidence interval.

  • You may have seen the term “confidence interval” paired with a percentage like 95%.

  • A “confidence interval” is an estimated range of values that seem reasonable based

  • on what we've observed.

  • Its center is still the sample mean, but we've got some room on either side for our uncertainty.

  • So when the delivery guy says my pants are coming between 8 and 2--he's reflecting

  • his uncertainty...the very LARGE frustrating uncertainty, about when he'll be there.

  • For example, a dentist thinks the mean number of cavities a person has in a 5-year

  • span is greater than 1 and wants to calculate a 95% CI to see if there's evidence that he's right.

  • He rounds up a random sample of 100 patients from around the country, and finds that this

  • group has a mean of 3 cavities with a standard deviation of 0.5 cavities.

  • The way we choose that confidence range is related to the distribution of sample means.

  • The dentist's estimate of the sampling distribution looks like this:

  • And instead of grabbing just the mean, the dentist can include a range of the most common

  • 95% of the sample means that we expect from this estimate of the distribution of sample means.

  • So now we have a 95% confidence interval from 2.902 to 3.098 cavities.
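
As a rough check on the dentist's numbers, here is a minimal Python sketch of that calculation. The function name `z_confidence_interval` is just illustrative, and the sketch assumes the distribution of sample means is approximately normal, so a z-score can be used.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sample_sd, n, level=0.95):
    """Return (lower, upper) bounds of a z-based confidence interval for a mean."""
    standard_error = sample_sd / sqrt(n)
    # Two-sided interval: leave (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


# Dentist example from the transcript: mean 3, SD 0.5, n = 100.
lower, upper = z_confidence_interval(3, 0.5, 100)
print(f"{lower:.3f} to {upper:.3f}")  # 2.902 to 3.098
```

The standard error here is 0.5 / sqrt(100) = 0.05 cavities, which is why the interval is so narrow.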

  • Giving a range of numbers instead of just an estimate for the mean better represents

  • the fact that there's some uncertainty and variation when we estimate population parameters--like

  • the mean, proportion, or regression slope--from a sample.

  • The interpretation of this confidence interval is a bit more complex.

  • To understand what a confidence interval really is, we have to ask ourselves “what if?”.

  • If the dentist's sample was taken again, we wouldn't expect that the mean and standard

  • deviation of cavities would be exactly 3 and 0.5.

  • They'd probably be a little different.

  • Which means that our 95% confidence interval would be different than the one we got before.

  • And if we did it 100 more times with the same sample size, we'd get 100 slightly different

  • confidence intervals.

  • The 95% in a 95% confidence interval tells us that if we calculated a confidence interval

  • from 100 different samples, about 95 of them would contain the true population mean.

  • Our “confidence” is in the fact that the procedure of calculating this confidence interval

  • will only exclude the population mean 5% of the time.

  • That definition implies that it's possible that the confidence interval that we created

  • doesn't include the true population mean.

  • We have no way of knowing for sure.

  • But the confidence intervals usually contain the true population mean.
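
One way to see this "what if we sampled again?" idea is to simulate it. The sketch below is only an illustration: it invents a population whose true mean is known (normal, mean 3, SD 0.5, which is an assumption and not something from the video), draws many samples of 100, and counts how often the 95% interval captures the true mean.

```python
import random
from math import sqrt
from statistics import mean, stdev

TRUE_MEAN, TRUE_SD = 3.0, 0.5    # hypothetical population, chosen for illustration
N, REPEATS, Z = 100, 1000, 1.96  # sample size, number of repeated samples, z cutoff

rng = random.Random(20)  # fixed seed so the run is repeatable
hits = 0
for _ in range(REPEATS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    sample_mean = mean(sample)
    standard_error = stdev(sample) / sqrt(N)
    # Does this sample's interval contain the true population mean?
    if sample_mean - Z * standard_error <= TRUE_MEAN <= sample_mean + Z * standard_error:
        hits += 1

coverage = hits / REPEATS
print(f"{coverage:.1%} of the intervals contained the true mean")
```

The printed share should land near 95%, though any single interval either contains the true mean or it doesn't.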

  • Now that we know what a confidence interval is, it might be useful to calculate it.

  • A 95% CI is the range that contains the middle 95% of the values of our estimated sampling distribution.

  • And to get that range, we can use a z-score.

  • A z-score tells us the distance between the mean of a distribution and a data point in

  • standard deviations.

  • Previously, we've used z-scores to help us find percentiles.

  • And we want the middle 95% of the data.

  • So we want our cutoffs to be at the 2.5th percentile and the 97.5th percentile so that

  • 95% of the values are within our range, and 5%--2.5% on either side--are not.

  • To calculate the 95% confidence interval for a sample of 49 chocolate cakes with a mean

  • of 3,000 calories and a standard deviation of 500 calories, we can use a z-score of 1.96

  • (which we got from a table) to calculate the 97.5th percentile, and a z-score of -1.96

  • to calculate the 2.5th percentile.

  • But we need to turn our z-scores back into calorie values.

  • To do so, we multiply by the standard error, 71.4 calories and add the mean of 3,000 calories

  • to get the 95% confidence interval for our sample.
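
The conversion from z-scores back to calories can be sketched in a few lines of Python. The variable names are illustrative; the numbers are the ones given for the chocolate cake sample.

```python
from math import sqrt

sample_mean = 3000   # calories
sample_sd = 500      # calories
n = 49               # cakes in the sample

standard_error = sample_sd / sqrt(n)   # 500 / 7, about 71.4 calories
z_low, z_high = -1.96, 1.96            # 2.5th and 97.5th percentiles

# raw value = z-score * standard error + mean
lower = z_low * standard_error + sample_mean
upper = z_high * standard_error + sample_mean

print(f"{standard_error:.1f}")        # 71.4
print(f"{lower:.0f} to {upper:.0f}")  # 2860 to 3140
```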

  • We think it's likely that the real population mean for number of calories in a chocolate

  • cake is in that range, though we're not sure.

  • What we can have confidence in is that if we're in a situation where we're constantly

  • taking samples like this and we assume that the true mean is inside of every confidence

  • interval, we'll only be wrong 5% of the time.

  • For example, a gummy worm factory periodically checks whether their bagging machines are

  • calibrated correctly.

  • So each week, they take a sample of 100 bags of gummy worms, measure the mean weight and

  • standard deviation, and calculate a 95% confidence interval.

  • They use the confidence interval to make a decision about whether to pay an expensive

  • repairman to come repair the gummy worm bagging machine.

  • They want their bags of gummy worms to have around 10oz of gummy treats, and decide that

  • as long as the confidence interval contains 10oz--their ideal weight--they'll assume

  • their machine is fine.

  • If the machine really is fine, decisions based on their confidence intervals will lead them to call a repairman unnecessarily

  • only about 5% of the time.

  • Many researchers use confidence intervals to see if they contain a certain value of interest.

  • A researcher may want to know if, say, a certain number of calories in cake is plausible.

  • If that value falls within their CI, it seems plausible, but it's not

  • possible to rule it out even if it's outside the interval.

  • That's because you don't know if you got one of the 95% of CIs that contain the true mean or one of the

  • 5% that don't.

  • You don't always need to use a confidence level of 95%; we can calculate other confidence intervals too.

  • You can calculate a 99% confidence interval, or really any percentage confidence interval.

  • But if you try to calculate a 100% confidence interval, it'll always be negative infinity

  • to positive infinity, which just shows that the larger you want your confidence percentage

  • to be, the wider your interval will be.
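
To make the widening concrete, here is a small Python sketch that reuses the chocolate cake numbers and computes z-based intervals at several confidence levels. The particular levels are arbitrary choices for illustration. A level of exactly 100% is left out because it has no finite cutoff.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sample_sd, n = 3000, 500, 49   # cake example from the transcript
standard_error = sample_sd / sqrt(n)

widths = []
for level in (0.80, 0.90, 0.95, 0.99, 0.999):
    # z cutoff that leaves (1 - level) / 2 in each tail
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * standard_error
    widths.append(2 * margin)
    print(f"{level:.1%}: {sample_mean - margin:.0f} to {sample_mean + margin:.0f}")
```

Each step up in confidence level pushes the cutoffs further into the tails, so every interval in the list is wider than the one before it.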

  • You can be more hopeful that your confidence interval contains the true population mean,

  • but it's not going to be that helpful.

  • So there's a balancing act going on.

  • You want a confidence interval that's narrow enough to be useful, but wide enough that

  • the true population mean will usually be inside a confidence interval of that percent.

  • We can't always have large samples.

  • It's often the case that there's not enough time or money to collect 100s of data points

  • to calculate a confidence interval.

  • With small sample sizes, the distribution of sample means isn't always exactly normal,

  • so we often use a t-distribution instead of a z-distribution to find out where the middle

  • 95% of our data is.

  • The t-distribution, like the z-distribution, is a continuous probability distribution that's unimodal.

  • It's a useful way to represent sampling distributions.

  • The t-distribution changes its shape according to how much information there is.

  • With small sample sizes there's less information so the t-distribution has thicker tails to

  • represent that our estimates are more uncertain when there's not much data.

  • However, as we get more and more data, the t-distribution becomes nearly identical to the z-distribution.
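
The thicker tails can be seen with a quick simulation. This is a sketch under stated assumptions, not the video's method: it draws samples from a standard normal population, computes a t-statistic for each, and counts how often it lands beyond the z cutoffs of plus or minus 1.96. The helper name `tail_fraction` is made up for this example.

```python
import random
from math import sqrt
from statistics import mean, stdev


def tail_fraction(n, repeats, rng, cutoff=1.96):
    """Fraction of simulated t-statistics that land beyond +/- cutoff."""
    beyond = 0
    for _ in range(repeats):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        # t-statistic: sample mean divided by its estimated standard error
        t_stat = mean(sample) / (stdev(sample) / sqrt(n))
        if abs(t_stat) > cutoff:
            beyond += 1
    return beyond / repeats


rng = random.Random(7)  # fixed seed so the run is repeatable
small = tail_fraction(n=5, repeats=5000, rng=rng)
large = tail_fraction(n=100, repeats=5000, rng=rng)
print(f"n=5:   {small:.1%} beyond +/-1.96")
print(f"n=100: {large:.1%} beyond +/-1.96")
```

With only 5 observations, noticeably more than 5% of the statistics fall outside the z cutoffs, while with 100 observations the share is close to 5%. That gap is what the wider t cutoffs are compensating for.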

  • Generally, sample sizes that are greater than 30 are considered “large enough” because

  • scientists generally believe that sampling distributions where the sample is 30+ are

  • close enough to normal...though 30 is an arbitrary cutoff just like 0.05.

  • However, when we're estimating population proportions, like the proportion of people

  • who are color blind, the general rule is that your sample size needs to be big enough so

  • that on average, you'd expect to get at least 10 colorblind, and at least 10 non-colorblind people.

  • For similar reasons, most people consider that “close enough”.

  • Since about 8% of males are colorblind, if I only had a sample of 50 males, on average

  • I'd expect around 4 males per sample to be colorblind, so my sample size wouldn't

  • be quite big enough to assume it's normal.

  • Instead I'd use the almost normal t-distribution.

  • If a drug that's being developed claimed to reduce the proportion of colorblind males

  • born to mothers who took it, we could take a sample of 50 male infants to see if the

  • proportion of colorblindness is different from 8%.

  • Though colorblindness isn't usually life threatening, it can be inconvenient, so you

  • decide to calculate a confidence interval to see if it's likely to be effective.

  • After randomly selecting 50 male infants from mothers who took the drug, you calculate the

  • sample proportion of colorblind infants, which is 6%, and calculate the distribution of sample

  • proportions which has a mean of 6%--the same as the sample proportion--and a standard error of 0.033.

  • Since our sample size isn't big enough to assume that the distribution of sample proportions

  • is shaped like the z-distribution, we can use the t-distribution to calculate the range

  • of our 95% confidence interval.

  • I mentioned before that the t-distribution's shape changes with how much data we have.

  • We'll talk more in detail later as to how to choose the right t-distribution, but for

  • now, we'll use this one:

  • While t-score tables do exist, it's often easier to have a statistical program calculate

  • the t-values that correspond to the 2.5th and 97.5th percentiles, since there are many

  • different t-distributions.

  • Your computer tells you that the t-values corresponding to those percentiles are 2.01 and -2.01.

  • And to convert to a raw score from a t-score, we again use this formula, just with a t-score

  • instead of a z-score.

  • Our 95% confidence interval for the proportion of colorblind males is -0.63% to 12.63%.
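
Here is a short Python sketch of that calculation, using the t-value and standard error quoted in the transcript. The standard error formula sqrt(p(1 - p) / n) is the usual one for a proportion, and it is an assumption that the video used it. It gives about 0.0336, which the transcript quotes as 0.033.

```python
from math import sqrt

p_hat = 0.06    # sample proportion of colorblind infants
n = 50          # infants in the sample
t_crit = 2.01   # t-value quoted in the transcript

# Usual standard error of a proportion (an assumption about the video's method).
se_unrounded = sqrt(p_hat * (1 - p_hat) / n)
print(f"{se_unrounded:.4f}")  # 0.0336

se = 0.033      # value quoted in the transcript
# raw value = t-score * standard error + sample proportion
lower = p_hat - t_crit * se
upper = p_hat + t_crit * se
print(f"{lower:.2%} to {upper:.2%}")  # -0.63% to 12.63%
print(lower <= 0.08 <= upper)         # True
```

The lower bound dips below zero, which a real proportion cannot do; that is a side effect of using a symmetric interval with such a small sample.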

  • 8% is inside our confidence interval, so it's not too much of a stretch to think that 8%

  • could be the true population proportion, even though we only observed a sample proportion of 6%.

  • Based on this confidence interval we don't have any evidence to conclude whether this

  • medicine is effective or not.

  • So since the company researching the drug is pretty cautious, they decide not to go

  • ahead with it.

  • One place you may have seen confidence intervals “in the wild” is in the news during election season.

  • When newscasters report results from exit polls they'll usually say something like

  • “Candidate A is tracking at 64%, with a margin of error of 3%,” or you may see a

  • chart like this:

  • The margin of error is usually telling you how far the bounds of the confidence interval

  • are from the mean, and is represented by this part of the confidence interval formula:

  • The margin of error, just like a confidence interval, reflects the uncertainty that surrounds

  • sample estimates of parameters like the mean or a proportion.

  • If a poll shows that a Presidential candidate is tracking at 64% of the vote, plus or

  • minus 3%, we shouldn't be surprised if it turns out that the true vote was 61%, since

  • that's within the margin of error.

  • You can think of values inside the margin of error or confidence interval as values

  • that might be reasonable estimates of the true population parameter.
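
As a sketch of where a figure like "plus or minus 3%" can come from, the Python below computes a 95% margin of error for a proportion. The sample size of 1,000 is hypothetical, since the transcript does not say how many voters were polled, and the z cutoff of 1.96 assumes a large enough sample for the normal approximation.

```python
from math import sqrt

p_hat = 0.64   # candidate's share in the poll, from the transcript
n = 1000       # hypothetical sample size, not given in the transcript
z = 1.96       # critical value for 95% confidence

# margin of error = critical value * standard error
margin_of_error = z * sqrt(p_hat * (1 - p_hat) / n)
print(f"{margin_of_error:.1%}")  # 3.0%
print(f"{p_hat - margin_of_error:.0%} to {p_hat + margin_of_error:.0%}")  # 61% to 67%
```

The margin of error is half the width of the confidence interval, so shrinking it means polling more people or accepting a lower confidence level.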

  • Confidence intervals quantify our uncertainty.

  • They also demonstrate the tradeoff of accuracy for precision.

  • A 100% confidence interval will always contain the true population mean, but it's useless.

  • We have to sacrifice a little bit of accuracy in order to gain more precision.

  • A 99% confidence interval will give us a more useful range since it won't be infinitely

  • long... but it's now possible that our confidence interval won't contain the true mean.

  • And you've probably encountered this tradeoff in your daily life.

  • Say you're running a marathon (like everybody does) and you want to load up your iPhone

  • with music, but you don't know how long you're going to take. You could buy 150

  • songs on iTunes, which is expensive, or you could buy only 70 and have a chance of running

  • out of music.

  • You increase your risk of not having enough, but then again you're saving yourself from

  • having to buy 80 extra songs.

  • Maybe it's time for a streaming service?

  • Confidence intervals demonstrate this delicate balancing act... and help us understand how

  • to hit the sweet spot of information vs. accuracy.

  • Thanks for watching, I'll see you next time in my gold lamé pants.

Confidence Intervals: Crash Course Statistics #20

Posted by 林宜悉 on January 14, 2021