[♪ INTRO]

A little over a decade ago, a neuroscientist stopped by a grocery store on his way to his lab to buy a large Atlantic salmon. The fish was placed in an MRI machine, and then it completed what was called an "open-ended mentalizing task" where it was asked to determine the emotions that were being experienced by different people in photos. Yes, the salmon was asked to do that. The dead one from the grocery store. But that's not the weird part. The weird part is that researchers found that so-called significant activation occurred in neural tissue in a couple places in the dead fish.

Turns out, this was a little bit of a stunt. The researchers weren't studying the mental abilities of dead fish; they wanted to make a point about statistics, and how scientists use them. Which is to say, stats can be done wrong, so wrong that they can make a dead fish seem alive.

A lot of the issues surrounding scientific statistics come from a little something called a p-value. The p stands for probability, and it refers to the probability that you would have gotten the results you did just by chance. There are lots of other ways to provide statistical support for your conclusion in science, but p-value is by far the most common, and, I mean, it's literally what scientists mean when they report that their findings are “significant”. But it's also one of the most frequently misused and misunderstood parts of scientific research. And some think it's time to get rid of it altogether.

The p-value was first proposed by a statistician named Ronald Fisher in 1925. Fisher spent a lot of time thinking about how to determine if the results of a study were really meaningful. And, at least according to some accounts, his big breakthrough came after a party in the early 1920s. At this party there was a fellow scientist named Muriel Bristol, and reportedly, she refused a cup of tea from Fisher because he had added milk after the tea was poured.
She only liked her tea when the milk was added first. Fisher didn't believe she could really taste the difference, so he and a colleague designed an experiment to test her assertion.

They made eight cups of tea, half of which were milk first, and half of which were tea first. The order of the cups was random, and, most importantly, unknown to Bristol, though she was told there would be four of each cup. Then, Fisher had her taste each tea one by one and say whether that cup was milk first or tea first. And to Fisher's great surprise, she went 8 for 8. She guessed correctly every time which cup was tea-first and which was milk-first!

And that got him to thinking, what are the odds that she got them all right just by guessing? In other words, if she really couldn't taste the difference, how likely would it be that she got them all right? He calculated that there are 70 possible orders for the 8 cups if there are four of each mix. Therefore, the probability that she'd guess the right one by luck alone is 1 in 70. Written mathematically, the value of P is about 0.014.

That, in a nutshell, is a p-value, the probability that you'd get that result if chance is the only factor. In other words, it's the probability that, even though there's really no relationship between the two things you're testing, in this case, how tea is mixed versus how it tastes, you'd still wind up with data that suggest there is a relationship.

Of course, the definition of “chance” varies depending on the experiment, which is why p-values depend a lot on experimental design. Say Fisher had only made 6 cups, 3 of each tea mix. Then, there are only 20 possible orders for the cups, so the odds of getting them all correct are 1 in 20, a p-value of 0.05.

Fisher went on to describe an entire field of statistics based on this idea, which we now call Null Hypothesis Significance Testing. The “null hypothesis” refers to the experiment's assumption of what “by chance” looks like.
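
The tea arithmetic above is easy to check for yourself. Here's a minimal Python sketch, assuming an even split of cups and a taster who knows the split, so a "guess" is just a choice of which half of the cups are milk-first. The function name `p_value_all_correct` is made up for this illustration; it isn't something Fisher wrote down.

```python
from math import comb


def p_value_all_correct(cups: int) -> float:
    """Chance of labelling every cup correctly by pure guessing.

    Assumes half the cups are milk-first, half are tea-first, and the
    taster knows that, so each guess picks which half is milk-first.
    """
    if cups <= 0 or cups % 2:
        raise ValueError("cups must be a positive even number")
    # Number of ways to choose which half of the cups are milk-first.
    # If she is only guessing, every one of these is equally likely.
    arrangements = comb(cups, cups // 2)
    return 1 / arrangements


for cups in (8, 6):
    arrangements = comb(cups, cups // 2)
    p = p_value_all_correct(cups)
    print(f"{cups} cups: {arrangements} possible orders, p = {p:.3f}")
```

With 8 cups that gives 70 possible orders and a p-value of about 0.014, and with 6 cups it gives 20 possible orders and a p-value of 0.05, matching the numbers above. The "she is only guessing" assumption baked into the function is the null hypothesis for this experiment.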
Basically, researchers calculate how likely it is that they've gotten the data that they did, even if the effect they're testing for doesn't exist. Then, if the results would be extremely unlikely to occur with the null hypothesis being true, they can infer that it isn't. So, in statistical speak, with a low enough p-value, they can reject the null hypothesis, leaving them with whatever alternate hypothesis they had as the explanation for the results.

The question becomes, how low does a p-value have to be before you can reject that null hypothesis? Well, the standard answer used in science is less than 1 in 20 odds, or a p-value below 0.05. The problem is, that's an arbitrary choice. It also traces back to Fisher's 1925 book, where he said 1 in 20 was quote “convenient”. A year later, he admitted the cutoff was somewhat subjective, but that 0.05 was generally his personal preference.

Since then, the 0.05 threshold has become the gold standard in scientific research. A p of less than 0.05, and your results are quote “significant”. It's often talked about as determining whether or not an effect is real. But the thing is, a result with a p-value of 0.049 isn't more true than one with a p-value of 0.051. It's just ever so slightly less likely to be explained by chance or sampling error. This is really key to understand. You're not more right if you get a lower p-value, because a p-value says nothing about how correct your alternate hypothesis is.

Let's bring it back to tea for a moment. Bristol aced Fisher's 8-cup study by getting them all correct, which as we noted, has a p-value of 0.014, solidly below the 0.05 threshold. But it being unlikely that she randomly guessed doesn't prove she could taste the difference. See, it tells us nothing about other possible explanations for her correctness. Like, if the teas had different colors rather than tastes. Or she secretly saw Fisher pouring each cup! Also, it still could have been a one-in-seventy fluke.
And sometimes, one might even argue often, 1 in 20 is not a good enough threshold to really rule out that a result is a fluke. Which brings us back to that seemingly undead fish.

The spark of life detected in the salmon was actually an artifact of how MRI data is collected and analyzed. See, when researchers analyze MRI data, they look at small units about a cubic millimeter or two in volume. So for the fish, they took each of these units and compared the data before and after the pictures were shown to the fish. That means even though they were just looking at one dead fish's brain before and after, they were actually making multiple comparisons, potentially, thousands of them.

The same issue crops up in all sorts of big studies with lots of data, like nutritional studies where people provide detailed diet information about hundreds of foods, or behavioral studies where participants fill out surveys with dozens of questions. In all cases, even though each individual comparison is unlikely to produce a false positive, with enough comparisons, you're bound to find some.

There are statistical solutions for this problem, of course, which are simply known as multiple comparison corrections. Though they can get fancy, they usually amount to lowering the threshold for p-value significance. And to their credit, the researchers who looked at the dead salmon also ran their data with multiple comparison corrections. When they did, their data was no longer significant.

But not everyone uses these corrections. And though individual studies might give various reasons for skipping them, one thing that's hard to ignore is that researchers are under a lot of pressure to publish their work, and significant results are more likely to get published. This can lead to p-hacking: the practice of analyzing or collecting data until you get significant p-values.
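
To see how fast false positives pile up, here's a rough Python sketch of the multiple-comparisons problem. It assumes every comparison is independent and that no real effect exists anywhere, which is a simplification (neighboring units in an MRI scan are not independent). The correction shown is the Bonferroni correction, one common way of lowering the threshold; it is picked here for illustration and is not necessarily what the salmon researchers used. Both function names are made up for this example.

```python
def chance_of_false_positive(comparisons: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `comparisons` independent tests dips
    below `alpha` when there is no real effect in any of them."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_threshold(comparisons: int, alpha: float = 0.05) -> float:
    """Stricter per-comparison cutoff: split alpha evenly across all tests
    so the overall false-positive chance stays at or below alpha."""
    return alpha / comparisons


for m in (1, 20, 100, 1000):
    uncorrected = chance_of_false_positive(m)
    cutoff = bonferroni_threshold(m)
    corrected = chance_of_false_positive(m, cutoff)
    print(
        f"{m:>4} comparisons: "
        f"uncorrected {uncorrected:.3f}, "
        f"corrected {corrected:.3f} (cutoff {cutoff:.5f})"
    )
```

Under these assumptions, just 20 comparisons at the 0.05 threshold already give you roughly a 64 percent chance of at least one fluke, and the corrected cutoff pulls that back under 5 percent. The same arithmetic is a big part of why p-hacking works: try enough analyses, and one of them is likely to cross the threshold by chance alone.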
This doesn't have to be intentional, because researchers make many small choices that lead to different results, like we saw with 6 versus 8 cups of tea. This has become such a big issue because, unlike when these statistics were invented, people can now run tests lots of different ways fairly quickly and cheaply, and just go with what's most likely to get their work published.

Because of all of these issues surrounding p-values, some are arguing that we should get rid of them altogether. And one journal has totally banned them. And many who say we should ditch the p-value are pushing for an alternate statistical system called Bayesian statistics.

P-values, by definition, only examine null hypotheses. The result is then used to infer if the alternative is likely. Bayesian statistics actually look at the probability of both the null and alternative hypotheses. What you wind up with is an exact ratio of how likely one explanation is compared to another. This is called a Bayes factor. And this is a much better answer if you want to know how likely you are to be wrong.

This system was around when Fisher came up with p-values. But, depending on the dataset, calculating Bayes factors can require some serious computing power, power that wasn't available at the time, since, y'know, it was before computers. Nowadays, you can have a huge network of computers thousands of miles from you to run calculations while you throw a tea party.

But the truth is, replacing p-values with Bayes factors probably won't fix everything. A loftier solution is to completely separate a study's publishability from its results. This is the goal of two-step manuscript submission, where you submit an introduction to your study and a description of your method, and the journal decides whether to publish before seeing your results.
That way, in theory at least, studies would get published based on whether they represent good science, not whether they worked out the way researchers hoped, or whether a p-value or Bayes factor was more or less than some arbitrary threshold. This sort of idea isn't widely used yet, but it may become more popular as statistical significance meets more sharp criticism.

In the end, hopefully, all this controversy surrounding p-values means that academic culture is shifting toward a clearer portrayal of what research results do and don't really show. And that will make things more accessible for all of us who want to read and understand science, and keep any more zombie fish from showing up.

Now, before I go make myself a cup of Earl Grey, milk first, of course, I want to give a special shout out to today's President of Space, SR Foxley. Thank you so much for your continued support! Patrons like you give us the freedom to dive deep into complex topics like p-values, so really, we can't thank you enough. And if you want to join SR in supporting this channel and the educational content we make here at SciShow, you can learn more at Patreon.com/SciShow. Cheerio!

[♪ OUTRO]
P-values Broke Scientific Statistics—Can We Fix Them? (Level B1, intermediate. Posted by 林宜悉 on January 14, 2021.)