The method that can "prove" almost anything - James A. Smith

In 2011, a group of researchers conducted a scientific study to find an impossible result: that listening to certain songs can make you younger.

Their study involved real people, truthfully reported data, and commonplace statistical analyses.

So, how did they do it?

The answer lies in a statistical method scientists often use to try to figure out whether their results mean something or if they're random noise.

In fact, the whole point of the music study was to point out ways this method can be misused.

A famous thought experiment explains the method.

There are eight cups of tea, four with the milk added first and four with the tea added first.

A participant must determine which are which according to taste.

There are 70 different ways the cups can be sorted into two groups of four and only one is correct.
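
The figure of 70 can be checked by brute force. The snippet below is a minimal sketch, not part of the talk: it labels the cups 0 to 7 (an arbitrary choice) and lists every way to pick the four that get called "milk first", with the other four becoming "tea first" by elimination.

```python
from itertools import combinations
from math import comb

cups = range(8)  # eight cups, labelled 0..7 (the labels are illustrative)

# Every way to choose which four cups are declared "milk first";
# the remaining four are then "tea first" by elimination.
selections = list(combinations(cups, 4))

print(len(selections))  # number of distinct sortings
print(comb(8, 4))       # the same count from the binomial coefficient
```

Both lines print 70, and exactly one of those selections matches how the cups were really poured.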

So, can she taste the difference?

That's our research question.

To analyze her choices, we define what's called a null hypothesis, that she can't distinguish the teas.

If she can't distinguish the teas, she'll still get the right answer 1 in 70 times by chance.

1 in 70 is roughly 0.014. That single number is called a p-value.

In many fields, a p-value of .05 or below is considered statistically significant, meaning there's enough evidence to reject the null hypothesis.

Based on a p-value of .014, they'd rule out the null hypothesis that she can't distinguish the teas.
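
The same counting argument gives a p-value for a less-than-perfect sort. The sketch below rests on an assumption the talk does not spell out: the p-value counts every outcome at least as good as the one observed, which is the usual convention for this kind of test. The function name and the 0.05 cutoff constant are illustrative.

```python
from math import comb

ALPHA = 0.05  # the conventional significance threshold mentioned above


def p_value_at_least(k_correct: int, n_each: int = 4) -> float:
    """Chance of naming at least k_correct of the true milk-first cups
    when choosing n_each cups out of 2 * n_each purely by guessing."""
    total = comb(2 * n_each, n_each)
    favourable = sum(
        comb(n_each, k) * comb(n_each, n_each - k)
        for k in range(k_correct, n_each + 1)
    )
    return favourable / total


for k in range(4, -1, -1):
    p = p_value_at_least(k)
    print(f"{k} of 4 correct: p = {p:.3f}, significant: {p <= ALPHA}")
```

Under these assumptions only a perfect sort falls below 0.05. Getting three of four right gives 17 in 70, or roughly 0.24.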

Though p-values are commonly used by both researchers and journals to evaluate scientific results, they're really confusing, even for many scientists.

That's partly because all a p-value actually tells us is the probability of getting a certain result, assuming the null hypothesis is true.

So if she correctly sorts the teas, the p-value is the probability of her doing so assuming she can't tell the difference.

But the reverse isn't true: the p-value doesn't tell us the probability that she can taste the difference, which is what we're trying to find out.
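
One way to see the gap is to work out the probability that she can taste the difference, given a perfect sort, using Bayes' rule. This is a sketch under two loud assumptions that are not in the talk: a person who really can taste the difference always sorts correctly, and we already hold some prior belief about how likely that ability is. The prior values below are made up for illustration.

```python
from fractions import Fraction

P_CORRECT_IF_GUESSING = Fraction(1, 70)  # the p-value from the tea example
P_CORRECT_IF_TASTER = Fraction(1)        # assumption: a real taster never errs


def prob_can_taste_given_correct(prior: Fraction) -> Fraction:
    """Bayes' rule: P(can taste | sorted all eight cups correctly)."""
    taster = prior * P_CORRECT_IF_TASTER
    guesser = (1 - prior) * P_CORRECT_IF_GUESSING
    return taster / (taster + guesser)


# Hypothetical priors: an even chance, and a 1-in-100 chance, that she can taste it.
for prior in (Fraction(1, 2), Fraction(1, 100)):
    posterior = prob_can_taste_given_correct(prior)
    print(f"prior {float(prior):.2f} -> posterior {float(posterior):.3f}")
```

The p-value is 1 in 70 in both cases, yet the answer to the research question moves from about 0.99 to about 0.41 depending on the prior. The p-value alone cannot supply that number.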

So if a p-value doesn't answer the research question, why does the scientific community use it?

Well, because even though a p-value doesn't directly state the probability that the results are due to random chance, it usually gives a pretty reliable indication.

At least, it does when used correctly. And that's where many researchers, and even whole fields, have run into trouble.

Most real studies are more complex than the tea experiment. Scientists can test their research question in multiple ways, and some of these tests might produce a statistically significant result, while others don't.

It might seem like a good idea to test every possibility. But it's not, because with each additional test, the chance of a false positive increases.
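
A rough calculation shows how quickly that chance grows. The sketch assumes the tests are independent and that every null hypothesis is true, which is a simplification; real analyses of the same data are usually correlated, so the exact numbers will differ.

```python
ALPHA = 0.05  # per-test significance threshold


def chance_of_any_false_positive(n_tests: int, alpha: float = ALPHA) -> float:
    """Probability that at least one of n_tests independent tests comes out
    'significant' when there is no real effect in any of them."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 3, 5, 10, 20):
    print(f"{n:2d} tests: {chance_of_any_false_positive(n):.3f}")
```

With one test the chance is the expected 5%. With twenty independent tests it is roughly 64%.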

Searching for a low p-value, and then presenting only that analysis, is often called p-hacking.

It's like throwing darts until you hit a bullseye and then saying you only threw the dart that hit the bullseye. This is exactly what the music researchers did.

They played three groups of participants each a different song and collected lots of information about them.

The analysis they published included only two out of the three groups.

Of all the information they collected, their analysis only used participants' fathers' age, to "control for variation in baseline age across participants".

They also paused their experiment after every ten participants, and continued if the p-value was above .05, but stopped when it dipped below .05.
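
The effect of that stopping rule can be simulated. The sketch below is not the researchers' analysis. It assumes two groups drawn from the same normal distribution (so any "effect" is noise), a simple z-test with known standard deviation in place of whatever test the study used, a check after every ten participants per group, and a cap of 100 per group. All names and settings are illustrative.

```python
import math
import random


def two_sided_p(mean_diff: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test for two groups with known SD of 1."""
    z = mean_diff / math.sqrt(2 / n_per_group)
    return math.erfc(abs(z) / math.sqrt(2))


def run_study(rng: random.Random, peek_every: int, max_n: int,
              alpha: float = 0.05) -> bool:
    """Simulate one study on pure noise; return True if it ends 'significant'."""
    sum_a = sum_b = 0.0
    for n in range(1, max_n + 1):
        sum_a += rng.gauss(0, 1)
        sum_b += rng.gauss(0, 1)
        if n % peek_every == 0 or n == max_n:
            if two_sided_p((sum_a - sum_b) / n, n) < alpha:
                return True  # stop as soon as the result looks significant
    return False


def false_positive_rate(peek_every: int, max_n: int = 100,
                        n_studies: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    hits = sum(run_study(rng, peek_every, max_n) for _ in range(n_studies))
    return hits / n_studies


fixed = false_positive_rate(peek_every=100)   # look once, at the end
peeking = false_positive_rate(peek_every=10)  # look after every ten
print(f"single look: {fixed:.3f}, stop-when-significant: {peeking:.3f}")
```

With a single look the false positive rate stays near the nominal 5%. Checking repeatedly and stopping at the first p-value below .05 pushes it well above that, even though no individual test was computed incorrectly.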

They found that participants who heard one song were 1.5 years younger than those who heard the other song, with a p-value of .04.

Usually it's much tougher to spot p-hacking, because we don't know the results are impossible: the whole point of doing experiments is to learn something new.

Fortunately, there's a simple way to make p-values more reliable: pre-registering a detailed plan for the experiment and analysis beforehand that others can check, so researchers can't keep trying different analyses until they find a significant result.

And, in the true spirit of scientific inquiry, there's even a new field that's basically science doing science on itself: studying scientific practices in order to improve them.

This new field has emerged in response to a crisis in science, and p-hacking is just one part of that crisis. So, what's going on? And can we fix it? Learn more with this video.
