  • Hey, Vsauce. Michael here. About 6 percent of everything you say and read and write is

    嘿,Vsauce的各位,我是Michael。所有你說的、讀的、寫的占大約百分之六的字,

  • is the...

    就是...

  • "the" - is the most used word in the English language. About one out of every 16 words we encounter on a daily basis is "the."

    "The" 在英語中是最常被使用到的字。在我們日常生活中所碰到的字裡,大約每 16 個字就會出現一次THE

  • The top 20 most common English words

    前 20 個最常使用到的英文單字依照順序是

  • in order are "the," "of," "and," "to," "a," "in," "is," "I," "that," "it," "for," "you,"

  • "was," "with," "on," "as," "have," "but," "be," "they."

  • That's a fun fact. A piece of trivia but it's also more. You see, whether the most commonly used words are ranked across an entire language,

    這是件有趣的事實。雖然只是件小事,但是這還說明了更多事情呢!你看,最常使用的單字不管是列在一個語言中,

  • or in just one book or article, almost every time a bizarre pattern emerges.

    或者只是在一本書或文章中,幾乎每次都會發生一個奇怪的現象。

  • The second most used word will appear about half as often as the most used.

    第二個最常使用單字的使用頻率,大約是第一個最常使用單字使用頻率的一半;

  • The third one third as often. The fourth one fourth as often. The fifth one fifth as often.

    第三個最常單字的使用頻率,通常就是第一個最常使用單字的1/3;第四個最常單字的使用頻率,通常就是第一個最常使用單字的1/4;第五個最常單字的使用頻率,通常就是第一個最常使用單字的1/5;

  • The sixth one sixth as often, and so on all the way down.

    第六個最常單字的使用頻率,通常就是第一個最常使用單字的1/6,依此類推。

  • Seriously. For some reason, the number of times a word is used is just proportional to one over its rank.

    我很認真。因為某些原因,單字的使用次數和單字的排名的倒數成正比。

  • Word frequency and ranking on a log log graph follow a nice straight line.

    單字使用頻率和排名在對數關係圖中,呈現非常好看的直線。
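
As a rough illustration of the straight line described above: if counts follow count = top_count / rank exactly, then log(count) = log(top_count) - log(rank), so a least-squares fit on log-log axes has slope -1. The sketch below builds such idealized counts and fits that slope. The function names and the top count of 1,000,000 are illustrative assumptions, not figures from the video.

```python
import math

def ideal_zipf_counts(top_count, n_ranks):
    """Idealized Zipf counts: the word at rank r appears top_count / r times."""
    return [top_count / rank for rank in range(1, n_ranks + 1)]

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

counts = ideal_zipf_counts(top_count=1_000_000, n_ranks=1000)
slope = loglog_slope(counts)
print(f"second / first = {counts[1] / counts[0]:.3f}")
print(f"log-log slope  = {slope:.3f}")
```

Real word counts are noisier than this, so a fit on an actual corpus would only be expected to land near -1, not exactly on it.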

  • A power-law. This phenomenon is called Zipf's Law

    也就是呈現冪次定律。這種現象就是齊夫定律。

  • and it doesn't only apply to English. It also applies to other languages, like, well, all of them.

    而齊夫定律不只能應用在英文。這也適用於其他語言,像是... 嗯... 全部都是。

  • Even ancient languages we haven't been able to translate yet.

    甚至連我們都還沒能翻譯的古老語言也都是這樣。

  • And here's the thing. We have no idea why. It's surprising that something as complex as reality

    重點是,我們不知道為什麼。一個和真實世界一樣複雜的現象

  • should be conveyed by something as creative as language in such a predictable way.

    居然可以用和語言一樣有創意的一種預測方法來解釋,這是很令人吃驚的!

  • How predictable? Well, watch this. According to WordCount.org,

    是多麼可預測呢?恩,看一下這個。根據 WordCount.org,

  • which ranks words as found in the British National Corpus, "sauce" is the

    這個組織是用英國國家辭庫找到的單字來排名,sauce是

  • 5,555th most common English word. Now, here is a list of how many times

    第5,555名最常使用到的英語單字。現在,這是每個字在維基百科上、

  • every word on Wikipedia and in the entire Gutenberg Corpus of tens of

    在 Gutenberg 辭庫中、還有數以萬計公共圖書的單字使用頻率清單。

  • thousands of public domain books shows up. The most used word, "the", shows up about

    最常使用的 The 出現了大概181百萬次。

  • 181 million times. Knowing these two things, we can estimate that the word

    目前我們知道兩件事,我們可以估計單字

  • "sauce" should appear about thirty thousand times on Wikipedia and

    「sauce」應該會在維基百科和 Gutenberg 辭庫出現大概 30,000 次

  • Gutenberg combined. And it pretty much does.

    這是非常有可能的。
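
The estimate above is just the top word's count divided by the rank of the word in question. A minimal sketch of that arithmetic, using the two figures quoted in the transcript (the function name is an illustrative assumption, and the calculation assumes the British National Corpus rank carries over to the Wikipedia and Gutenberg counts, as the video does):

```python
def zipf_estimate(top_count, rank):
    """Expected count of the word at `rank` if counts fall off as 1 / rank."""
    return top_count / rank

the_count = 181_000_000   # occurrences of "the" quoted in the transcript
sauce_rank = 5_555        # rank of "sauce" quoted in the transcript

estimate = zipf_estimate(the_count, sauce_rank)
print(f"estimated occurrences of 'sauce': {estimate:,.0f}")
```

This comes out at a little under 33,000, in line with the "about thirty thousand" quoted.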

  • What gives? The world is chaotic. Things are distributed in a myriad of ways, not just

    這代表什麼?世界很混亂的。事物以各種方法散布,不只是冪次定律。

  • power laws. And language is personal,

    語言是個人化、

  • intentional, idiosyncratic. What about the world and ourselves could cause such

    有意圖、有異質性的。世界和我們是如何造成這種遵循基本法則的複雜行為活動呢?

  • complex activities and behaviors to follow such a basic rule? We literally

    我們基本上無從得知。

  • don't know. More than a century of research has yet to close the case.

    儘管超過一世紀的研究,仍無法破解這個謎團。

  • Moreover, Zipf's law doesn't just mysteriously describe word use. It's

    此外,齊夫定律不只神秘的描述了字詞的使用。

  • also found in city populations, solar flare intensities, protein sequences and

    我們也可以從齊夫定律中發現城市的人口組成、太陽烈焰的密集程度、蛋白質基因序列,

  • immune receptors, the amount of traffic websites get, earthquake magnitudes, the

    還有免疫感官器、網站的進出流量、地震的強度、

  • number of times academic papers are cited, last names, the firing patterns of

    學術論文被引用幾次、姓氏、神經網絡的觸發模式、

  • neural networks, ingredients used in cookbooks, the number of phone calls

    在食譜中使用到的食材、人們收到幾通電話、

  • people received, the diameter of Moon craters, the number of people that die

    月球隕石坑的直徑、多少人死於戰爭、

  • in wars, the popularity of opening chess moves, even the rate at which we forget.

    西洋棋開場的受歡迎程度,甚至是我們遺忘的速率。

  • There are plenty of theories about why language is 'zipf-y,' but no firm conclusions

    有非常多理論解釋了語言為何遵循齊夫定律,但是都沒有有力的結論,

  • and this video doesn't contain a definite explanation either. Sorry, I know

    而這個影片同樣也沒有包含一個確切的解釋。很抱歉,我知道這很掃興,

  • that's a bummer, since we appear to like knowing more than mystery. But that said,

    因為比起謎團,我們似乎更喜歡知道答案。但話說回來,

  • we also ask more than we answer. So let's dive into Zipf's ramifications, some

    我們問的了比回答的多。所以,我們就來更深入探討齊夫衍生物、一些相關的模型、

  • related patterns, some possible explanations and the depth of the

    一些可能的解釋,還有謎團更深入的層面。

  • mystery itself. Zipf's law was popularized by George Zipf,

    齊夫定律是因為George Zipf這個人而廣為人知,

  • a linguist at Harvard University. It is a discrete form of the continuous Pareto

    他是個在哈佛大學的語言學家。這是個連續Pareto分布的離散形式,

  • distribution from which we get the Pareto Principle. Because so many

    由Pareto分布中,我們可以得到Pareto原理。因為很多

  • real-world processes behave this way, the Pareto Principle tells us that, as a rule

    真實世界的運作都是以此形式表現,Pareto原理告訴我們的第一原則是,

  • of thumb, it's worth assuming that 20% of the causes are responsible for 80% of

    20%的事件導致80%的結果,這是非常值得的假設。

  • the outcome,

  • like in language, where the most frequently used 18 percent of words

    像是在語言當中,最常使用的前18%的單字中,就已經占了所有出現字詞的80%。

  • account for over 80% of word occurrences. In 1896, Vilfredo Pareto showed that

    在1896年,Vilfredo Pareto展示了,

  • approximately 80% of the land in Italy was owned by just twenty percent of the

    80%的義大利國土,被20%的義大利人口所擁有。

  • population. It is said that he later noticed in his garden 20 percent of his

    據說,他還注意到在他的花園中,

  • pea pods contained eighty percent of the peas. He and other researchers looked at

    20%的豌豆莢,涵蓋了80%的豌豆。他和其他研究者觀察了其他資料集,

  • other datasets and found that this 80-20 imbalance comes up a lot in the world.

    並且發現80/20法則在全世界處處可見。
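
The earlier claim that the most frequent 18 percent of words cover over 80 percent of occurrences can be checked against an idealized Zipf distribution. The sketch below assumes pure 1/rank weights and a fixed vocabulary size. Both are simplifying assumptions, and the vocabulary sizes are arbitrary illustrative values, so the result is only indicative.

```python
def top_share(vocab_size, top_fraction):
    """Share of all occurrences covered by the top `top_fraction` of ranks,
    assuming the word at rank r has weight 1 / r."""
    weights = [1 / rank for rank in range(1, vocab_size + 1)]
    cutoff = round(vocab_size * top_fraction)
    return sum(weights[:cutoff]) / sum(weights)

for vocab_size in (10_000, 100_000):
    share = top_share(vocab_size, 0.18)
    print(f"{vocab_size:>7} words: top 18% cover {share:.1%} of occurrences")
```

Under these assumptions the share lands somewhat above 80 percent for both vocabulary sizes, and it creeps upward as the vocabulary grows.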

  • The richest 20% of humans have 82.7% of the world's income. In the US, 20% of

    最富有的前20%的人類,擁有世界82.7%的財產。在美國,

  • patients use eighty percent of health care resources. In 2002, Microsoft

    20%的病人就用了80%的醫療資源。在2002年,

  • reported that 80% of the errors and crashes in Windows and Office are caused

    微軟回報了在Windows作業系統和Office的80%的錯誤,是由20%的程式臭蟲引起的。

  • by 20% of the bugs detected. A common rule of thumb in the business world

    在商場上,有個普遍的第一指導原則聲稱,

  • states that 20% of your customers are responsible for 80% of your profits and

    你20%的顧客會為你帶來80%的營收,

  • eighty percent of the complaints you receive will come from 20% of your

    還有你收到80%的抱怨大都是來自你20%的顧客。

  • customers. A book titled "The 80/20 Principle" even says that in a home or

    有本書叫「80/20法則」,當中甚至提到在家或辦公室中,

  • office,

  • 20% of the carpet receives 80 percent of the wear. Oh, and as Woody Allen famously

    20%的地毯承受了80%的磨損。喔,對了!就像Woody Allen的那句名言:

  • said, "eighty percent of success is just showing up." The Pareto Principle is

    「80%的成功只不過是到場而已」。Pareto原則無所不在,

  • everywhere, which is good.

    這樣真好。

  • By focusing on just 20 percent of what's wrong, you can often expect to solve

    只要專注在20%發生的事情,你經常就可以預期你能夠解決80%的問題。

  • eighty percent of the problems. A variety of different unrelated factors cause

    雖然導致這種80/20法則的不相關因素有很多種情況,

  • this to be true from case to case, but if we can get to the bottom of what causes

    但是我們可以深入探究一些案例的成因。

  • some of them,

  • maybe we'll find that one or more of those mechanisms is responsible for

    或許我們會發現導致語言中齊夫定律的一個或多個機制。

  • Zipf's law in language. George Zipf himself thought languages' interesting rank

    George Zipf他個人認為語言中有趣的使用頻率排名,

  • frequency distribution was a consequence of the Principle of Least Effort.

    是因為最少努力原則所導致的結果。

  • The tendency for life and things to follow the path of least resistance. Zipf believed

    那是一個萬物都會遵循最小阻力路徑的傾向。Zipf相信,

  • it drove much of human behavior and hypothesized that as language developed

    它驅動了大部分人類的行為,並且假設當在我們種族中語言發展時,

  • in our species, speakers naturally preferred drawing from as few words as

    說話者傾向盡可能使用少一點的文字,

  • possible to get their thoughts out there. It was easier. But in order to understand

    來表達出他們的想法。這就簡單了。但是為了理解剛剛說的,

  • what was being said,

  • listeners preferred larger vocabularies that gave more specificity, so that they

    聽者會選擇比較大的詞庫,來增加文意的明確性,這樣一來,

  • had to do less work. The compromise between listening and speaking, Zipf felt,

    他們可以做少一點事情。Zipf 覺得在聽與說之間的妥協,

  • led to the current state of language. A few words are used often and many many

    導致了現今的語言發展。少一點的字會比較常用到,很多很多

  • many words are used rarely.

    很多的字就比較少用到。

  • Recent papers have suggested that having a few short, often used, predictable words

    最近的論文指出,使用或少或短、常用或可以被預測的文字,

  • helps dissipate information load density on listeners, spacing out important vocab

    可以幫助聽者減少接收訊息的負擔,留一些心思來處理重要詞彙,

  • so that the information rate is more constant. This makes sense and much has

    這樣一來,接收訊息的速度就會比較穩定。這非常合理,

  • been learned by applying the least effort principle to other behaviors, but

    我們還從用到最少努力原則的其它行為學到更多事情。但是,

  • later researchers argued that for language, the explanation was even more

    在這之後的研究者爭論著,對於語言,解釋甚至就更加簡單了。

  • simple. Just a few years after Zipf's seminal paper, Benoit Mandelbrot showed

    在Zipf的研討會論文發表後的幾年,Benoit Mandelbrot演示了,

  • that there may be nothing mysterious about Zipf's law at all, because even if you

    齊夫定律一點都不神奇。因為就算你在鍵盤上

  • just randomly type on a keyboard you will produce words distributed according

    隨便打字,你也會產生一堆根據齊夫定律分布的文字。

  • to Zipf's law. It's a pretty cool point and this is why it happens. There are

    這是非常酷的論點,這也是為什麼它發生了。

  • exponentially more different long words than short words. For instance, the English

    目前又臭又長的相異字比短字還要多很多。舉個例子,英文字母

  • alphabet can be used to make 26 one letter words, but 26 squared 2 letter

    可以被用來組成26個一個字母的字,但是組成2個字母的字就有26平方個組合。

  • words. Also, in random typing, whenever the space bar is pressed a word terminates.

    此外,在隨機打字時,只要按下空白鍵,就是結束一個字。

  • Since there's always a certain chance that the space bar will be pressed longer

    因為空白鍵總是有機會被按下,

  • stretches of time before it happens

    在結束字詞之前的那段較長的時間,

  • are exponentially less likely than shorter ones. The combination of these

    遠遠低於較短的字。

  • exponentials is pretty 'Zipfy.' For example, if all 26 letters and the

    這些指數的綜合結果會非常「齊夫」。舉個例子,如果所有26個字母,還有空白鍵

  • spacebar are equally likely to be typed, after a letter is typed and a word has

    它們被按下的機率都相同,只要一個字母被按下,就代表一個字詞開始,

  • begun, the probability that the next input will be a space, thus creating a

    若下一個輸入的鍵是空白,而產生一個字的機率,

  • one letter word, is just one in 27. And sure enough, if you randomly generate

    就僅僅是1/27。我們可以非常肯定的說,如果你隨機產生字母,

  • characters or hire a proverbial typing monkey, about one out of every 27 or 3.7

    或者雇用了一個俗話說的打字猴,那麼大約1/27或者3.7%在空白之間的東西,

  • percent of the stuff between spaces, will be single letters. Two letter words

    將會是單一字母。兩個字母的字出現,

  • appear when after beginning a word any character but the space bar is hit - a 26

    只會在開始打字,而第二個字元不是空白時發生。

  • in 27 chance and then the space bar.

    發生機會也就是26/27,然後再按下空白。
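
The one-in-27 figure can be checked with a quick simulation of the typing monkey. This is a sketch under the transcript's assumption that all 26 letters and the space bar are equally likely. The seed, the keystroke count, and the function name are illustrative choices.

```python
import random
import string

def random_typing(n_keystrokes, seed=0):
    """Type n_keystrokes characters, each drawn uniformly from a-z and space."""
    rng = random.Random(seed)
    keys = string.ascii_lowercase + " "   # 26 letters plus the space bar
    return "".join(rng.choice(keys) for _ in range(n_keystrokes))

text = random_typing(500_000)
words = text.split()   # split() drops the empty strings between repeated spaces

single = sum(1 for w in words if len(w) == 1)
double = sum(1 for w in words if len(w) == 2)
fraction_single = single / len(words)
fraction_double = double / len(words)

print(f"{len(words)} words typed")
print(f"one-letter words: {fraction_single:.2%} (1/27 = {1 / 27:.2%})")
print(f"two-letter words: {fraction_double:.2%} (26/27 * 1/27 = {26 / 27 / 27:.2%})")
```

Because the run is random, the measured fractions only hover near the theoretical values. A longer run narrows the gap.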

  • A three-letter word is the probability of a letter, another letter and then a

    三個字母的字就是按一個字母、另一個字母,最後按下空白。

  • space. If we divide by the number of unique words of each length there can be,

    如果我們把這些相異字除以相同長度的所有相異字詞的數量,

  • we get the frequency of occurrence expected for any particular word given

    我們就可以從特定長度得知發生頻率。

  • its length. For example, the letter V will make up about 0.142 percent of

    舉例來說,單一個字V,將會組成大約隨機打字字詞數量的0.142%

  • random typing. The word "Vsauce" 0.0000000993 percent. Longer words are

    字詞Vsauce將會占0.0000000993%,更長的字就更不可能,

  • less likely, but watch this. Let's spread these frequencies out according to the

    但是看一下這個。讓我們將這些字的使用次數依照最常使用的排名列出來。

  • ranks they'd take up on a most often used list. There are 26 possible one

    因為有26個可能的單一字母字詞,

  • letter words, so each of the top 26 ranked words are expected to occur

    所以每個排名前26名的字就是會被預期這麼常發生。

  • about this often. The next 676 ranks will be taken up

    其他676個2個字母合成字會是這樣。

  • by two letter words that show up about this often. If we extend each frequency

    如果我們依據字詞有幾個字母來延伸每個字的使用頻率,

  • according to how many members it has, we get Zipf. Subsequent researchers have

    我們就會得到齊夫定律。之後的研究人員也已經

  • detailed how changing up the initial conditions can smooth the steps out.

    詳細描述了如何改變初始狀態,使得趨勢線更加圓滑。

  • Our mysterious distribution has been created out of nothing but the inevitabilities

    我們神奇的分布圖,不外乎就只是由數學中不可避免的特性而被創造出來。

  • of math.
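
The staircase described above can be written out directly. The sketch below computes the chance of any one particular word of a given length and the block of ranks that words of that length occupy. The names are illustrative, and it keeps the transcript's assumption of 27 equally likely keys.

```python
import math

ALPHABET = 26
KEYS = ALPHABET + 1   # letters plus the space bar

def length_probability(length):
    """Chance that a word, once begun, ends up with exactly `length` letters."""
    return (ALPHABET / KEYS) ** (length - 1) * (1 / KEYS)

def word_probability(length):
    """Chance of one particular word of that length (all are equally likely)."""
    return length_probability(length) / ALPHABET ** length

def rank_range(length):
    """First and last rank taken up by the words of a given length."""
    first = sum(ALPHABET ** k for k in range(1, length)) + 1
    last = first + ALPHABET ** length - 1
    return first, last

for length in range(1, 5):
    first, last = rank_range(length)
    print(f"length {length}: ranks {first}-{last}, "
          f"each word {word_probability(length):.3e}")

# Each extra letter multiplies the rank by about 26 and divides the
# per-word probability by 27, so the log-log slope is close to -1.
print(f"approximate slope: {-math.log(KEYS) / math.log(ALPHABET):.4f}")
```

For a one-letter word this gives about 0.142 percent, matching the figure quoted for V. For a six-letter word such as "vsauce" the same formula gives roughly 9.9e-9 percent. The leading digits agree with the figure quoted earlier, though the decimal position in the transcript may be worth checking against the video.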

  • So maybe there is no mystery. Maybe words are just the result of humans randomly

    所以或許根本沒有什麼謎團。或許字詞就只是人類隨機將外在和內心的世界分割成語言標籤所造成的結果。

  • segmenting the observable world and the mental world into labels and Zipf's law

    而且齊夫定律描述了你當做了什麼,某事情也就會自然發生。

  • describes what naturally happens when you do that. Case closed.

    案情偵結。

  • And as always,

    如同以往,

  • thanks for... wait a minute!

    感謝您...等等!

  • Actual language is very different from random typing. Communication is

    真實世界的語言和隨機打字是非常不同的。溝通

  • deterministic to a certain extent. Utterances and topics arrive based on

    在某種程度上是具有決定性的。口語表達和話題都是根據之前說了什麼事情而決定的。

  • what was said before. And the vocabulary we have to work with certainly isn't the

    我們用的詞語當然不是單純的隨機命名。

  • result of purely random naming. For example, the monkey typing model can't

    舉例來說,猴子打字的模式,

  • explain why even the names of the elements, the planets and the days of the

    並不能解釋為什麼連元素名稱、行星名稱和星期幾的名稱

  • week are used in language according to Zipf's law. Sets like these are constrained

    在語言中的使用也是依照齊夫定律。像是這樣的資料文集,都是取決於自然世界,

  • by the natural world and they're not the result of us randomly segmenting the

    他們並不是因為我們隨機將世界分割成語言標籤的結果。

  • world into labels. Furthermore, when given a list of novel words, words they've

    此外,當人們拿到一份全新的字詞清單,也就是他們從未聽過或用過的字,

  • never heard or used before, like when prompted to write a story about alien

    例如被要求寫一個關於名字古怪的外星生物的故事時,

  • creatures with strange names, people will naturally tend to use the name of one

    人們自然會傾向讓某個外星人名字的使用次數是另一個的2倍,

  • alien twice as often as another, three times as often as another... Zipf's law appears to

    又是另一個的3倍……齊夫定律似乎

  • be built into our brains. Perhaps there is something about the way thoughts and

    本來就深植在我們的腦中。這或許有某種機制存在,將人類思路

  • topics of discussion ebb and flow that contributes to Zipf's law.

    和消長的討論主題歸納到齊夫定律之中。

  • Another way 'Zipf-ian' distributions occur is via processes that change

    另一個齊夫分布的發生,

  • according to how they've previously operated. These are called preferential

    是因為流程會根據它們先前被操作的狀態而改變。這就是所謂的有傾向的

  • attachment processes. They occur when something - money, views,

    依戀流程。它們會發生,在金錢、視野、

  • attention, variation, friends, jobs, anything really is given out according

    注意力、變化性、朋友、工作,任何可以根據迷戀程度的多寡來決定的事物。

  • to how much is already possessed. To go back to the carpet example, if most

    回到地毯的例子,

  • people walk from the living room to the kitchen across a certain path, furniture

    如果大多數人都用某種路線從客廳走到廚房,

  • will be placed elsewhere, making that path even more popular. The more views

    家具就會被放在路線以外的其他地方,使得行走路線更加通暢。越多人觀看

  • a video or image or post has, the more likely it is to get recommended

    一個影片、圖片或貼文,那這些影片、貼文就更有可能被自動推薦,

  • automatically or make the news for having so many views, both of which give

    或因為瀏覽人次眾多而被新聞報導,而這兩種方式都會使得它有更多瀏覽人次。

  • it more views.

  • It's like a snowball rolling down a snowy hill. The more snow it accumulates, the

    這就像是從充滿雪的山丘向下滾雪球一般。累積越多雪,

  • bigger its surface area becomes for collecting more and the faster it grows.

    雪球表面就越大,因為它收集到更多雪也更快變得很大。

  • There doesn't have to be a deliberate choice driving a preferential attachment

    並不需要故意選擇來驅使傾向依戀流程發生。

  • process. It can happen naturally. Try this. Take a bunch of paper clips and grab any

    它可以自然地發生。試試看這個吧!拿一把迴紋針,從中任意抓取兩個出來,

  • two at random.

  • Link them together and then throw them back in the pile. Now, repeat over and

    將它們連結在一起,再把他丟到剛剛的那堆迴紋針,如此反覆再反覆。

  • over again. If you grab paper clips that are already part of a chain, link them anyway.

    如果你抓到那些迴紋針鍊的一部分,還是照樣把他們連起來。
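
The paper clip procedure just described can be sketched as a small simulation. It tracks chains with a union-find structure. Because clips are picked uniformly, a longer chain is more likely to be grabbed, which is what makes this a preferential attachment process. The clip count, link count, and seed are arbitrary illustrative values. Skipping a pair that already sits in the same chain is an assumption of this sketch, not something stated in the video.

```python
import random
from collections import Counter

def link_paper_clips(n_clips, n_links, seed=0):
    """Repeatedly grab two random clips and join the chains they belong to.
    Returns the chain sizes, largest first."""
    rng = random.Random(seed)
    parent = list(range(n_clips))   # union-find forest: one root per chain

    def find(clip):
        while parent[clip] != clip:
            parent[clip] = parent[parent[clip]]   # path halving
            clip = parent[clip]
        return clip

    for _ in range(n_links):
        a, b = rng.sample(range(n_clips), 2)   # two distinct clips
        root_a, root_b = find(a), find(b)
        if root_a != root_b:   # assumption: same-chain pairs change nothing
            parent[root_a] = root_b

    sizes = Counter(find(clip) for clip in range(n_clips))
    return sorted(sizes.values(), reverse=True)

chains = link_paper_clips(n_clips=1000, n_links=400)
print("largest chains:", chains[:10])
print("(chain size, how many):", sorted(Counter(chains).items()))
```

How closely the resulting sizes resemble a Zipf-like curve depends on how long the linking runs. Link for too long and everything merges into one chain.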

  • More often than not after a while you will have a distribution that looks

    通常過一段時間後,你將會有一個分布圖看起來很齊夫。

  • 'Zipf-ian.' A small number of