  • Erez Lieberman Aiden: Everyone knows

    Erez Lieberman Aiden:大家都知道

  • that a picture is worth a thousand words.

    一張圖勝過千言萬語

  • But we at Harvard

    但我們在哈佛時

  • were wondering if this was really true.

    卻在思考這道理是否真是如此

  • (Laughter)

    (笑聲)

  • So we assembled a team of experts,

    所以我們由來自哈佛大學

  • spanning Harvard, MIT,

    麻省理工學院

  • The American Heritage Dictionary, The Encyclopedia Britannica

    美國傳統英語詞典,大英百科全書

  • and even our proud sponsors,

    甚至我們偉大的贊助商─Google的專家們

  • the Google.

    組成一個團隊

  • And we cogitated about this

    我們花了四年的時間

  • for about four years.

    在思考這個問題

  • And we came to a startling conclusion.

    然後我們得到了一個驚人的結論

  • Ladies and gentlemen, a picture is not worth a thousand words.

    女士先生們,一張圖片其實不只勝過千言萬語

  • In fact, we found some pictures

    事實上,我們發現某些圖片

  • that are worth 500 billion words.

    更是勝過五千億個字

  • Jean-Baptiste Michel: So how did we get to this conclusion?

    Jean-Baptiste Michel:我們是如何得出這項結論的呢?

  • So Erez and I were thinking about ways

    Erez和我思考了不同的方式

  • to get a big picture of human culture

    想更加了解人類文化

  • and human history: change over time.

    以及人類歷史從古到今的變化的全景

  • So many books actually have been written over the years.

    事實上,多年來已經出版了許多書籍。

  • So we were thinking, well the best way to learn from them

    所以我們認為最好的學習方式

  • is to read all of these millions of books.

    就是將這上百萬的書全讀過一遍

  • Now of course, if there's a scale for how awesome that is,

    如果能有一個尺規來說明此舉的驚人程度

  • that has to rank extremely, extremely high.

    這將會相當驚人

  • Now the problem is there's an X-axis for that,

    但問題是這裡的X軸

  • which is the practical axis.

    是表示實用程度

  • This is very, very low.

    這相當不實用

  • (Applause)

    (掌聲)

  • Now people tend to use an alternative approach,

    現在人們希望用別的方式

  • which is to take a few sources and read them very carefully.

    可以讀少一點書,但讀得非常仔細

  • This is extremely practical, but not so awesome.

    這會相當實用,但這一點都不吸引人

  • What you really want to do

    我們真正想做的是

  • is to get to the awesome yet practical part of this space.

    要用一種吸引人且實用的方法來閱讀這些書

  • So it turns out there was a company across the river called Google

    所以在河的對岸有間公司叫做Google

  • who had started a digitization project a few years back

    他們幾年之前開始了一項數字化計畫

  • that might just enable this approach.

    這項計畫讓我們能實踐剛說的方法

  • They have digitized millions of books.

    他們已將數百萬本書給數位化

  • So what that means is, one could use computational methods

    這意味著,我們可以透過電腦

  • to read all of the books in a click of a button.

    簡單按個按鈕就能閱讀所有的書

  • That's very practical and extremely awesome.

    這非常實用而且相當棒

  • ELA: Let me tell you a little bit about where books come from.

    ELA:讓我為各位介紹這些書都來自何方

  • Since time immemorial, there have been authors.

    自古以來,有非常多作家

  • These authors have been striving to write books.

    這些作家一直努力寫作

  • And this became considerably easier

    但現在寫作變得相當容易

  • with the development of the printing press some centuries ago.

    這歸功於幾世紀前印刷術的革新

  • Since then, the authors have won

    自那時起作家們

  • on 129 million distinct occasions,

    能在一億兩千九百萬個不同的地方

  • publishing books.

    出版書籍

  • Now if those books are not lost to history,

    如果那些書沒有因為時代交替而遺失

  • then they are somewhere in a library,

    那麼那些書可能在某個圖書館的一處

  • and many of those books have been getting retrieved from the libraries

    有相當多書可以從圖書館中被借閱

  • and digitized by Google,

    由Google將其數位化

  • which has scanned 15 million books to date.

    迄今Google已經掃描了一千五百萬本書

  • Now when Google digitizes a book, they put it into a really nice format.

    Google將一本書數位化,並以優良的型式呈現

  • Now we've got the data, plus we have metadata.

    現在我們有了這些數據,加上這些詮釋資料

  • We have information about things like where was it published,

    我們有了相關的資訊,比如出版地區,

  • who was the author, when was it published.

    作者,出版時間

  • And what we do is go through all of those records

    我們所做的就是透過這些記錄

  • and exclude everything that's not the highest quality data.

    並剔除不是最精華的資料

  • What we're left with

    我們後來得到的是

  • is a collection of five million books,

    五百萬本書

  • 500 billion words,

    五千億個詞

  • a string of characters a thousand times longer

    這是一串比人類基因組

  • than the human genome --

    還要長上一千倍的字符

  • a text which, when written out,

    如果寫成文章

  • would stretch from here to the Moon and back

    將會是從這裡到月球來回距離

  • 10 times over --

    的十倍以上

  • a veritable shard of our cultural genome.

    這是我們文化基因名副其實的一部分
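A rough back-of-the-envelope check of the two comparisons above, written as a sketch. The talk only gives the 500 billion word count; the average word length (six characters including the trailing space), the genome size (about 3.1 billion base pairs) and the Earth-Moon distance (about 384,400 km) are assumptions added here for illustration.

```python
# Figures from the talk
WORDS = 500e9                      # 500 billion words

# Assumptions, not from the talk
CHARS_PER_WORD = 6                 # average word plus one space
GENOME_BASE_PAIRS = 3.1e9          # approximate human genome length
EARTH_MOON_KM = 384_400            # average Earth-Moon distance

total_chars = WORDS * CHARS_PER_WORD

# How many times longer is the text than the genome, letter for letter?
genome_ratio = total_chars / GENOME_BASE_PAIRS

# How wide must each character be for the text to span ten round trips?
round_trips_mm = EARTH_MOON_KM * 2 * 10 * 1e6   # km to mm
mm_per_char = round_trips_mm / total_chars

print(f"text is about {genome_ratio:.0f}x the genome length")
print(f"each character would be about {mm_per_char:.1f} mm wide")
```

Under these assumptions the ratio lands close to a thousand and each character comes out a couple of millimetres wide, which is roughly the size of ordinary handwriting or print.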

  • Of course what we did

    當然當我們面臨

  • when faced with such outrageous hyperbole ...

    如此誇張的情況時

  • (Laughter)

    (笑聲)

  • was what any self-respecting researchers

    我們也跟每一位有自尊心的研究人員一樣

  • would have done.

    會做相同的事

  • We took a page out of XKCD,

    我們也和四格漫畫一樣

  • and we said, "Stand back.

    我們決定「等等

  • We're going to try science."

    我們要用科學的方式來處理。」

  • (Laughter)

    (笑聲)

  • JM: Now of course, we were thinking,

    JM:當然,我們在思考

  • well let's just first put the data out there

    首先我們先把資料提取出來

  • for people to do science to it.

    讓其他人以科學的方式去分析

  • Now we're thinking, what data can we release?

    現在我們在思考,我們能發行何種數據?

  • Well of course, you want to take the books

    當然,我們想拿這些書

  • and release the full text of these five million books.

    將這五百萬本書的內容全部釋出

  • Now Google, and Jon Orwant in particular,

    現在Google,特別是Jon Orwant

  • told us a little equation that we should learn.

    告訴我們一個我們該注意的小方程式

  • So you have five million, that is, five million authors

    我們有五百萬本書,也就是有五百萬名作者

  • and five million plaintiffs is a massive lawsuit.

    而五百萬名原告是一場龐大的訴訟

  • So, although that would be really, really awesome,

    雖然這個過程是相當地驚人

  • again, that's extremely, extremely impractical.

    但這還是極度的不切實際

  • (Laughter)

    (笑聲)

  • Now again, we kind of caved in,

    然後,我們似乎有點妥協

  • and we did the very practical approach, which was a bit less awesome.

    我們試了比較實際的方式,這方法不怎麼吸引人

  • We said, well instead of releasing the full text,

    我們認為,與其釋出全部的書籍資料

  • we're going to release statistics about the books.

    我們選擇將這些書的數據資料給呈現出來

  • So take for instance "A gleam of happiness."

    舉個例子「幸福的光」

  • It's four words; we call that a four-gram.

    這是四個字,我們稱做「四字詞」

  • We're going to tell you how many times a particular four-gram

    我們要告訴各位一個特定的四字詞

  • appeared in books in 1801, 1802, 1803,

    從1801,1802,1803年開始出現在書本裡

  • all the way up to 2008.

    直到2008年

  • That gives us a time series

    這給我們一個時間軸來了解

  • of how frequently this particular sentence was used over time.

    這些特定的字句從過去到現在的使用頻率

  • We do that for all the words and phrases that appear in those books,

    我們計算了所有出現在這些書中的字詞

  • and that gives us a big table of two billion lines

    彙整出的資料畫出了二十億條曲線

  • that tell us about the way culture has been changing.

    這告訴了我們文化是如何改變的
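The table described above can be sketched in a few lines. This is a minimal illustration, not the pipeline the speakers used: the toy corpus, the whitespace tokenizer, and the names `ngrams` and `ngram_table` are all hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def ngram_table(books, n):
    """Map each n-gram to a Counter of {year: number of occurrences}."""
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table

# Toy corpus of (publication year, text) pairs, invented for illustration.
toy_books = [
    (1801, "a gleam of happiness lit her face"),
    (1802, "he felt a gleam of happiness and a gleam of happiness again"),
    (1803, "no gleam was seen"),
]

table = ngram_table(toy_books, 4)

# One row of the table: the time series for a single four-gram.
series = [table["a gleam of happiness"][year] for year in (1801, 1802, 1803)]
print(series)  # [1, 2, 0]
```

Each key of `table` corresponds to one line of the big table in the talk; releasing only these counts, rather than the texts themselves, is what makes the data shareable.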

  • ELA: So those two billion lines,

    ELA:這二十億條曲線

  • we call them two billion n-grams.

    我們稱為二十億組詞

  • What do they tell us?

    這告訴了我們

  • Well the individual n-grams measure cultural trends.

    每一組詞代表了不同的文化趨勢

  • Let me give you an example.

    讓我舉個例子

  • Let's suppose that I am thriving,

    假設我做了件不得了的事

  • then tomorrow I want to tell you about how well I did.

    明天我要告訴你是多不得了

  • And so I might say, "Yesterday, I throve."

    我可能會說「"Yesterday, I throve."」

  • Alternatively, I could say, "Yesterday, I thrived."

    或者,我也可以說「"Yesterday, I thrived."」

  • Well which one should I use?

    但我應該說哪一種呢?

  • How to know?

    要怎麼知道

  • As of about six months ago,

    大概在六個月前

  • the state of the art in this field

    要知道這一領域最尖端的方法

  • is that you would, for instance,

    你可能得要去詢問

  • go up to the following psychologist with fabulous hair,

    一位有著時髦髮型的心理學家

  • and you'd say,

    你可能會問

  • "Steve, you're an expert on the irregular verbs.

    「史蒂夫,你是不規則動詞的專家。

  • What should I do?"

    我該怎麼說呢?」

  • And he'd tell you, "Well most people say thrived,

    而他會告訴你「嗯,大部分的人會說"thrived"

  • but some people say throve."

    但有些人會說"throve"。」

  • And you also knew, more or less,

    而你也或多或少知道

  • that if you were to go back in time 200 years

    如果我們回到兩百年前

  • and ask the following statesman with equally fabulous hair,

    去問一位同樣也有時髦髮型的政治家

  • (Laughter)

    (笑聲)

  • "Tom, what should I say?"

    「湯姆,我應該怎麼說呢?」

  • He'd say, "Well, in my day, most people throve,

    他說「嗯,在我的年代,大部份的人說"throve",

  • but some thrived."

    但少部分的人說"thrived"」

  • So now what I'm just going to show you is raw data.

    現在我要向各位展示原始數據

  • Two rows from this table of two billion entries.

    這二十億條目資料中的其中兩條數據

  • What you're seeing is year by year frequency

    各位將會看到的是"thrived"和"throve"兩個字

  • of "thrived" and "throve" over time.

    在各年時期的出現頻率
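A year-by-year frequency of this kind is usually a raw count divided by the total number of words printed that year, so that years with more books do not dominate. The sketch below assumes that normalization; the counts are made up for illustration and are not the real "thrived"/"throve" data.

```python
def relative_frequency(counts, totals):
    """Divide each year's raw count by the total words printed that year."""
    return {year: counts.get(year, 0) / totals[year] for year in totals}

# Invented numbers, for illustration only.
totals = {1800: 1_000_000, 1900: 4_000_000, 2000: 10_000_000}
thrived = {1800: 2, 1900: 40, 2000: 300}
throve = {1800: 6, 1900: 20, 2000: 10}

freq_thrived = relative_frequency(thrived, totals)
freq_throve = relative_frequency(throve, totals)

# Which form is more common in each year?
dominant = {
    year: "thrived" if freq_thrived[year] > freq_throve[year] else "throve"
    for year in totals
}
print(dominant)  # {1800: 'throve', 1900: 'thrived', 2000: 'thrived'}
```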

  • Now this is just two

    這只是二十億筆資料中

  • out of two billion rows.

    其中兩個詞條的資訊

  • So the entire data set

    這全部的數據資料

  • is a billion times more awesome than this slide.

    將會比此張投影片還要驚人億萬倍

  • (Laughter)

    (笑聲)

  • (Applause)

    (掌聲)

  • JM: Now there are many other pictures that are worth 500 billion words.

    JM:還有其他圖片也具有五千億字的價值

  • For instance, this one.

    例如這張

  • If you just take influenza,

    如果談到感冒

  • you will see peaks at the time where you knew

    從這幾個高峰點我們可以知道

  • big flu epidemics were killing people around the globe.

    感冒病毒的大流行在全球造成人類死亡

  • ELA: If you were not yet convinced,

    ELA:如果各位還不太相信

  • sea levels are rising,

    其他像是海平面升高

  • so is atmospheric CO2 and global temperature.

    大氣中的二氧化碳和全球暖化

  • JM: You might also want to have a look at this particular n-gram,

    JM:你也許會想看看這組特別的詞組

  • and that's to tell Nietzsche that God is not dead,

    「告訴尼采,上帝還沒死」

  • although you might agree that he might need a better publicist.

    也許你可能還會認為,他可能需要一個更好的公關

  • (Laughter)

    (笑聲)

  • ELA: You can get at some pretty abstract concepts with this sort of thing.

    ELA:從這當中,各位也能獲得一些相當抽象的概念

  • For instance, let me tell you the history

    例如,讓我跟各位說說

  • of the year 1950.

    有關「1950年」的歷史

  • Pretty much for the vast majority of history,

    幾乎在絕大多數的歷史裡

  • no one gave a damn about 1950.

    沒有特別談論1950這一年

  • In 1700, in 1800, in 1900,

    在1700年,在1800年,1900年

  • no one cared.

    沒有人在乎

  • Through the 30s and 40s,

    甚至到30年代和40年代

  • no one cared.

    也沒有人在談論

  • Suddenly, in the mid-40s,

    突然到了40年代中期

  • there started to be a buzz.

    開始出現了風潮

  • People realized that 1950 was going to happen,

    人們意識到1950年就要來臨

  • and it could be big.

    這是件大事

  • (Laughter)

    (笑聲)

  • But nothing got people interested in 1950

    但也沒有因此讓大眾對該年份產生興趣

  • like the year 1950.

    像是「那1950年」

  • (Laughter)

    (笑聲)

  • People were walking around obsessed.

    人們開始對這一年著迷

  • They couldn't stop talking

    大家無法停止談論

  • about all the things they did in 1950,

    有關他們在1950年所做的一切

  • all the things they were planning to do in 1950,

    所有他們計畫要在1950年所做的事

  • all the dreams of what they wanted to accomplish in 1950.

    所有他們要在1950年完成的夢想

  • In fact, 1950 was so fascinating

    事實上,1950年跟往後幾年相較

  • that for years thereafter,

    是相當迷人的一年

  • people just kept talking about all the amazing things that happened,

    人們不停談論所有發生在

  • in '51, '52, '53.

    '51,'52,'53年的驚奇事件

  • Finally in 1954,

    直到1954年

  • someone woke up and realized

    有人驚覺而且意識到

  • that 1950 had gotten somewhat passé.

    1950年已經變得過時了

  • (Laughter)

    (笑聲)

  • And just like that, the bubble burst.

    這一切就像泡沫破滅一樣

  • (Laughter)

    (笑聲)

  • And the story of 1950

    1950年的情況

  • is the story of every year that we have on record,

    其實就是我們數據上每一個年份的情況一樣

  • with a little twist, because now we've got these nice charts.

    稍微編排一下,我們有這些精美的圖表

  • And because we have these nice charts, we can measure things.

    因為有這些不錯的圖表,我們就能計算

  • We can say, "Well how fast does the bubble burst?"

    我們可以了解「風潮消逝的速度是多快?」

  • And it turns out that we can measure that very precisely.

    結果就是我們能很精確測量出一份數據

  • Equations were derived, graphs were produced,

    有了方程式,也有圖表

  • and the net result

    最終的結果就是

  • is that we find that the bubble bursts faster and faster

    談論年份的風潮一年比一年

  • with each passing year.

    消退的更快

  • We are losing interest in the past more rapidly.

    我們對於過去的興趣日漸消逝
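The talk does not say which equations were derived. One plausible way to put a number on "how fast the bubble bursts" is a half-life: the number of years after the peak until mentions fall to half the peak value. The sketch below assumes that definition; `half_life` and the sample series are hypothetical.

```python
def half_life(series):
    """Years from the peak until the value first falls to half the peak or below.

    `series` maps year -> frequency. Returns None if it never falls that far.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None

# Invented mention counts for the string "1950", for illustration only.
mentions_1950 = {
    1948: 10, 1949: 30, 1950: 100, 1951: 90,
    1952: 80, 1953: 70, 1954: 45, 1955: 30,
}
print(half_life(mentions_1950))  # 4
```

Computing this value for every year on record and plotting it against the year would show whether the half-life shrinks over time, which is the trend the speakers report.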

  • JM: Now a little piece of career advice.

    JM:這張圖是有關職業建議

  • So for those of you who seek to be famous,

    對於那些想成名的人

  • we can learn from the 25 most famous political figures,

    我們可以知道二十五位最有名的政治人物

  • authors, actors and so on.

    作家、演員等等

  • So if you want to become famous early on, you should be an actor,

    如果各位想在年輕時就成名,那麼各位應該要當演員

  • because then fame starts rising by the end of your 20s --

    因為你的名氣會從二十歲後開始累積

  • you're still young, it's really great.

    那時正值青春年華,會相當不錯

  • Now if you can wait a little bit, you should be an author,

    如果各位有耐心一點,那麼就應該當個作家

  • because then you rise to very great heights,

    因為各位就能攀上高峰

  • like Mark Twain, for instance: extremely famous.

    成為像是馬克吐溫這樣有名望的作家

  • But if you want to reach the very top,

    但如果各位想攀上最頂尖的位置

  • you should delay gratification

    就得延後滿足自己的慾望

  • and, of course, become a politician.

    然後當一位政治家

  • So here you will become famous by the end of your 50s,

    那麼各位會在五十歲過後開始成名

  • and become very, very famous afterward.

    然後你的名氣會在未來持續延續

  • So scientists also tend to get famous when they're much older.

    科學家也往往是在老年時才成名

  • Like for instance, biologists and physicists

    而生物學家和物理學家一樣

  • tend to be almost as famous as actors.

    往往也是和演員一樣著名

  • One mistake you should not do is become a mathematician.

    唯一不要做的職業就是變成數學家

  • (Laughter)

    (笑聲)

  • If you do that,

    如果各位真要做這行

  • you might think, "Oh great. I'm going to do my best work when I'm in my 20s."

    各位可能會想「太好了,當我在二十多歲時,我會盡一切努力。」

  • But guess what, nobody will really care.

    但事實上,沒人會真正去在乎你所做的事

  • (Laughter)

    (笑聲)

  • ELA: There are more sobering notes

    ELA:在我們的資料裡

  • among the n-grams.

    還有其他更發人省思的紀錄

  • For instance, here's the trajectory of Marc Chagall,

    例如馬克‧夏卡爾的名字出現的頻率軌跡

  • an artist born in 1887.

    夏卡爾是位1887年出生的藝術家

  • And this looks like the normal trajectory of a famous person.

    這看起來是一位名人名字正常出現在書中的軌跡

  • He gets more and more and more famous,

    他的名氣日益響亮

  • except if you look in German.

    但如果看德國的數據就不是如此

  • If you look in German, you see something completely bizarre,

    如果看德國的數據,會看到某部份是非常奇怪的

  • something you pretty much never see,

    這是幾乎不太可能看到的

  • which is he becomes extremely famous

    就是他變得非常有名

  • and then all of a sudden plummets,

    卻突然在1933年至1945年間

  • going through a nadir between 1933 and 1945,

    聲勢跌落谷底

  • before rebounding afterward.

    又反彈回升

  • And of course, what we're seeing

    當然我們看的出來

  • is the fact Marc Chagall was a Jewish artist

    這是因為馬克‧夏卡爾是一位猶太裔藝術家

  • in Nazi Germany.

    當時德國是納粹統治

  • Now these signals

    這些指標

  • are actually so strong

    事實上相當明確

  • that we don't need to know that someone was censored.

    我們不需要知道有人在審查書籍

  • We can actually figure it out

    我們能運用基本的信號運算方式

  • using really basic signal processing.

    實際了解當時狀況

  • Here's a simple way to do it