Erez Lieberman Aiden: Everyone knows
Erez Lieberman Aiden:大家都知道
-
that a picture is worth a thousand words.
一張圖勝過千言萬語
-
But we at Harvard
但我們在哈佛時
-
were wondering if this was really true.
卻在思考這道理是否真是如此
-
(Laughter)
(笑聲)
-
So we assembled a team of experts,
所以我們由來自哈佛大學
-
spanning Harvard, MIT,
麻省理工學院
-
The American Heritage Dictionary, The Encyclopedia Britannica
美國傳統英語詞典,大英百科全書
-
and even our proud sponsors,
甚至我們偉大的贊助商─Google的專家們
-
the Google.
組成一個團隊
-
And we cogitated about this
我們花了四年的時間
-
for about four years.
在思考這個問題
-
And we came to a startling conclusion.
然後我們得到了一個驚人的結論
-
Ladies and gentlemen, a picture is not worth a thousand words.
女士先生們,一張圖片其實不只勝過千言萬語
-
In fact, we found some pictures
事實上,我們發現某些圖片
-
that are worth 500 billion words.
更是勝過五千億個字
-
Jean-Baptiste Michel: So how did we get to this conclusion?
Jean-Baptiste Michel:我們是如何得出這項結論的呢?
-
So Erez and I were thinking about ways
Erez和我思考了不同的方式
-
to get a big picture of human culture
想更加了解人類文化
-
and human history: change over time.
以及人類歷史從古到今的變化的全景
-
So many books actually have been written over the years.
事實上,多年來已經出版了許多書籍
-
So we were thinking, well the best way to learn from them
所以我們認為最好的學習方式
-
is to read all of these millions of books.
就是將這上百萬的書全讀過一遍
-
Now of course, if there's a scale for how awesome that is,
如果能有一個尺規來說明此舉的驚人程度
-
that has to rank extremely, extremely high.
這將會相當驚人
-
Now the problem is there's an X-axis for that,
但問題是這裡的X軸
-
which is the practical axis.
是表示實用程度
-
This is very, very low.
這相當不實用
-
(Applause)
(掌聲)
-
Now people tend to use an alternative approach,
現在人們希望用別的方式
-
which is to take a few sources and read them very carefully.
可以讀少一點書,但讀得非常仔細
-
This is extremely practical, but not so awesome.
這會相當實用,但這一點都不吸引人
-
What you really want to do
我們真正想做的是
-
is to get to the awesome yet practical part of this space.
要用一種吸引人且實用的方法來閱讀這些書
-
So it turns out there was a company across the river called Google
所以在河的對岸有間公司叫做Google
-
who had started a digitization project a few years back
他們幾年之前開始了一項數位化計畫
-
that might just enable this approach.
這項計畫讓我們能實踐剛說的方法
-
They have digitized millions of books.
他們已將數百萬本書給數位化
-
So what that means is, one could use computational methods
這意味著,我們可以透過電腦
-
to read all of the books in a click of a button.
簡單按個按鈕就能閱讀所有的書
-
That's very practical and extremely awesome.
這非常實用而且相當棒
-
ELA: Let me tell you a little bit about where books come from.
ELA:讓我為各位介紹這些書都來自何方
-
Since time immemorial, there have been authors.
自古以來,有非常多作家
-
These authors have been striving to write books.
這些作家一直努力寫作
-
And this became considerably easier
但現在寫作變得相當容易
-
with the development of the printing press some centuries ago.
這歸功於幾世紀前印刷術的革新
-
Since then, the authors have won
自那時起作家們
-
on 129 million distinct occasions,
已有一億兩千九百萬次成功地
-
publishing books.
出版書籍
-
Now if those books are not lost to history,
如果那些書沒有因為時代交替而遺失
-
then they are somewhere in a library,
那麼那些書可能在某個圖書館的一處
-
and many of those books have been getting retrieved from the libraries
有相當多書可以從圖書館中被借閱
-
and digitized by Google,
由Google將其數位化
-
which has scanned 15 million books to date.
迄今Google已經掃描了一千五百萬本書
-
Now when Google digitizes a book, they put it into a really nice format.
Google將一本書數位化,並以優良的型式呈現
-
Now we've got the data, plus we have metadata.
現在我們有了這些數據,加上這些詮釋資料
-
We have information about things like where was it published,
我們有了相關的資訊,比如出版地區,
-
who was the author, when was it published.
作者,出版時間
-
And what we do is go through all of those records
我們所做的就是透過這些記錄
-
and exclude everything that's not the highest quality data.
並剔除不是最精華的資料
-
What we're left with
我們後來得到的是
-
is a collection of five million books,
五百萬本書
-
500 billion words,
五千億個詞
-
a string of characters a thousand times longer
這是一串比人類基因組
-
than the human genome --
還要長上一千倍的字符
-
a text which, when written out,
如果寫成文章
-
would stretch from here to the Moon and back
將會是從這裡到月球來回距離
-
10 times over --
的十倍以上
-
a veritable shard of our cultural genome.
這是我們文化基因組名副其實的一塊碎片
-
Of course what we did
當然當我們面臨
-
when faced with such outrageous hyperbole ...
如此誇張的情況時
-
(Laughter)
(笑聲)
-
was what any self-respecting researchers
我們也跟每一位有自尊心的研究人員一樣
-
would have done.
會做相同的事
-
We took a page out of XKCD,
我們借鏡了網路漫畫XKCD
-
and we said, "Stand back.
我們決定「等等
-
We're going to try science."
我們要用科學的方式來處理。」
-
(Laughter)
(笑聲)
-
JM: Now of course, we were thinking,
JM:當然,我們在思考
-
well let's just first put the data out there
首先我們先把資料提取出來
-
for people to do science to it.
讓其他人以科學的方式去分析
-
Now we're thinking, what data can we release?
現在我們在思考,我們能發行何種數據?
-
Well of course, you want to take the books
當然,我們想拿這些書
-
and release the full text of these five million books.
將這五百萬本書的內容全部釋出
-
Now Google, and Jon Orwant in particular,
現在Google,特別是Jon Orwant
-
told us a little equation that we should learn.
告訴我們一個我們該注意的小方程式
-
So you have five million, that is, five million authors
我們有五百萬本書,也就是有五百萬名作者
-
and five million plaintiffs is a massive lawsuit.
而五百萬名原告是一場龐大的訴訟
-
So, although that would be really, really awesome,
雖然這個過程是相當地驚人
-
again, that's extremely, extremely impractical.
但這還是極度的不切實際
-
(Laughter)
(笑聲)
-
Now again, we kind of caved in,
然後,我們似乎有點妥協
-
and we did the very practical approach, which was a bit less awesome.
我們試了比較實際的方式,這方法不怎麼吸引人
-
We said, well instead of releasing the full text,
我們認為,與其釋出全部的書籍資料
-
we're going to release statistics about the books.
我們選擇將這些書的數據資料給呈現出來
-
So take for instance "A gleam of happiness."
舉個例子「幸福的光」
-
It's four words; we call that a four-gram.
這是四個字,我們稱做「四字詞」
-
We're going to tell you how many times a particular four-gram
我們要告訴各位一個特定的四字詞
-
appeared in books in 1801, 1802, 1803,
從1801,1802,1803年開始出現在書本裡
-
all the way up to 2008.
直到2008年
-
That gives us a time series
這給我們一個時間軸來了解
-
of how frequently this particular sentence was used over time.
這些特定的字句從過去到現在的使用頻率
-
We do that for all the words and phrases that appear in those books,
我們計算了所有出現在這些書中的字詞
-
and that gives us a big table of two billion lines
彙整成一張有二十億行的大表格
-
that tell us about the way culture has been changing.
這告訴了我們文化是如何改變的
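The counting step JM describes (how many times a given four-gram appears in the books of each year) can be sketched in a few lines. This is only an illustration under stated assumptions: the three-book corpus is invented, the helper names `ngrams` and `ngram_time_series` are hypothetical, and splitting on whitespace stands in for whatever tokenization the real project used.

```python
from collections import Counter, defaultdict


def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def ngram_time_series(books, n):
    """Map each n-gram to a {year: count} table.

    `books` is a list of (year, text) pairs, a toy stand-in for the corpus.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text, n):
            table[gram][year] += 1
    return table


# Hypothetical three-book corpus, invented for illustration only.
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
    (1803, "no such phrase appears here"),
]

series = ngram_time_series(toy_books, 4)
row = series["a gleam of happiness"]
# One row of the table: the four-gram's count in each year.
print([(year, row[year]) for year in (1801, 1802, 1803)])
```

Each key of `series` plays the role of one line in the big table: an n-gram followed by its count per year. Dividing each count by the total number of n-grams published that year would turn raw counts into frequencies.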
-
ELA: So those two billion lines,
ELA:這二十億行資料
-
we call them two billion n-grams.
我們稱為二十億組詞
-
What do they tell us?
這告訴了我們
-
Well the individual n-grams measure cultural trends.
每一組詞代表了不同的文化趨勢
-
Let me give you an example.
讓我舉個例子
-
Let's suppose that I am thriving,
假設我做了件不得了的事
-
then tomorrow I want to tell you about how well I did.
明天我要告訴你是多不得了
-
And so I might say, "Yesterday, I throve."
我可能會說「"Yesterday, I throve."」
-
Alternatively, I could say, "Yesterday, I thrived."
或者,我也可以說「"Yesterday, I thrived."」
-
Well which one should I use?
但我應該說哪一種呢?
-
How to know?
要怎麼知道
-
As of about six months ago,
大概在六個月前
-
the state of the art in this field
要知道這一領域最尖端的方法
-
is that you would, for instance,
你可能得要去詢問
-
go up to the following psychologist with fabulous hair,
一位有著時髦髮型的心理學家
-
and you'd say,
你可能會問
-
"Steve, you're an expert on the irregular verbs.
「史蒂夫,你是不規則動詞的專家。
-
What should I do?"
我該怎麼說呢?」
-
And he'd tell you, "Well most people say thrived,
而他會告訴你「嗯,大部分的人會說"thrived"
-
but some people say throve."
但有些人會說"throve"。」
-
And you also knew, more or less,
而你也或多或少知道
-
that if you were to go back in time 200 years
如果我們回到兩百年前
-
and ask the following statesman with equally fabulous hair,
去問一位同樣也有時髦髮型的政治家
-
(Laughter)
(笑聲)
-
"Tom, what should I say?"
「湯姆,我應該怎麼說呢?」
-
He'd say, "Well, in my day, most people throve,
他說「嗯,在我的年代,大部份的人說"throve",
-
but some thrived."
但少部分的人說"thrived"」
-
So now what I'm just going to show you is raw data.
現在我要向各位展示原始數據
-
Two rows from this table of two billion entries.
這二十億條目資料中的其中兩條數據
-
What you're seeing is year by year frequency
各位將會看到的是"thrived"和"throve"兩個字
-
of "thrived" and "throve" over time.
在各年時期的出現頻率
-
Now this is just two
這只是二十億筆資料中
-
out of two billion rows.
其中兩個詞條的資訊
-
So the entire data set
這全部的數據資料
-
is a billion times more awesome than this slide.
將會比此張投影片還要驚人億萬倍
-
(Laughter)
(笑聲)
-
(Applause)
(掌聲)
-
JM: Now there are many other pictures that are worth 500 billion words.
JM:還有其他圖片也具有五千億字的價值
-
For instance, this one.
例如這張
-
If you just take influenza,
如果談到流感
-
you will see peaks at the time where you knew
從這幾個高峰點我們可以知道
-
big flu epidemics were killing people around the globe.
流感大流行在全球造成人類死亡
-
ELA: If you were not yet convinced,
ELA:如果各位還不太相信
-
sea levels are rising,
其他像是海平面升高
-
so is atmospheric CO2 and global temperature.
大氣中的二氧化碳和全球暖化
-
JM: You might also want to have a look at this particular n-gram,
JM:你也許會想看看這組特別的詞組
-
and that's to tell Nietzsche that God is not dead,
「告訴尼采,上帝還沒死」
-
although you might agree that he might need a better publicist.
也許你可能還會認為,他可能需要一個更好的公關
-
(Laughter)
(笑聲)
-
ELA: You can get at some pretty abstract concepts with this sort of thing.
ELA:從這當中,各位也能獲得一些相當抽象的概念
-
For instance, let me tell you the history
例如,讓我跟各位說說
-
of the year 1950.
有關「1950年」的歷史
-
Pretty much for the vast majority of history,
幾乎在絕大多數的歷史裡
-
no one gave a damn about 1950.
沒有特別談論1950這一年
-
In 1700, in 1800, in 1900,
在1700年,在1800年,1900年
-
no one cared.
沒有人在乎
-
Through the 30s and 40s,
甚至到30年代和40年代
-
no one cared.
也沒有人在談論
-
Suddenly, in the mid-40s,
突然到了40年代中期
-
there started to be a buzz.
開始出現了風潮
-
People realized that 1950 was going to happen,
人們意識到1950年就要來臨
-
and it could be big.
這是件大事
-
(Laughter)
(笑聲)
-
But nothing got people interested in 1950
但沒有什麼能比1950年本身
-
like the year 1950.
更讓人們對1950年感興趣
-
(Laughter)
(笑聲)
-
People were walking around obsessed.
人們開始對這一年著迷
-
They couldn't stop talking
大家無法停止談論
-
about all the things they did in 1950,
有關他們在1950年所做的一切
-
all the things they were planning to do in 1950,
所有他們計畫要在1950年所做的事
-
all the dreams of what they wanted to accomplish in 1950.
所有他們要在1950年完成的夢想
-
In fact, 1950 was so fascinating
事實上,1950年跟往後幾年相較
-
that for years thereafter,
是相當迷人的一年
-
people just kept talking about all the amazing things that happened,
人們不停談論所有發生在
-
in '51, '52, '53.
'51,'52,'53年的驚奇事件
-
Finally in 1954,
直到1954年
-
someone woke up and realized
有人驚覺而且意識到
-
that 1950 had gotten somewhat passé.
1950年已經變得過時了
-
(Laughter)
(笑聲)
-
And just like that, the bubble burst.
這一切就像泡沫破滅一樣
-
(Laughter)
(笑聲)
-
And the story of 1950
1950年的情況
-
is the story of every year that we have on record,
其實就是我們數據上每一個年份的情況一樣
-
with a little twist, because now we've got these nice charts.
稍微編排一下,我們有這些精美的圖表
-
And because we have these nice charts, we can measure things.
因為有這些不錯的圖表,我們就能計算
-
We can say, "Well how fast does the bubble burst?"
我們可以了解「風潮消逝的速度是多快?」
-
And it turns out that we can measure that very precisely.
結果就是我們能很精確測量出一份數據
-
Equations were derived, graphs were produced,
有了方程式,也有圖表
-
and the net result
最終的結果就是
-
is that we find that the bubble bursts faster and faster
談論年份的風潮一年比一年
-
with each passing year.
消退的更快
-
We are losing interest in the past more rapidly.
我們對過去失去興趣的速度越來越快
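One way to put a number on "how fast does the bubble burst" is a half-life: the number of years after the peak until mentions of a year fall to half their peak level. The talk does not spell out the equations, so the definition below is an assumption, the function name `half_life_after_peak` is hypothetical, and the frequencies are invented purely to exercise the code.

```python
def half_life_after_peak(series):
    """Years from the peak until frequency first drops to half the peak or less.

    `series` maps year -> relative frequency. Returns None if it never halves.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None


# Invented frequencies for the string "1950", for illustration only.
mentions_of_1950 = {
    1948: 1.0, 1949: 2.0, 1950: 8.0, 1951: 7.0,
    1952: 6.0, 1953: 5.0, 1954: 3.5, 1955: 3.0,
}

print(half_life_after_peak(mentions_of_1950))
```

Computing this value for the curve of every year on record and comparing the results across years is one way to test the claim that the bubble bursts faster with each passing year.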
-
JM: Now a little piece of career advice.
JM:這張圖是有關職業建議
-
So for those of you who seek to be famous,
對於那些想成名的人
-
we can learn from the 25 most famous political figures,
我們可以知道二十五位最有名的政治人物
-
authors, actors and so on.
作家、演員等等
-
So if you want to become famous early on, you should be an actor,
如果各位想在年輕時就成名,那麼各位應該要當演員
-
because then fame starts rising by the end of your 20s --
因為你的名氣會從二十歲後開始累積
-
you're still young, it's really great.
那時正值青春年華,會相當不錯
-
Now if you can wait a little bit, you should be an author,
如果各位有耐心一點,那麼就應該當個作家
-
because then you rise to very great heights,
因為各位就能攀上高峰
-
like Mark Twain, for instance: extremely famous.
成為像是馬克吐溫這樣有名望的作家
-
But if you want to reach the very top,
但如果各位想攀上最頂尖的位置
-
you should delay gratification
就得延後滿足自己的慾望
-
and, of course, become a politician.
然後當一位政治家
-
So here you will become famous by the end of your 50s,
那麼各位會在五十歲過後開始成名
-
and become very, very famous afterward.
然後你的名氣會在未來持續延續
-
So scientists also tend to get famous when they're much older.
科學家也往往是在老年時才成名
-
Like for instance, biologists and physicists
而生物學家和物理學家一樣
-
tend to be almost as famous as actors.
往往也是和演員一樣著名
-
One mistake you should not do is become a mathematician.
唯一不要做的職業就是變成數學家
-
(Laughter)
(笑聲)
-
If you do that,
如果各位真要做這行
-
you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
各位可能會想「太好了,當我在二十多歲時,我會盡一切努力。」
-
But guess what, nobody will really care.
但事實上,沒人會真正去在乎你所做的事
-
(Laughter)
(笑聲)
-
ELA: There are more sobering notes
ELA:在我們的資料裡
-
among the n-grams.
還有其他更發人省思的紀錄
-
For instance, here's the trajectory of Marc Chagall,
例如馬克‧夏卡爾的名字出現的頻率軌跡
-
an artist born in 1887.
夏卡爾是位1887年出生的藝術家
-
And this looks like the normal trajectory of a famous person.
這看起來是一位名人名字正常出現在書中的軌跡
-
He gets more and more and more famous,
他的名氣日益響亮
-
except if you look in German.
但如果看德國的數據就不是如此
-
If you look in German, you see something completely bizarre,
如果看德國的數據,會看到某部份是非常奇怪的
-
something you pretty much never see,
這是幾乎不太可能看到的
-
which is he becomes extremely famous
就是他變得非常有名
-
and then all of a sudden plummets,
卻突然在1933年至1945年間
-
going through a nadir between 1933 and 1945,
聲勢跌落谷底
-
before rebounding afterward.
又反彈回升
-
And of course, what we're seeing
當然我們看的出來
-
is the fact Marc Chagall was a Jewish artist
這是因為馬克‧夏卡爾是一位猶太裔藝術家
-
in Nazi Germany.
當時德國是納粹統治
-
Now these signals
這些指標
-
are actually so strong
事實上相當明確
-
that we don't need to know that someone was censored.
我們不需要知道有人在審查書籍
-
We can actually figure it out
我們能運用基本的信號運算方式
-
using really basic signal processing.
實際了解當時狀況
-
Here's a simple way to do it