Web Crawler - CS101 - Udacity

[Sebastian Thrun] So what's your take on how to build a search engine? You've built one before, right?

[Sergey Brin - Co-Founder, Google] Yes. I think the most important thing if you're going to build a search engine is to have a really good corpus to start out with. In our case we used the World Wide Web, which at the time was certainly smaller than it is today. But it was also very new and exciting. There were all sorts of unexpected things there.

[David Evans] So the goal for the first three units of the course is to build that corpus. We want to build the corpus for our search engine by crawling the web, and that's what a web crawler does. A web crawler is a program that collects content from the web.

If you think of a web page that you see in your browser, you have a page like this. We'll use the Udacity site as an example web page. It has lots of content: it has some images, it has some text. All of this comes into your browser when you request the page.

The important thing that it has is links. A link is something that goes to another page. So we have a link to the frequently asked questions, and we have a link to the CS 101 page. There are some other links on the page. A link may show in your browser with an underline, or it may not, depending on how your browser is set. But the important thing is that it's a pointer to some other web page. Those other web pages may also have links, so we have another link on this page. Maybe it's on my name, and you can follow it to my home page.

All the pages that we can find with our web crawler are found by following the links. So it won't necessarily find every page on the web. If we start with a good seed page, though, we'll find lots of pages.

What the crawler is going to do is start with one page, find all the links on that page, and follow them to find other pages. Then on those other pages it will follow the links to find still more pages, and there will be lots more links on those pages. Eventually we'll have a collection of lots of pages on the web.

So that's what we want to do to build a web crawler. We want to find some way to start from one seed page, extract the links on that page, follow those links to other pages, then collect the links on those other pages, follow them, and collect all that.

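The crawl described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's implementation: the "web" is a hypothetical in-memory dictionary (`TOY_WEB`) mapping made-up page names to the pages they link to, and the names `crawl` and `toy_get_links` are invented for this example.

```python
from collections import deque

# Hypothetical toy "web": each page name maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # no page links here
}


def toy_get_links(page):
    """Stand-in for 'request the page and extract its links'."""
    return TOY_WEB.get(page, [])


def crawl(seed, get_links):
    """Return every page reachable from the seed page by following links."""
    to_crawl = deque([seed])  # pages found but not yet visited
    seen = {seed}             # pages already found, so none is visited twice
    crawled = []              # pages visited, in visit order
    while to_crawl:
        page = to_crawl.popleft()
        crawled.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled


print(crawl("a.example", toy_get_links))
```

Starting from the seed `a.example`, the sketch reaches `b.example`, `c.example`, and `d.example`, but never `e.example`, because no page links to it. That is the sense in which a crawler won't necessarily find every page on the web.
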
So that sounds like a lot to do. We're not going to do all that in this first class. What we're going to do in this first unit is just extract a link. We're going to start with a bunch of text. It's going to have a link in it with a URL. What we want to find is that URL, so we can request the next page.

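As a rough sketch of what extracting one link from a bunch of text might look like, the Python below searches for the first `<a href="` marker and returns the URL between the quotes. It assumes links are written in exactly that form, with double quotes; the function name `get_first_link` and the sample text are made up for illustration.

```python
def get_first_link(text):
    """Return the URL of the first link in text, or None if there is none."""
    marker = '<a href="'
    tag = text.find(marker)
    if tag == -1:
        return None  # no link in this text
    url_start = tag + len(marker)
    url_end = text.find('"', url_start)  # closing quote of the URL
    if url_end == -1:
        return None  # malformed link: no closing quote
    return text[url_start:url_end]


# Hypothetical sample text containing a single link.
sample = 'Welcome! See <a href="http://example.com/cs101">CS 101</a> for details.'
print(get_first_link(sample))
```

Real web pages write links in more varied ways (single quotes, extra attributes, different spacing), so a production crawler would normally use an HTML parser rather than plain string search.
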
The goal for the second unit is to be able to keep going. If there are many links on one page, you will want to be able to find them all. So what we'll do in unit 2 is figure out how to keep going to extract all those links.

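"Keeping going" can be pictured as a loop that repeats the single-link search, each time resuming just past the link it found. The sketch below assumes links appear as `<a href="...">` with double quotes; `get_all_links` and the sample page are hypothetical names and data used only for illustration.

```python
def get_all_links(text):
    """Return the URLs of all links in text, in the order they appear."""
    marker = '<a href="'
    links = []
    pos = 0  # where the next search starts
    while True:
        tag = text.find(marker, pos)
        if tag == -1:
            break  # no more links
        url_start = tag + len(marker)
        url_end = text.find('"', url_start)
        if url_end == -1:
            break  # malformed link: no closing quote
        links.append(text[url_start:url_end])
        pos = url_end + 1  # resume the search after this link
    return links


# Hypothetical sample page with three links.
page = (
    '<a href="http://example.com/faq">FAQ</a> '
    '<a href="http://example.com/cs101">CS 101</a> '
    '<a href="http://example.com/about">About</a>'
)
for url in get_all_links(page):
    print(url)
```
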
In unit three, well, we want to go beyond just one page. By the end of unit two we can print out all the links on one page. For unit 3 we want to collect all those links, so we can keep going and end up allowing our crawler to collect many, many pages. So by the end of unit three we'll have built a web crawler. We'll have a way of building our corpus.

Then the remaining three units will look at how to actually respond to queries. In unit four we'll figure out how to give a good response. If you search for a keyword, you want to get a response that's a list of the pages where that keyword appears.

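One simple way to picture such a response is an index that maps each keyword to the list of pages containing it. The sketch below is an assumption-laden illustration rather than the course's design: the corpus is a small made-up dictionary of page text, words are split on whitespace only, and `build_index` and `lookup` are hypothetical names.

```python
def build_index(pages):
    """Map each lowercase word to the list of pages in which it appears."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            urls = index.setdefault(word, [])
            if url not in urls:  # list each page at most once per word
                urls.append(url)
    return index


def lookup(index, keyword):
    """Return the list of pages where keyword appears (empty if none)."""
    return index.get(keyword.lower(), [])


# Hypothetical tiny corpus: page name -> page text.
corpus = {
    "a.example": "learn to build a crawler",
    "b.example": "build a search engine",
    "c.example": "search the web",
}
index = build_index(corpus)
print(lookup(index, "build"))
print(lookup(index, "search"))
```

Using a dictionary here means a lookup does not have to scan every page for every query, which matters more as the corpus grows.
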
In unit five we'll figure out a way to do that that scales if we have a large corpus. And then in unit six, well, we don't just want to find a list, we want to find the best one. So we'll figure out how to rank all the pages where that keyword appears.

So we're getting a little ahead of ourselves now, because all we're going to do for unit one is figure out how to extract a link from the page.

The search engine that we'll build at the end of this will be a functional search engine. It will have the main components that a search engine like Google has. It certainly won't be as powerful as Google; we want to keep things simple. We want to have a small amount of code to write.

And we should remember that our real goal is not so much to build a search engine, but to use the goal of building a search engine as a vehicle for learning about computer science and learning about programming, so that the things we learn by doing this will allow us to solve lots and lots of other problems.