  • [Sebastian Thrun] So what's your take on how to build a search engine,

  • you've built one before, right?

  • [Sergey Brin - Co-Founder, Google] Yes. I think the most important thing

  • if you're going to build a search engine

  • is to have a really good corpus to start out with.

  • In our case we used the World Wide Web, which at the time was certainly smaller than it is today.

  • But it was also very new and exciting.

  • There were all sorts of unexpected things there.

  • [David Evans] So the goal for the first three units of the course is to build that corpus.

  • And we want to build the corpus for our search engine

  • by crawling the web, and that's what a web crawler does.

  • A web crawler is a program that collects content from the web.

  • If you think of a web page that you see in your browser, you have a page like this.

  • And we'll use the Udacity site as an example web page.

  • It has lots of content: it has some images, it has some text.

  • All of this comes into your browser when you request the page.

  • The important thing that it has is links.

  • And a link is something that goes to another page.

  • So we have a link to the frequently asked questions,

  • we have a link to the CS 101 page.

  • There are some other links on the page.

  • And that link may show in your browser with an underline,

  • or it may not, depending on how your browser is set.

  • But the important thing that it does

  • is act as a pointer to some other web page.

  • And those other web pages may also have links,

  • so we have another link on this page.

  • Maybe it's on my name, and you can follow it to my home page.

  • And all the pages that we can find with our web crawler

  • are found by following the links.

  • So it won't necessarily find every page on the web.

  • If we start with a good seed page,

  • we'll find lots of pages, though.

  • And what the crawler's going to do is start with one page,

  • find all the links on that page, follow them to find other pages,

  • and then on those other pages it will follow the links on those pages

  • to find other pages, and there will be lots more links on those pages.

  • And eventually we'll have a collection of lots of pages on the web.

  • So that's what we want to do to build a web crawler.

  • We want to find some way to start from one seed page,

  • extract the links on that page,

  • follow those links to other pages,

  • then collect the links on those other pages,

  • follow them, collect all that.
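
The crawl described here (start from a seed page, follow its links, then follow the links on the pages found) can be sketched on a toy example. This is only an illustration under stated assumptions, not the course's own code: `TOY_WEB`, the page names, and the `crawl` function are all hypothetical, and the "web" is an in-memory dictionary rather than real pages.

```python
from collections import deque

# Hypothetical miniature "web": each page name maps to the links found on it.
TOY_WEB = {
    "seed": ["faq", "cs101"],
    "faq": ["seed"],
    "cs101": ["faq", "instructor"],
    "instructor": [],
    "orphan": ["seed"],  # no page links here, so a crawl from "seed" never finds it
}


def crawl(seed, links_on_page):
    """Return every page reachable from seed, in the order first visited."""
    to_visit = deque([seed])  # pages found but not yet processed
    seen = {seed}             # pages already queued, so none is visited twice
    crawled = []
    while to_visit:
        page = to_visit.popleft()
        crawled.append(page)
        for link in links_on_page.get(page, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return crawled


print(crawl("seed", TOY_WEB))  # ['seed', 'faq', 'cs101', 'instructor']
```

The `orphan` page is never reached, which matches the point that a crawler only finds pages it can get to by following links from the seed.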

  • So that sounds like a lot to do.

  • We're not going to do all that in this first class.

  • What we're going to do in this first unit is just extract a link.

  • So we're going to start with a bunch of text.

  • It's going to have a link in it with a URL.

  • What we want to find is that URL,

  • so we can request the next page.
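
As a rough sketch of that first step, the snippet below pulls one URL out of a string of text. It assumes the link is written as an HTML anchor of the form `<a href="...">` with double quotes, which the lecture does not spell out; the function name `get_first_link` and the example URL are hypothetical.

```python
def get_first_link(page):
    """Return the URL of the first link in page, or None if there is none.

    Assumes links look like <a href="URL"> with double-quoted URLs.
    """
    start_link = page.find('<a href=')
    if start_link == -1:
        return None  # no link tag in the text
    start_quote = page.find('"', start_link)
    if start_quote == -1:
        return None  # tag found, but no opening quote after it
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None  # opening quote never closed
    return page[start_quote + 1:end_quote]


text = 'Welcome! See the <a href="http://example.com/cs101">CS 101 page</a>.'
print(get_first_link(text))  # http://example.com/cs101
```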

  • The goal for the second unit

  • is to be able to keep going.

  • If there are many links on one page, you will want to be able to find them all.

  • So that's what we'll do in unit two:

  • figure out how to keep going to extract all those links.
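
One way to "keep going" is to repeat the search from where the previous link ended until no more links are found. The sketch below makes the same assumption as before, namely that links appear as `<a href="...">` with double quotes; `get_all_links` and the example URLs are hypothetical names chosen for illustration.

```python
def get_all_links(page):
    """Return the URLs of all links in page, in the order they appear.

    Assumes links look like <a href="URL"> with double-quoted URLs.
    """
    marker = '<a href="'
    links = []
    position = 0
    while True:
        start_link = page.find(marker, position)
        if start_link == -1:
            break  # no more links after this position
        start_url = start_link + len(marker)
        end_url = page.find('"', start_url)
        if end_url == -1:
            break  # unterminated URL, stop scanning
        links.append(page[start_url:end_url])
        position = end_url + 1  # continue searching after this link
    return links


text = ('<a href="http://example.com/faq">FAQ</a> and '
        '<a href="http://example.com/cs101">CS 101</a>')
for url in get_all_links(text):
    print(url)
```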

  • In unit three, well, we want to go beyond just one page.

  • So by the end of unit two we can print out all the links on one page.

  • For unit three we want to collect all those links, so we can keep going

  • and end up with our crawler collecting many, many pages.

  • So by the end of unit three we'll have built a web crawler.

  • We'll have a way of building our corpus.

  • Then the remaining three units will look at how to actually respond to queries.

  • So in unit four we'll figure out how to give a good response.

  • So if you search for a keyword, you want to get a response that's a list of the pages

  • where that keyword appears.
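
A minimal sketch of that kind of response is shown below: a dictionary that maps each keyword to the list of pages where it appears. The lecture does not say how the index is built, so this structure, the function names `build_index` and `lookup`, and the sample pages are all assumptions made for illustration.

```python
def build_index(pages):
    """Map each lowercase word to the list of page URLs containing it.

    pages is a dict of URL -> page text (a hypothetical crawled corpus).
    """
    index = {}
    for url, text in pages.items():
        # set() so a page is listed once per word, however often it repeats
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(url)
    return index


def lookup(index, keyword):
    """Return the pages where keyword appears, or an empty list."""
    return index.get(keyword.lower(), [])


corpus = {
    "http://example.com/a": "web crawler basics",
    "http://example.com/b": "search the web",
    "http://example.com/c": "ranking pages",
}
index = build_index(corpus)
print(lookup(index, "web"))  # ['http://example.com/a', 'http://example.com/b']
```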

  • And in unit five we'll figure out a way to do that that scales if we have a large corpus.

  • And then in unit six what we want to do is, well, we don't just want to find a list,

  • we want to find the best one.

  • So we'll figure out how to rank all the pages where that keyword appears.

  • So we're getting a little ahead of ourselves now,

  • because all we're going to do for unit one

  • is figure out how to extract a link from the page.

  • And the search engine that we'll build at the end of this

  • will be a functional search engine.

  • It will have the main components that a search engine like Google has.

  • It certainly won't be as powerful as Google is;

  • we want to keep things simple.

  • We want to have a small amount of code to write.

  • And we should remember that our real goal

  • is not so much to build a search engine,

  • but to use the goal of building a search engine as a vehicle

  • for learning about computer science

  • and learning about programming,

  • so that the things we learn by doing this

  • will allow us to solve lots and lots of other problems.
