  • There are more pages on the web than people on Earth.

  • And while I haven’t checked, I am sure each one is full of original, high quality content

  • that would make our ancestors proud.

  • Most people access web pages through a browser, but as programmers we have other methods...

  • Today, we will learn how to use Python to send GET requests to web servers, and then

  • parse the response.

  • This way you can write software to read websites for you,

  • giving you more time to browse the internet.

  • In a browser, you access a web page by typing the URL in the address bar.

  • URL stands for “Uniform Resource Locator” and this string can hold a LOT of information.

  • At the beginning is the protocol, which is sometimes called the scheme.

  • Next is the host name.

  • Sometimes you will see a colon followed by a number.

  • That number is the port.

  • If the port is not explicitly specified, you can determine it from the protocol.

  • HTTP uses port 80, while HTTPS uses port 443.

  • After the host name comes the path.

  • The text after the question mark is called the “query string”.

  • It holds a collection of key-value pairs separated by ampersands.

  • And lastly, you may see a hash sign (#) at the end followed by a string.

  • This value is called a fragment and is used to jump to a section within the webpage.
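
The anatomy described above can be checked with urlsplit from the standard library's urllib.parse module. This is a minimal sketch: the example.com URL is made up for illustration and is never requested, and DEFAULT_PORTS and effective_port are hypothetical helpers that encode the HTTP and HTTPS defaults mentioned earlier.

```python
from urllib.parse import urlsplit, parse_qs

# Made-up URL for illustration; it is never actually requested.
url = "https://www.example.com:8080/videos/watch?v=abc123&t=90#comments"

parts = urlsplit(url)
print(parts.scheme)    # https
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /videos/watch
print(parts.query)     # v=abc123&t=90
print(parts.fragment)  # comments

# The query string is a set of key-value pairs separated by ampersands.
params = parse_qs(parts.query)
print(params)          # {'v': ['abc123'], 't': ['90']}

# Hypothetical helper: fall back to the protocol's default port
# when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(u):
    p = urlsplit(u)
    return p.port if p.port is not None else DEFAULT_PORTS.get(p.scheme)

print(effective_port("http://www.example.com/"))   # 80
print(effective_port("https://www.example.com/"))  # 443
```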

  • Python 3 comes equipped with a package that simplifies the task of building, loading and

  • parsing URLs: the “urllib” package...

  • This package contains five modules: request, response, error, parse and robotparser.

  • The request module is used to open URLs. The response module is used internally by

  • the request module; you will not work with this directly.

  • The error module contains several error classes for use by the request module.

  • The parse module has a variety of functions for breaking up a URL into meaningful pieces,

  • like the scheme, host, port, and query string data.

  • And finally there is robotparser.

  • An exciting name, for a less than exciting module...

  • It is used to inspect robots.txt (“robots-dot-t-x-t”) files for what permissions are granted to

  • bots and crawlers.
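
As a rough illustration of what robotparser does, the sketch below feeds it a made-up robots.txt and asks what a crawler may fetch. The rules, the bot name “MyBot” and the example.com URLs are all assumptions for the example; parse() is given the lines directly so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # read the rules from a list of lines, no network needed

# "MyBot" is a hypothetical crawler name.
allowed_home = parser.can_fetch("MyBot", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("MyBot", "https://www.example.com/private/data.html")

print(allowed_home)     # True
print(allowed_private)  # False
```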

  • Today we will focus on the request module, since this is where the action lies.

  • To begin, import urllib.

  • Now use the dir() function to see what is available.

  • Not much.

  • This is because urllib is a package holding the modules that do the actual work.

  • So instead, import the module inside urllib that you want to use.

  • We want to use the “request” module.

  • If you call the dir() function on the request module, you will see a lot of classes

  • and functions.
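
A small sketch of that inspection step follows. The exact list of names depends on your Python version, so the code only checks for two names that are known to be part of urllib.request.

```python
import urllib.request

# dir() lists the names defined in the module.
names = dir(urllib.request)

# Hide private names (those starting with an underscore) for readability.
public = [n for n in names if not n.startswith("_")]
print(len(public))           # the count varies between Python versions
print("urlopen" in public)   # True
print("Request" in public)   # True
```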

  • The function which enables you to easily open a specific URL is the “urlopen” function.

  • Just as the “open” function is used to open files,

  • “urlopen” is used to open URLs.

  • As an example, let us open the home page for Wikipedia.

  • The function returns a “response” object.

  • If you look at the type, you will see it is NOT the response in the urllib package, but

  • an HTTPResponse object from a different package, http.client.

  • To see what you can do with the response, use the dir() function.

  • First, let us check if the request was successful by looking at the response code.

  • 200…

  • This is actually good news.

  • A 200 response code means everything went OK.

  • You may ask why the number 200 was chosen.

  • I may ask the same thing...

  • Next, let us see how large the response is.

  • This is the size of the response in BYTES.

  • We can use the “peek” function to look at a small part of the response, rather than

  • the full value.

  • This most definitely looks like HTML, but notice that this is not a string.

  • The “b” at the beginning tells us this is a “bytes” object.

  • The reason for this is that web servers can host binary data

  • in addition to plain HTML files.

  • Let us now read the entire response.

  • If you look at the type, it is indeed a bytes object.

  • And it is the correct size...

  • We can convert this to text by decoding it.

  • If you look at the peek value, the character set in the response is “UTF-8”.

  • So to decode this bytes object, call the “decode” method and specify the encoding that was used.

  • We now have a string.

  • And if you display the value, you can see all the HTML for the web page.

  • By the way, look what happens if you try to read the response a second time.

  • Nothing.

  • This is because the response is read like a stream: once you have read it to the end, Python closes the connection and there is nothing left to read.
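
The walkthrough above can be sketched end to end. So that the example runs offline and gives repeatable numbers, it swaps the Wikipedia home page for a throwaway server on your own machine; PAGE and DemoHandler are hypothetical names invented for this sketch, and with a real site you would simply pass its URL to urlopen.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

# Stand-in page served locally instead of a real website (assumption for the demo).
PAGE = '<html><head><meta charset="UTF-8"></head><body>Hello</body></html>'.encode("utf-8")

class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Ignore any system proxy settings so the local server is reached directly.
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

resp = request.urlopen(url)
print(type(resp))        # <class 'http.client.HTTPResponse'>

status = resp.status     # the response code
size = resp.length       # bytes still waiting to be read
preview = resp.peek()    # look at buffered bytes without consuming them
data = resp.read()       # the whole body, as a bytes object
html = data.decode("UTF-8")
second = resp.read()     # a second read finds nothing left

print(status, size, type(data), type(html), second)

server.shutdown()
server.server_close()
```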

  • As a second example, let us send a search request to Google.

  • How rude!

  • Earlier we said that a 200 response code meant everything was OK.

  • So things are definitely not OK.

  • A 403 response code means that while our request was valid, the server is refusing to respond.

  • I can understand their reaction.

  • If they let anyone scrape their search results without restriction, then competitors would

  • use this information to their advantage.
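
In code, a refusal like this shows up as an exception: urlopen raises urllib.error.HTTPError for a 403 response. The sketch below reproduces the situation with a hypothetical local server, ForbiddenHandler, that refuses every request; it stands in for the search engine and is purely an assumption for the demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request
from urllib.error import HTTPError

class ForbiddenHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_error(403)  # refuse every request

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Ignore any system proxy settings so the local server is reached directly.
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/search?q=socratica"

try:
    request.urlopen(url)
    outcome = "ok"
except HTTPError as err:
    outcome = err.code  # the response code the server sent back
    print("Request refused:", err.code, err.reason)

server.shutdown()
server.server_close()
```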

  • Let us try a different example...

  • We will now load the YouTube page for this incredible video on Black Holes.

  • Here is the URL.

  • Notice that this URL contains two parameters in the query string: “v” and “t”.

  • “v” is the video ID, and “t” is the time in the video to begin playback.

  • One way to construct the query string is to append a lot of strings together.

  • But there is an easier way.

  • To see this, import the “parse” module.

  • Looking at the directory, you can see a large collection of functions for working with URLs.

  • Here, we will use the “urlencode” function.

  • First, we create a dictionary containing the query string parameters.

  • Next, call the “urlencode” function.

  • The result is a string that is suitable for use as the query string.

  • Notice, however, that the question mark is NOT included.

  • We can now build the URL.
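
Here is a minimal sketch of those steps. The transcript does not show the actual video ID or start time, so “VIDEO_ID” and “5m56s” are placeholder values, not the ones used in the video.

```python
from urllib import parse

# Placeholder values; substitute the real video ID and start time.
params = {"v": "VIDEO_ID", "t": "5m56s"}

querystring = parse.urlencode(params)
print(querystring)  # v=VIDEO_ID&t=5m56s

# urlencode does not add the question mark, so we add it ourselves.
url = "https://www.youtube.com/watch" + "?" + querystring
print(url)          # https://www.youtube.com/watch?v=VIDEO_ID&t=5m56s

# urlencode also escapes characters that are not safe in a URL.
search = parse.urlencode({"q": "black holes"})
print(search)       # q=black+holes
```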

  • Next, open the URL using the “urlopen” function.

  • If you call the “isclosed” method,

  • you can see we still have a connection with the server.

  • The response code is 200, so our request was fulfilled.

  • How generous.

  • We can then read and decode the server response in a single line.

  • Looking at the first 500 characters of the HTML, we see everything looks to be in order.
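
The whole sequence, from building the query string to reading the page, might look like the sketch below. To keep it runnable offline it talks to a hypothetical local server, EchoHandler, which simply echoes back the path it was asked for; the video ID and time are placeholders, and with the real site you would use the YouTube URL instead.

```python
import threading
from html import escape
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import parse, request

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Echo the requested path (including the query string) back as HTML.
        text = f"<html><body>You asked for {escape(self.path)}</body></html>"
        body = text.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

# Ignore any system proxy settings so the local server is reached directly.
request.install_opener(request.build_opener(request.ProxyHandler({})))

server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Placeholder parameters standing in for the real video ID and start time.
querystring = parse.urlencode({"v": "VIDEO_ID", "t": "5m56s"})
url = f"http://127.0.0.1:{server.server_port}/watch?{querystring}"

resp = request.urlopen(url)
open_before_read = not resp.isclosed()  # still connected before reading
status = resp.status

# Read and decode the server response in a single line.
page = resp.read().decode("utf-8")
closed_after_read = resp.isclosed()

print(status, open_before_read, closed_after_read)
print(page[:500])  # first 500 characters of the HTML

server.shutdown()
server.server_close()
```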

  • You have now taken your first step towards bypassing the browser, and interacting with

  • web servers programmatically.

  • But there is much more to learn.

  • What if you want to send a POST or PUT request?

  • How do you include cookies in your request?

  • What if authentication is required?

  • And what if you aren’t subscribed to Socratica?

  • Why don’t we make videos more quickly?

  • Be patient.

  • You will soon learn how to solve all of these problems.

Urllib - GET Requests || Python Tutorial || Learn Python Programming

Posted by 林宜悉 on January 14, 2021