01.29 Using Python's Basic Libraries

2020-01-29 22:39:51 洛鴻0920

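3.1 Using urllib

urllib consists of four modules: request, error, parse, and robotparser.

request: the basic HTTP request module, used to simulate sending a request, much like typing a URL into a browser and pressing Enter. Passing a URL and any extra parameters to the functions in this module is enough to reproduce that process.

error: the exception-handling module. If a request fails, catching the exceptions defined here keeps the program from terminating unexpectedly.

parse: a utility module that provides URL handling such as splitting, parsing, and joining.

robotparser: mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not.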
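To make the parse and robotparser descriptions more concrete, here is a minimal sketch that runs without any network access. The URLs and the two robots.txt rules are made up for illustration; a real crawler would normally load the rules from the site itself.

```python
from urllib.parse import urlsplit, urlunsplit, urljoin
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into its components, then put it back together.
parts = urlsplit("https://www.python.org/doc/?lang=en#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
rebuilt = urlunsplit(parts)

# Resolve a relative link against a base URL.
joined = urljoin("https://www.python.org/doc/", "../about/")
print(joined)

# urllib.robotparser: the rules are passed in as lines (hypothetical rules),
# so nothing is downloaded here.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rp.can_fetch("*", "https://example.com/index.html")
blocked = rp.can_fetch("*", "https://example.com/private/data.html")
print(allowed, blocked)
```

The request and error modules need a live connection to demonstrate, so they are left to the examples in the following sections.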

(1) Sending requests

urlopen()

This function lives in the urllib.request module. urllib.request issues requests to a URL and also handles authentication, redirection, browser cookies, and more.

import urllib.request

response=urllib.request.urlopen('https://www.python.org')

print(response.read().decode('utf-8'))

This fetches the Python home page and prints the page's source code.

print(type(response))

<class 'http.client.HTTPResponse'>

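The result is an HTTPResponse object. It provides methods such as read(), readinto(), getheader(), getheaders(), and fileno(), along with attributes such as msg, version, status, reason, debuglevel, and closed.

Calling read() returns the content of the page, and the status attribute holds the status code of the response.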

print(response.status)

print(response.getheaders())

print(response.getheader('Server'))

200

[('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'SAMEORIGIN'), ('x-xss-protection', '1; mode=block'), ('X-Clacks-Overhead', 'GNU Terry Pratchett'), ('Via', '1.1 varnish'), ('Content-Length', '48805'), ('Accept-Ranges', 'bytes'), ('Date', 'Wed, 27 Jun 2018 06:37:23 GMT'), ('Via', '1.1 varnish'), ('Age', '2961'), ('Connection', 'close'), ('X-Served-By', 'cache-iad2146-IAD, cache-bur17522-BUR'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '3, 602'), ('X-Timer', 'S1530081443.039014,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

nginx

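200 is the status code. The second output is the list of response headers. The last one comes from calling getheader() with the argument 'Server'; it returns nginx, which means the server is built on nginx.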
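The calls above can be gathered into one small helper. This is only a sketch: the name fetch_text and the default_charset argument are invented here, not part of urllib. It wraps urlopen() in a with block so the connection is closed, and it reads the charset from the Content-Type header instead of hard-coding 'utf-8'.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Return (status, Server header, decoded body) for url.

    fetch_text is a hypothetical helper name used for illustration.
    """
    # The with block closes the connection even if reading fails.
    with urllib.request.urlopen(url) as response:
        # response.headers is an email.message.Message; get_content_charset()
        # returns the charset named in Content-Type, or None if absent.
        charset = response.headers.get_content_charset() or default_charset
        body = response.read().decode(charset)
        return response.status, response.getheader("Server"), body


# Example call (needs network access):
# status, server, body = fetch_text("https://www.python.org")
# print(status, server, len(body))
```

Falling back to UTF-8 when the server does not declare a charset is an assumption that suits most modern pages, but it is not guaranteed to be right for every site.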

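urlopen() on its own is enough to fetch a page with a simple GET request. To send parameters along with the request, use its other arguments.

Signature: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

1) The data parameter

The data parameter is optional. If you supply it, it must be byte-stream encoded content, that is, of type bytes, so a string has to be converted first, for example with bytes(). Once this parameter is passed, the request method becomes POST instead of GET.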
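The switch from GET to POST can be checked without sending anything. The sketch below builds urllib.request.Request objects, which this section has not introduced, purely to inspect the method that would be used. The URL is a placeholder and no request is sent.

```python
import urllib.parse
import urllib.request

url = "https://httpbin.org/post"  # placeholder target; nothing is sent here
data = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")  # str -> bytes

# A Request object describes a request without sending it.
without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)

print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
```

Calling .encode('utf-8') on the string is equivalent to the bytes(..., encoding='utf-8') form used in this article.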

Differences between POST and GET

For details, see:

import urllib.parse

import urllib.request

data=bytes(urllib.parse.urlencode({'word':'hello'}),encoding='utf-8')

response=urllib.request.urlopen('https://httpbin.org/post',data=data)

print(response.read())

b'{"args":{},"data":"","files":{},"form":{"word":"hello"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.5"},"json":null,"origin":"171.221.3.139","url":"https://httpbin.org/post"}\n'

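Here we pass one parameter, word, with the value hello. The urlencode() function from the parse module turns the parameter dict into a string, and bytes(str, encoding=...) then converts that string to bytes.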
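The two conversion steps are easier to follow with values that actually need escaping. The parameter names and values below are invented for illustration, and parse_qs is used only to show that the encoding can be reversed.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form fields: one contains a space, one contains non-ASCII text.
params = {"word": "hello world", "lang": "中文"}

# Step 1: dict -> percent-encoded query string (still a str).
query = urlencode(params)
print(query)

# Step 2: str -> bytes, the type urlopen() expects for its data argument.
payload = bytes(query, encoding="utf-8")
print(payload)

# Decoding the payload recovers the original values (each wrapped in a list).
decoded = parse_qs(payload.decode("utf-8"))
print(decoded)
```

urlencode() writes the space as + and the non-ASCII characters as percent-escaped UTF-8 bytes, so the resulting string is pure ASCII before it is turned into bytes.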

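The site requested here is httpbin.org, which provides a test endpoint for HTTP POST requests. It echoes back information about the request, including the data parameter we sent, which appears under form.

2) The timeout parameter

The timeout parameter sets the timeout in seconds. If the request goes longer than this without a response, an exception is raised.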

import urllib.request

response=urllib.request.urlopen('https://httpbin.org/get',timeout=0.1)

print(response.read())

URLError: <urlopen error timed out>

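A response rarely comes back within 0.1 seconds, so this call will almost always raise an exception.

In a typical crawl, if a page takes too long to respond you can skip it, which can be done with a try/except statement.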

import socket

import urllib.request