How to Build a Python Crawler: Your First Little Crawler Program



1. Fetching the Page


First, we will need the following libraries:

<code>
import requests
import csv
import random
import time
import socket
import http.client
from bs4 import BeautifulSoup
</code>

In Python 3, on both Windows and Linux, a library can be installed with the following command:

<code>
# Example:
pip3 install requests
</code>

requests: fetches the HTML source of a web page

csv: writes the data out to a CSV file

random: generates random numbers

time: time-related operations (here, sleeping between retries)

socket and http.client: used here only for exception handling

BeautifulSoup: used in place of regular expressions to pull the content out of the relevant tags in the source and output it in a convenient form.
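To see why BeautifulSoup is nicer than hand-rolled regular expressions for this job, here is a minimal sketch (the HTML fragment is made up for illustration; it assumes bs4 is installed):

```python
from bs4 import BeautifulSoup

# A tiny hand-made page, just for demonstration
html = '<html><body><h1>晴</h1><p id="temp">25℃</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Grab tag contents by tag name and attributes instead of writing a regex
title = soup.find('h1').string               # '晴'
temp = soup.find('p', {'id': 'temp'}).string  # '25℃'
print(title, temp)
```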


The following code fetches the page content:

<code>
def get_content(url, data=None):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }
    timeout = random.choice(range(80, 180))
    while True:
        try:
            rep = requests.get(url, headers=header, timeout=timeout)
            rep.encoding = 'utf-8'
            # Equivalent version using urllib instead of requests:
            # req = urllib.request.Request(url, data, header)
            # response = urllib.request.urlopen(req, timeout=timeout)
            # html1 = response.read().decode('UTF-8', errors='ignore')
            # response.close()
            break
        # except urllib.request.HTTPError as e:
        #     print('1:', e)
        #     time.sleep(random.choice(range(5, 10)))
        # except urllib.request.URLError as e:
        #     print('2:', e)
        #     time.sleep(random.choice(range(5, 10)))
        except socket.timeout as e:
            print('3:', e)
            time.sleep(random.choice(range(8, 15)))
        except socket.error as e:
            print('4:', e)
            time.sleep(random.choice(range(20, 60)))
        except http.client.BadStatusLine as e:
            print('5:', e)
            time.sleep(random.choice(range(30, 80)))
        except http.client.IncompleteRead as e:
            print('6:', e)
            time.sleep(random.choice(range(5, 15)))
    return rep.text
    # return html_text
</code>

header is a parameter of requests.get; it makes the request look like a browser visit. Press F12 in your browser's developer tools to see the header information it sends.
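As a small sketch of that idea: the values below are trimmed, illustrative copies of what the Network panel shows after pressing F12, and the timeout is randomized the same way get_content does it so request timing is less uniform:

```python
import random

# Illustrative headers copied from a browser's F12 / Network panel
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/43.0.235',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
}

# get_content picks a random timeout in [80, 180) seconds for each run
timeout = random.choice(range(80, 180))
print(timeout)
```

These would be passed as `requests.get(url, headers=header, timeout=timeout)`, exactly as in get_content above.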

2. Extracting the Fields We Need from the Page

The code is as follows:

<code>
def get_data(html_text):
    final = []
    bs = BeautifulSoup(html_text, "html.parser")
    body = bs.body
    data = body.find('div', {'id': '7d'})
    ul = data.find('ul')
    li = ul.find_all('li')
    for day in li:
        temp = []
        date = day.find('h1').string
        temp.append(date)
        inf = day.find_all('p')
        temp.append(inf[0].string)
        if inf[1].find('span') is None:
            temperature_highest = None
        else:
            temperature_highest = inf[1].find('span').string
            temperature_highest = temperature_highest.replace('℃', '')
        temperature_lowest = inf[1].find('i').string
        temperature_lowest = temperature_lowest.replace('℃', '')
        temp.append(temperature_highest)
        temp.append(temperature_lowest)
        final.append(temp)
    return final
</code>
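To see the shape of what get_data returns, here is a sketch of the same parsing logic run on a hand-made HTML fragment that mimics the weather page's 7-day forecast list (the real page's markup may differ in detail):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the weather page's forecast list
html_text = '''
<div id="7d"><ul>
  <li><h1>18日</h1><p>多云</p>
      <p><span>25℃</span>/<i>17℃</i></p></li>
</ul></div>
'''
bs = BeautifulSoup(html_text, 'html.parser')
data = bs.find('div', {'id': '7d'})
final = []
for day in data.find('ul').find_all('li'):
    inf = day.find_all('p')
    high = inf[1].find('span').string.replace('℃', '')
    low = inf[1].find('i').string.replace('℃', '')
    final.append([day.find('h1').string, inf[0].string, high, low])
print(final)  # -> [['18日', '多云', '25', '17']]
```

Each list item becomes one row of [date, weather, high, low], which is exactly the row format the CSV step below expects.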

3. Writing to a CSV File

<code>
def write_data(data, name):
    file_name = name
    with open(file_name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)
</code>
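A quick standalone sketch of what write_data does, using a temporary file and made-up rows shaped like get_data's output:

```python
import csv
import tempfile

# Rows shaped like get_data's output: [date, weather, high, low]
rows = [['18日', '多云', '25', '17'], ['19日', '晴', '27', '18']]

# newline='' keeps csv from writing blank lines on Windows; mode 'a'
# appends, so repeated runs keep adding rows, just like write_data.
with tempfile.NamedTemporaryFile('a', suffix='.csv', newline='',
                                 encoding='utf-8', delete=False) as f:
    csv.writer(f).writerows(rows)
    path = f.name

with open(path, newline='', encoding='utf-8') as f:
    print(list(csv.reader(f)))  # the two rows, read back
```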

4. Main Function

<code>
if __name__ == '__main__':
    url = 'http://www.weather.com.cn/weather/101190401.shtml'
    html = get_content(url)
    result = get_data(html)
    write_data(result, 'weather.csv')
</code>



With that, you have a simple little crawler for fetching weather conditions.

