1. Fetching the Page
First, we need the following libraries:
<code>import requests
import csv
import random
import time
import socket
import http.client
from bs4 import BeautifulSoup</code>
In Python 3, these libraries can be installed on both Windows and Linux with pip, for example:
<code>pip3 install requests</code>
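Of the libraries above, only requests and BeautifulSoup are third-party packages (the others ship with the standard library), so a single install command along these lines should cover everything:
<code>pip3 install requests beautifulsoup4</code>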
requests: fetches the HTML source code of the page
csv: writes the data to a CSV file
random: generates random numbers
time: time-related operations
socket and http.client: used here only for exception handling
BeautifulSoup: used in place of regular expressions to extract the contents of the relevant tags from the page source and output them in a suitable form.
The following code fetches the page content:
<code>def get_content(url, data=None):
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.235'
    }
    timeout = random.choice(range(80, 180))
    while True:
        try:
            rep = requests.get(url, headers=header, timeout=timeout)
            rep.encoding = 'utf-8'
            # req = urllib.request.Request(url, data, header)
            # response = urllib.request.urlopen(req, timeout=timeout)
            # html1 = response.read().decode('UTF-8', errors='ignore')
            # response.close()
            break
        # except urllib.request.HTTPError as e:
        #     print('1:', e)
        #     time.sleep(random.choice(range(5, 10)))
        # except urllib.request.URLError as e:
        #     print('2:', e)
        #     time.sleep(random.choice(range(5, 10)))
        except socket.timeout as e:
            print('3:', e)
            time.sleep(random.choice(range(8, 15)))
        except socket.error as e:
            print('4:', e)
            time.sleep(random.choice(range(20, 60)))
        except http.client.BadStatusLine as e:
            print('5:', e)
            time.sleep(random.choice(range(30, 80)))
        except http.client.IncompleteRead as e:
            print('6:', e)
            time.sleep(random.choice(range(5, 15)))
    return rep.text
    # return html_text</code>
header is a parameter of requests.get whose purpose is to mimic a browser visit; you can obtain this header information by pressing F12 in your browser to open the developer tools.
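As a quick sanity check, the function can be called directly with the forecast URL used later in this article's main function; this sketch simply previews the beginning of the returned HTML:
<code>html = get_content('http://www.weather.com.cn/weather/101190401.shtml')
print(html[:300])  # print the first 300 characters of the page source</code>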
2. Extracting the Fields We Need from the Page
The code is as follows:
<code>def get_data(html_text):
    final = []
    bs = BeautifulSoup(html_text, "html.parser")
    body = bs.body
    data = body.find('div', {'id': '7d'})
    ul = data.find('ul')
    li = ul.find_all('li')
    for day in li:
        temp = []
        date = day.find('h1').string
        temp.append(date)
        inf = day.find_all('p')
        temp.append(inf[0].string)
        if inf[1].find('span') is None:
            temperature_highest = None
        else:
            temperature_highest = inf[1].find('span').string
            temperature_highest = temperature_highest.replace('℃', '')
        temperature_lowest = inf[1].find('i').string
        temperature_lowest = temperature_lowest.replace('℃', '')
        temp.append(temperature_highest)
        temp.append(temperature_lowest)
        final.append(temp)
    return final</code>
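Each entry in the returned list has the form [date, weather description, highest temperature, lowest temperature]. A quick way to inspect the parsed result, reusing the same forecast URL from the main function below, might look like this:
<code>html = get_content('http://www.weather.com.cn/weather/101190401.shtml')
for row in get_data(html):
    print(row)  # one [date, weather, high, low] list per day</code>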
3. Writing the Data to a CSV File
<code>def write_data(data, name):
    file_name = name
    with open(file_name, 'a', errors='ignore', newline='') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)</code>
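Since the file is opened in append mode, no header row is written automatically. If one is wanted, a minimal sketch is to write it once before the data rows (the column names here are only illustrative):
<code>write_data([['date', 'weather', 'highest', 'lowest']], 'weather.csv')  # illustrative header row
write_data(result, 'weather.csv')  # 'result' is the list returned by get_data</code>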
4. Main Function
<code>if __name__ == '__main__':
    url = 'http://www.weather.com.cn/weather/101190401.shtml'
    html = get_content(url)
    result = get_data(html)
    write_data(result, 'weather.csv')</code>
With that, we have built a simple little crawler for fetching weather information.