Python Crawler Essentials: Building Your Own Free Proxy IP Pool from 0 to 1

Why Use Proxy IPs

In the course of crawling, many websites deploy anti-crawler techniques, the most common being a cap on the number of visits from a single IP. Once a site has banned your local IP address, you may need to switch to a proxy to keep crawling. Many websites offer free proxy IPs (e.g. www.xicidaili.com); what we need to do is scrape proxy IPs from these sites, test their validity, and add the usable ones to a database table as our crawler's proxy IP pool.

Development Approach

1. Fetch the first batch of seed proxy IPs with the local IP

Scraping proxy IPs from a proxy site is itself a crawl, and too many requests in a short time will get us banned. So we use the local IP to fetch the first batch of proxy IPs, then use those proxies to fetch new ones.
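A minimal bootstrap sketch of this step, using requests and BeautifulSoup (hedged: the page URL, the "odd"/"even" row classes, and the column order are assumptions about xici's markup at the time of writing and may need adjusting):

import requests
from bs4 import BeautifulSoup

# Bootstrap sketch: fetch the first xici page with the local IP (no proxies
# argument). The row classes and column order below are assumptions about
# the site's HTML table.
def fetch_seed_page(headers):
    url = "https://www.xicidaili.com/nn/1"
    response = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    ip_list = []
    for row in soup.find_all("tr", class_=["odd", "even"]):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # assumed columns: country, IP, port, address, anonymity, protocol, ...
        if len(cells) >= 6:
            ip_list.append((cells[1], cells[2], cells[3], cells[4], cells[5]))
    return ip_list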

2. Validate the seed proxy IPs and store them in the database

Under the database IP.db we create two tables: proxy_ip_table (stores every scraped IP; used to check that scraping works) and validation_ip_table (stores every IP that passed validation; used to track which IPs are valid).

Proxy IPs obtained in step 1 are stored in validation_ip_table after passing the check, which is implemented as follows:

def ip_validation(self, ip):
    # Check anonymity: a non-elite ("高匿") proxy still gives away your real IP
    anonymity_flag = False
    if "高匿" in str(ip):
        anonymity_flag = True
    IP = str(ip[0]) + ":" + str(ip[1])
    url = "http://httpbin.org/get"  ## site used to test whether the proxy works
    proxies = {"https": "https://" + IP}  # why https rather than http works here, I'm honestly not sure
    headers = FakeHeaders().random_headers_for_validation()
    # Check availability
    validation_flag = True
    response = None
    try:
        response = requests.get(url=url, headers=headers, proxies=proxies, timeout=5)
    except requests.RequestException:
        validation_flag = False
    if response is None:
        validation_flag = False

    if anonymity_flag and validation_flag:
        return True
    else:
        return False
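A hypothetical usage sketch, assuming ip_validation is a method of the Crawl class in GetProxyIP.py and that proxies are passed around as (IP, PORT, ADDRESS, TYPE, PROTOCOL) tuples matching the table schema:

from GetProxyIP import Crawl
from DatabaseTable import IPPool

# Hypothetical values; TYPE contains "高匿" for elite (high-anonymity) proxies
ip = ("183.47.40.35", 8088, "Guangdong", "高匿", "HTTPS")
if Crawl().ip_validation(ip):
    IPPool("validation_ip_table").insert([ip])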

3. Build the list of URLs to visit and scrape them in a loop; each scraped ip_list is validated and then stored in the database table

We build the list of URLs to visit (tentatively 100 pages, an amount that is easy to get through):

self.URLs = [ "https://www.xicidaili.com/nn/%d" % (index + 1) for index in range(100)] 
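The crawl loop then works through this list roughly as follows (a sketch only: parse_page is a hypothetical helper standing in for the actual HTML parsing, and the proxy-rotation details are assumptions):

import requests
from DatabaseTable import IPPool
from RandomHeaders import FakeHeaders

def crawl_urls(urls):
    for url in urls:
        # rotate through validated proxies; rows are (IP, PORT, ADDRESS, TYPE, PROTOCOL)
        row = IPPool("validation_ip_table").select(random_flag=True)
        proxies = {"https": "https://%s:%s" % (row[0], row[1])}
        headers = FakeHeaders().random_headers_for_xici()
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
        except requests.RequestException:
            continue  # skip this page if the proxy fails
        ip_list = parse_page(response.text)  # hypothetical parsing helper
        IPPool("proxy_ip_table").insert(ip_list)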

Included Modules

1. RandomHeaders.py

Constructs random request headers to simulate different web browsers. Called like this:

from RandomHeaders import FakeHeaders
# returns the request headers for the xici proxy site
xici_headers = FakeHeaders().random_headers_for_xici()
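Note that fake_useragent fetches its UA data over the network and caches it locally, so UserAgent() can fail when that source is unreachable; presumably this is why FakeHeaders also keeps a hard-coded UA list (see the full source below). A hedged fallback sketch, assuming fake_useragent 0.1.x and its FakeUserAgentError exception:

import random
from fake_useragent import UserAgent
from fake_useragent.errors import FakeUserAgentError

# Fall back to a local UA pool when fake_useragent's data source is unreachable
def safe_random_ua(fallback_pool):
    try:
        return UserAgent().random
    except FakeUserAgentError:
        return random.choice(fallback_pool)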

2. DatabaseTable.py

Provides table creation plus insert, delete, and select operations for the database. Called like this:

from DatabaseTable import IPPool
tablename = "proxy_ip_table"
# tablename can also be "validation_ip_table"
IPPool(tablename).create()  # create the table
IPPool(tablename).select(random_flag=False)
# random_flag=True returns one random record; otherwise all records are returned
IPPool(tablename).delete(delete_all=True)  # delete all records
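insert() takes a list of five-field tuples matching the table schema; a hypothetical example:

from DatabaseTable import IPPool
# hypothetical values, following the (IP, PORT, ADDRESS, TYPE, PROTOCOL) schema
IPPool("proxy_ip_table").insert([("183.47.40.35", 8088, "Guangdong", "高匿", "HTTPS")])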

3. GetProxyIP.py

The core code; a few functions provide different capabilities:

  • Starting from scratch: create the tables, scrape IPs, and store them in the database
from GetProxyIP import Crawl
Crawl().original_run()
  • When there are not enough proxy IPs, scrape more pages from a url_list and store the usable IPs
from GetProxyIP import Crawl
# other websites that offer free proxy IPs
url_kuaidaili = ["https://www.kuaidaili.com/free/inha/%d" % (index + 1) for index in range(10, 20)]
Crawl().get_more_run(url_kuaidaili)

  • When the pool has gone unused for a while, re-validate the IPs and delete those that no longer qualify
from GetProxyIP import Crawl
Crawl().proxy_ip_validation()

Selected Code

For the complete code, please see my GitHub page: https://github.com/TOMO-CAT/ProxyIPPool

1. RandomHeaders.py

Provides random request headers, imitating browser visits to cope with anti-crawling measures.

# -*- coding: utf-8 -*-
"""
Created on Tue Jan 29 10:36:28 2019
@author: YANG
Purpose: generate random request headers to simulate visits from different browsers
"""
import random
from fake_useragent import UserAgent

class FakeHeaders(object):
    """
    Generate random request headers
    """
    def __init__(self):
        # Hand-maintained UA pool (the methods below draw from
        # fake_useragent's UserAgent().random instead)
        self.__UA = [
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
"Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Opera/9.27 (Windows NT 5.2; U; zh-cn)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
"Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.9.2.1000 Chrome/39.0.2146.0Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36",
"Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
"Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.154 Safari/537.36 LBBROWSER",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",

"Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
"Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
"Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
"Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
"Mozilla/4.0 (compatible; MSIE 12.0",
"Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
"Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; Touch; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Tablet PC 2.0)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
"Mozilla/5.0 (Windows NT 5.1; rv:44.0) Gecko/20100101 Firefox/44.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
"Mozilla/5.0 (Windows NT 6.1; rv,2.0.1) Gecko/20100101 Firefox/4.0.1",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE2.X MetaSr 1.0",
]
    # The User-Agent identifies the browser type and version, the OS and
    # version, and the browser engine, among other things
    def random_headers_for_xici(self):
        headers = {
            "User-Agent": UserAgent().random,  ## pick a random UA
            "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.xicidaili.com",
            "Upgrade-Insecure-Requests": "1"
        }
        return headers

    def random_headers_for_validation(self):
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "close",
            "Host": "httpbin.org",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": UserAgent().random}
        return headers

if __name__ == "__main__":
    print("Randomly sampling 20 sets of headers:")
    for i in range(20):
        print(FakeHeaders().random_headers_for_xici())

2. DatabaseTable.py

Provides the database functionality; this is where the IP-storing database IP.db lives.

import sqlite3  ## lets us use an SQLite database from a Python program
import time

class IPPool(object):
    ## Database that stores IPs, holding the two tables proxy_ip_table and
    ## validation_ip_table; the insert and table-creation statements live together here
    def __init__(self, table_name):
        ## initialize the class with the parameter table_name
        self.__table_name = table_name
        self.__database_name = "IP.db"  ## IPPool works on the database IP.db

    def create(self):
        ## table-creation statement
        conn = sqlite3.connect(self.__database_name, isolation_level=None)
        conn.execute(
            "create table if not exists %s(IP CHAR(20) UNIQUE, PORT INTEGER, ADDRESS CHAR(50), TYPE CHAR(50), PROTOCOL CHAR(50))"
            % self.__table_name)
        conn.close()
        print("Table %s created successfully in database IP.db" % self.__table_name)

    def insert(self, ip):
        conn = sqlite3.connect(self.__database_name, isolation_level=None)
        # isolation_level is the transaction isolation level; by default you must
        # commit yourself, while None makes every change commit automatically
        for one in ip:
            conn.execute(
                "insert or ignore into %s(IP, PORT, ADDRESS, TYPE, PROTOCOL) values (?,?,?,?,?)"
                % (self.__table_name),
                (one[0], one[1], one[2], one[3], one[4]))
        conn.commit()  # commit the inserts; strictly redundant since isolation_level is None
        conn.close()

    def select(self, random_flag=False):
        conn = sqlite3.connect(self.__database_name, isolation_level=None)
        ## connect to the database
        cur = conn.cursor()
        # the cursor receives the returned results
        if random_flag:
            cur.execute(
                "select * from %s order by random() limit 1"
                % self.__table_name)
            result = cur.fetchone()
            # if random_flag is True, fetch and return one random record
        else:
            cur.execute("select * from %s" % self.__table_name)
            result = cur.fetchall()
        cur.close()
        conn.close()
        return result

    def delete(self, IP=('1', 1, '1', '1', '1'), delete_all=False):
        conn = sqlite3.connect(self.__database_name, isolation_level=None)
        if not delete_all:
            n = conn.execute("delete from %s where IP=?" % self.__table_name,
                             (IP[0],))
            # the comma cannot be omitted: a one-element tuple needs it
            print("Deleted", n.rowcount, "rows")
        else:
            n = conn.execute("delete from %s" % self.__table_name)
            print("Deleted all records,", n.rowcount, "rows in total")
        conn.close()
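Once the pool is populated, a random validated proxy can be plugged straight into a requests call; a minimal consumption sketch, assuming the five-field row format:

import requests
from DatabaseTable import IPPool

# Pull one random validated proxy and route a request through it
row = IPPool("validation_ip_table").select(random_flag=True)
proxies = {"https": "https://%s:%s" % (row[0], row[1])}
response = requests.get("https://httpbin.org/get", proxies=proxies, timeout=5)
print(response.json())  # httpbin echoes back the origin IP the server saw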

Problems & Improvements

  • Scraping proxy IPs from a proxy site is essentially crawling too, and these sites have anti-crawling measures of their own. After roughly 4,000 proxy IPs from xici, your IP gets banned, so you end up needing proxies just to fetch more proxies.
  • Contrary to much of the advice online, the proxies on the first 100 xici pages have a fairly high availability rate, roughly 90%. There remain two problems: a proxy that passes validation may not actually work, and a proxy that fails the first check may work later. This is why I store scraped proxies and validated proxies in two separate tables.
  • When using a proxy IP, building the proxy entry as http versus https can produce completely different results; likewise, writing the target url with http or https can be the difference between the program running normally and throwing an error. I do not yet know why.
  • Because the program relies on the network and loops many times, I use continue in several places to skip failed iterations; even so, the success rate of obtaining valid proxy IPs stays fairly high.
  • Collecting 10,000 proxy IPs takes about five hours, which is far too slow; a future improvement may be to try multithreading (see the sketch after this list).
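On that last point, a minimal sketch of how the validation step could be parallelized with a thread pool (ip_validation is the checker from step 2; candidates is a list of five-field proxy tuples):

from concurrent.futures import ThreadPoolExecutor

# Validate candidate proxies concurrently and keep the ones that pass
def validate_all(candidates, ip_validation, workers=20):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(ip_validation, candidates))
    return [ip for ip, ok in zip(candidates, results) if ok]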

Closing Remarks

Since this is the first Python program I have written, there is plenty I still do not fully understand, and the code comments are rather long-winded. If you crawl regularly, building your own proxy IP pool is well worth it. Please contact me directly with any suggestions for improvement, thanks!

