Having a proxy pool helps a great deal with crawling work. After some research, a small proxy pool came together: many features were cut, keeping only the essentials. Because the storage and retrieval logic is relatively simple, the two are merged into a single module. There are many websites that publish free proxies; only one of them is scraped here. The full module code is pasted below.
<code>
import time

import pymongo
import requests
from lxml import etree


class CAT_IP():
    def __init__(self):
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.db = self.client['proxy']
        self.session = requests.Session()
        # Assumed URLs: the original left `self.url` and `baseurl` undefined
        self.url = 'https://www.xicidaili.com/'
        self.base_url = 'https://www.xicidaili.com/nn/{}'
        self.headers = {
            'Cookie': '_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJWYwNzA1YmIzM2QzNTU0NGNjNmMyNWI3NDk1M2FlNmE5BjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMTQ5K3ZlRkx2dGs3ZmZMZTBjd1VLRTRHaUFCVDdKQTkxOTFIU3BYekYrdmc9BjsARg%3D%3D--8a2932ebb9c868977ffbc071eab471ef4144a1c6; Hm_lvt_0cf76c77469e965d2957f0553e6ecf59=1545528007,1545529206,1545554081; Hm_lpvt_0cf76c77469e965d2957f0553e6ecf59=1545554192',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
            'Host': 'www.xicidaili.com'
        }
        # Warm up the session so the site sets its cookies
        self.session.get(url=self.url, headers=self.headers)

    def the_xici(self):
        """Crawl the first three listing pages and yield proxies one by one."""
        for i in range(3):
            time.sleep(1)  # be polite: one page per second
            the_url = self.base_url.format(i + 1)
            response = requests.get(url=the_url, headers=self.headers)
            text = response.content.decode('utf-8')
            html = etree.HTML(text=text)
            targets = html.xpath('//table[@id="ip_list"]//tr')
            del targets[0]  # drop the header row
            for target in targets:
                target_ip = ''.join(target.xpath('./td[2]/text()'))
                target_port = ''.join(target.xpath('./td[3]/text()'))
                result = '{}:{}'.format(target_ip, target_port)
                print('Got proxy {}'.format(result))
                yield {'dl': result}

    def save_all_to_waitingArea(self, lists):
        """Replace the contents of the wait_area collection with `lists`."""
        collection = self.db['wait_area']
        collection.delete_many({})  # `remove` was removed in pymongo 4
        collection.insert_many(lists)
        print('All proxies stored')


if __name__ == '__main__':
    spider = CAT_IP()
    spider.save_all_to_waitingArea(spider.the_xici())
</code>
The module has two methods: one crawls the free proxies, the other stores all of them into the database's `wait_area` collection. Some parameters are initialized in the `__init__` method.
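The XPath extraction in `the_xici` can be sanity-checked without hitting the site. The sketch below runs the same `//table[@id="ip_list"]//tr` query against a hand-written HTML fragment that mimics the proxy table's layout (the fragment itself is an assumption, not a capture of the real page):

```python
from lxml import etree

# Hand-written fragment mimicking the site's proxy table layout
SAMPLE = """
<table id="ip_list">
  <tr><th>country</th><th>IP</th><th>port</th></tr>
  <tr><td>cn</td><td>1.2.3.4</td><td>8080</td></tr>
  <tr><td>cn</td><td>5.6.7.8</td><td>3128</td></tr>
</table>
"""

def parse_proxies(text):
    """Same extraction logic as the_xici, applied to an HTML string."""
    html = etree.HTML(text)
    rows = html.xpath('//table[@id="ip_list"]//tr')
    del rows[0]  # drop the header row
    for row in rows:
        ip = ''.join(row.xpath('./td[2]/text()'))    # 2nd cell: IP
        port = ''.join(row.xpath('./td[3]/text()'))  # 3rd cell: port
        yield {'dl': '{}:{}'.format(ip, port)}

print(list(parse_proxies(SAMPLE)))
# → [{'dl': '1.2.3.4:8080'}, {'dl': '5.6.7.8:3128'}]
```

Keeping the parsing in a function like this makes it easy to adapt when the site changes its column order.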
The second module is the checking module. The code is as follows:
<code>
import threading

import pymongo
import requests


class CHECK_PROXY():
    def __init__(self):
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        self.db = self.client['proxy']
        self.session = requests.Session()
        self.target_url = 'https://mp.csdn.net/mdeditor#'

    def save_one_to_useArea(self, proxy):
        """Store a proxy that passed the test, skipping duplicates."""
        collection = self.db['use_area']
        is_live = collection.find_one({'dl': proxy['dl']})
        if is_live is None:
            collection.insert_one(proxy)
        else:
            print('Already stored {}'.format(proxy))

    def get_one_proxy(self):
        """Atomically pop one proxy from wait_area (safe across threads)."""
        collection = self.db['wait_area']
        return collection.find_one_and_delete({})

    def test_IP(self, IP):
        proxies = {
            'http': 'http://{}'.format(IP),
            'https': 'http://{}'.format(IP),
        }
        try:
            with self.session.get(url=self.target_url, proxies=proxies,
                                  timeout=10) as response:
                if response.status_code == 200:
                    print('Proxy {} passed'.format(IP))
                    return True
        except requests.RequestException:
            pass
        print('Proxy {} failed'.format(IP))
        return False

    def check_count(self):
        return self.db['wait_area'].count_documents({})

    def check_proxy(self):
        # A loop instead of the original recursion, which could hit
        # Python's recursion limit on a large wait_area
        while True:
            proxy = self.get_one_proxy()
            if proxy is None:  # queue drained (possibly by another thread)
                break
            if self.test_IP(proxy['dl']):
                self.save_one_to_useArea(proxy)


if __name__ == '__main__':
    for t in range(7):
        thread = threading.Thread(target=CHECK_PROXY().check_proxy, args=())
        thread.start()
</code>
Likewise, some parameters are initialized in `__init__`. `save_one_to_useArea(self, proxy)` puts a proxy that passed the test into the database's `use_area` collection.
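Once proxies land in `use_area`, a consumer only needs to turn a stored document back into the `proxies` mapping that `requests` expects. A minimal sketch of that consumer side, where the in-memory `use_area` list and the `pick_proxy` helper are illustrative stand-ins for a real `db['use_area'].find_one()` query:

```python
import random

def build_proxies(doc):
    """Turn a stored {'dl': 'ip:port'} document into a requests proxies dict."""
    return {
        'http': 'http://{}'.format(doc['dl']),
        'https': 'http://{}'.format(doc['dl']),
    }

def pick_proxy(docs):
    """Pick a random validated proxy; a real pool would query use_area."""
    return build_proxies(random.choice(docs))

# Stand-in for documents stored by save_one_to_useArea
use_area = [{'dl': '1.2.3.4:8080'}, {'dl': '5.6.7.8:3128'}]
print(pick_proxy(use_area))
```

The mapping would then be passed straight to `requests.get(url, proxies=...)`, the same shape `test_IP` builds when validating.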