Python in Practice: Batch-Downloading 1080P HD Wallpapers with Multithreading

2021-04-09 07:14:45 Anonymous

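Python is an object-oriented, interpreted programming language with a concise, clear syntax. Its reference interpreter, CPython, is open source under the Python Software Foundation License, which is GPL-compatible. Learning to write crawlers gives you an important source of data for later work in big-data analysis, data mining, and machine learning.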


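Environment Setup

requests: fetches pages over HTTP
lxml: a Python parsing library that supports HTML and XML, supports XPath queries, and parses very efficiently
Beautiful Soup 4: extracts data from HTML or XML documents

Install them by running the following pip commands in a terminal: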

<code>python -m pip install beautifulsoup4
python -m pip install lxml
python -m pip install requests
/<code>

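Implementation Approach

Use the requests library to fetch the full content of each page

Use Beautiful Soup with CSS selectors to parse the page and locate the target wallpaper

Use requests again to download the wallpaper and save it to a specific folder

Set that folder as the system's wallpaper source, and the desktop wallpaper can then update automatically every day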

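Meaning of the Request Header Fields

User-Agent: holds information about the browser. If the site checks this field and refuses requests whose User-Agent contains the word "Python" (the default User-Agent of requests starts with "python-requests"), the check acts as an anti-scraping measure. It is the simplest and most common one.
Referer: indicates which page the request came from. If the site checks this field and rejects every request that did not arrive from the expected address, that also acts as an anti-scraping measure.
Authorization: carries the credentials that identify the client, in the sense of authentication in the AAA (authentication, authorization, accounting) model, so the server can decide whether to grant access to a resource. When you visit a site in a browser, the server issues the visitor an identity that the browser sends back with later requests. If the site rejects every request that lacks this identity, that is yet another anti-scraping measure.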
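
As a rough illustration of the fields above, the sketch below builds a headers dict with a browser-like User-Agent and a Referer. The function name `build_headers`, the User-Agent strings, the default Referer, and the `token` parameter are all assumptions made for illustration; this is not the `UserAgent` helper that the article's code calls later.

```python
import random

# Example browser User-Agent strings (illustrative values, not from the article).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) "
    "Gecko/20100101 Firefox/87.0",
]


def build_headers(referer="http://www.netbian.com", token=None):
    """Return browser-like request headers (hypothetical helper)."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),  # pick one at random per call
        "Referer": referer,
    }
    if token is not None:
        # Only sent when the caller actually has credentials.
        headers["Authorization"] = token
    return headers
```

The result would be passed along with a request, for example `requests.get(url, headers=build_headers())`.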

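Extracting HTTP Proxy IPs

Choose the extraction format, the number of IPs, the supported protocol, the port, and other parameters

Generate an API link; sending an HTTP GET request to it returns the requested IPs. You can also assemble the API URL yourself from those parameters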
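
How the returned IPs might be handed to requests can be sketched as follows. The response format (one `ip:port` per line) and the name `parse_proxies` are assumptions; the real format depends on the extraction options chosen with the proxy provider. Each resulting dict has the shape that the `proxies` argument of `requests.get` accepts.

```python
def parse_proxies(api_response_text):
    """Turn 'ip:port' lines into proxies dicts for requests (hypothetical helper).

    Assumes the proxy API returns plain text with one 'ip:port' per line.
    """
    proxies = []
    for line in api_response_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        address = "http://" + line
        # Route both plain and TLS traffic through the same HTTP proxy.
        proxies.append({"http": address, "https": address})
    return proxies
```

A request through one of them would then look like `requests.get(url, proxies=parse_proxies(text)[0], timeout=10)`.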

Code Implementation

Setting Global Variables

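<code>import os
import random
import time

import requests
from bs4 import BeautifulSoup

index = 'http://www.netbian.com'  # site root URL
interval = 10  # seconds to sleep between downloads (passed to time.sleep)
firstDir = 'D:/zgh/Pictures/netbian'  # root folder for saved images
classificationDict = {}  # category name -> {'path': ..., 'url': ...}
/<code>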

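Getting the Filtered Content of a Page

url: the URL of the page

select: a selector that locates the matching elements in the HTML (it uses the same syntax as CSS selectors, which I like a lot)

Returns a list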

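<code>def screen(url, select):
    # UserAgent is a helper that this article does not show;
    # get_headers() is expected to return a dict of request headers.
    html = requests.get(url=url, headers=UserAgent.get_headers())
    html.encoding = 'gbk'  # decode the response body as GBK
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    return soup.select(select)  # list of matching elements
/<code>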
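
To make the selector idea concrete, here is a small self-contained sketch that runs a CSS selector against an inline HTML fragment. The fragment is made up for the demonstration and is not the real site's markup, and it uses the standard-library `html.parser` backend instead of `lxml` so that it runs with only Beautiful Soup installed.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, only for demonstrating selector syntax.
html = """
<div id="header"><ul>
  <li><a href="/fengjing/">Scenery</a></li>
  <li><a href="/dongman/">Anime</a></li>
</ul></div>
"""

# 'html.parser' ships with Python; the article's code uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# '>' is the CSS child combinator, exactly as in a stylesheet.
links = soup.select("#header > ul > li > a")

hrefs = [a.get("href") for a in links]  # attribute values
texts = [a.string for a in links]       # link text
```

`select` always returns a list, so an empty list means that nothing matched.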

Getting the URLs of All Categories

<code>
def init_classification():
    global classificationDict
    url = index
    select = '#header > div.head > ul > li:nth-child(1) > div > a'
    classifications = screen(url, select)
    for c in classifications:
        href = c.get('href')  # link of the category, relative to the site root
        text = c.string  # category name
        if text == '4k壁紙':  # skip the 4K wallpaper category
            continue
        secondDir = firstDir + '/' + text  # folder for this category
        url = index + href  # absolute URL of the category
        classificationDict[text] = {
            'path': secondDir,
            'url': url
        }
/<code>

Locating and Downloading Images

<code>
def handleImgs(links, path):
    for link in links:
        href = link.get('href')
        if href == 'http://pic.netbian.com/':  # skip the link to the other site
            continue

        # Build the URL of the image's detail page
        if 'http://' in href:
            url = href
        else:
            url = index + href
        select = 'div#main div.endpage div.pic div.pic-down a'
        found = screen(url, select)
        if found == []:
            print(url + ' image not found, scraping failed')
            continue
        href = found[0].get('href')

        # Follow the link found on the detail page
        url = index + href

        # Locate the image element itself
        select = 'div#main table a img'
        found = screen(url, select)
        if found == []:
            print(url + ' this image requires login, scraping failed')
            continue
        # Drop tabs and characters that are not allowed in Windows file names
        name = found[0].get('alt')
        for ch in '\t|:\\/*?"<>':
            name = name.replace(ch, '')
        print(name)
        src = found[0].get('src')
        if requests.get(src).status_code == 404:
            print(url + ' image download link returned 404, scraping failed')
            print()
            continue
        print()
        download(src, name, path)
        time.sleep(interval)  # pause between downloads
/<code>
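
The name-cleaning step in `handleImgs` strips tabs and the characters that Windows does not allow in file names. An equivalent and more compact way to express it is a single regular expression, sketched below; the name `sanitize_filename` is an assumption for illustration and does not appear in the original code.

```python
import re

# Tab plus the characters Windows forbids in file names: \ / : * ? " < > |
_INVALID_CHARS = re.compile(r'[\t\\/:*?"<>|]')


def sanitize_filename(name):
    """Remove characters that cannot appear in a Windows file name."""
    return _INVALID_CHARS.sub("", name)
```

Keeping the rule in one pattern makes it easy to see which characters are removed and to extend the set later.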

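<code># Download one image and save it as a .jpg file
def download(src, name, path):
    if isinstance(src, str):
        response = requests.get(src)
        os.makedirs(path, exist_ok=True)  # make sure the target folder exists
        path = path + '/' + name + '.jpg'
        while os.path.exists(path):  # file name already taken
            # splitext only splits off the extension, so dots elsewhere
            # in the path or the name are left alone
            root, ext = os.path.splitext(path)
            path = root + str(random.randint(2, 17)) + ext
        with open(path, 'wb') as pic:
            for chunk in response.iter_content(128):
                pic.write(chunk)
/<code>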
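
The title mentions multithreading, but the functions shown download images one at a time. A minimal sketch of running downloads concurrently with the standard library's `concurrent.futures.ThreadPoolExecutor` follows. The name `download_all`, the default `max_workers=4`, and the `(src, name, path)` task format are assumptions; the `download` argument stands in for any function with the same signature as the article's `download(src, name, path)`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(tasks, download, max_workers=4):
    """Run download(src, name, path) for each task in a thread pool.

    tasks: iterable of (src, name, path) tuples.
    Returns a list of (src, exception) pairs for the tasks that raised.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its task so failures can be reported.
        futures = {pool.submit(download, *task): task for task in tasks}
        for future in as_completed(futures):
            src = futures[future][0]
            try:
                future.result()  # re-raises any exception from the worker
            except Exception as exc:  # keep going if one download fails
                failures.append((src, exc))
    return failures
```

With a pool, the per-image `time.sleep(interval)` no longer limits the overall request rate, so a small `max_workers` value is the more considerate choice toward the site.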