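Python Web Scraping Self-Study Notes: Scraping Novels (Part 5) 頭條網

2020-12-13 00:22:44 Anonymous

Picking up from the previous post, where the code downloaded a novel from a given txt link, this post searches the site for a user-supplied novel title, retrieves the matching download link, and downloads the novel.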

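1 Site analysis

The site's search page is at: https://www.555x.org/search.html

Inspecting the title input on the search page shows that the form is submitted with the POST method and that the entered title is sent as the parameter searchkey. We can therefore call requests.post() with the dictionary {"searchkey": "novel title"} to fetch the search results page, and then extract the novel's URL from the returned result list.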
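To make the shape of that POST request concrete, here is a minimal sketch that builds the request without sending it. It uses only the standard library so the encoded form body can be inspected; requests.post(url, data={...}) form-encodes a dictionary in the same way. The helper name build_search_request and the sample title are made up for illustration, and UTF-8 encoding of the title is an assumption about what the site expects.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

SEARCH_URL = "https://www.555x.org/search.html"  # search page named in the text


def build_search_request(novel):
    """Build (but do not send) the POST request for a title search."""
    # urlencode percent-encodes the title as UTF-8 by default; whether the
    # site expects UTF-8 is an assumption of this sketch.
    body = urlencode({"searchkey": novel}).encode("ascii")
    return Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_search_request("三体")  # sample title, chosen only for illustration
print(req.get_method())   # POST
print(req.data.decode())  # searchkey=%E4%B8%89%E4%BD%93
```

The printed body is what travels in the request; the server reads the title back out of the searchkey field.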

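2 Approach

1) Supply the novel title;

2) Search the site for the novel and extract its ID number;

3) Build the download link from that ID and download the novel.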
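Steps 2 and 3 come down to two small string operations, which can be sketched without touching the network. The function names, the sample entry text, and the sample link below are hypothetical; the sketch also assumes that the ID in a result link is a run of digits between "txt" and ".html". The download prefix is the one used in this post's code.

```python
import re

# Download prefix taken from the code in this post
DOWNLOAD_PREFIX = "https://www.555x.org/home/down/txt/id/"


def extract_title(entry_text):
    """Return the text between the first 《 and the following 》, or None."""
    match = re.search(r"《(.*?)》", entry_text)
    return match.group(1) if match else None


def extract_novel_id(link):
    """Return the digits between 'txt' and '.html' in a link, or None."""
    match = re.search(r"txt(\d+)\.html", link)
    return match.group(1) if match else None


def build_download_url(link):
    """Turn a search-result link into a download URL, or None if no ID is found."""
    novel_id = extract_novel_id(link)
    return DOWNLOAD_PREFIX + novel_id if novel_id else None


# Made-up search-result entry and link, for illustration only
entry = "《三体》全集"
link = "/txt12345.html"
print(extract_title(entry))      # 三体
print(build_download_url(link))  # https://www.555x.org/home/down/txt/id/12345
```

Returning None when nothing matches lets the caller distinguish "no such novel" from a malformed link instead of slicing with an index of -1.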

3 Implementation

The source code is as follows:

<code># crawl_v1.4
# Download a novel as a txt file

import requests
from bs4 import BeautifulSoup
import time
import proxy_ip

# Search the site and return the link of the matching novel
def get_search(novel, proxy):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive"}
    try:
        r = requests.post("https://www.555x.org/search.html", data={"searchkey": novel}, headers=headers, proxies=proxy)
        r.raise_for_status()
    except requests.RequestException:
        # Request failed or returned an error status: retry once through a new proxy
        proxy = proxy_ip.get_random_ip()
        print("Switching proxy IP")
        r = requests.post("https://www.555x.org/search.html", data={"searchkey": novel}, headers=headers, proxies=proxy)
    soup = BeautifulSoup(r.text,"html.parser")
    qq_g = soup.find_all("li","qq_g")
    link = ""
    for i in qq_g:
        s = i.text.find("》")
        # Extract the full title between 《 and 》 and compare it with the input;
        # on a match, store the link and stop; otherwise link stays empty
        if i.text[1:s] == novel:
            link = i.a.get("href")
            break
    return link

# Download the novel
def novle_download(novel,n, proxy):
    l = "https://www.555x.org/home/down/txt/id/" + n
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive"}
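    try:
        r = requests.get(l, headers=headers, proxies=proxy)
        r.raise_for_status()
    except requests.RequestException:
        # Request failed or returned an error status: retry once through a new proxy
        proxy = proxy_ip.get_random_ip()
        print("Switching proxy IP")
        r = requests.get(l, headers=headers, proxies=proxy)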
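    # Save the novel to a local file; re-encoding r.text as ISO-8859-1
    # reproduces the downloaded bytes only if requests decoded it with that codec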
    with open(novel + ".txt","w",encoding="ISO-8859-1") as f:
        f.write(r.text)

if __name__ == "__main__":
    start_time = time.time()
    novel = input("Enter the novel title: ")
    proxy = proxy_ip.get_random_ip()
    novel_link = get_search(novel, proxy)  # look up the novel's link
    if novel_link == "":
        print("Novel not found on this site")
    else:
        s = novel_link.find("txt")
        e = novel_link.find(".html")
        novel_number = novel_link[s+3:e]    # extract the novel's ID
        novle_download(novel, novel_number, proxy)  # download the novel

    # Report the total run time
    end_time = time.time()
    print("Elapsed time: " + str(round(end_time - start_time)) + "s")</code>
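The download function saves r.text with encoding="ISO-8859-1", which may look odd for a Chinese text file. A likely explanation, stated here as an assumption rather than something the post confirms, is that requests fell back to ISO-8859-1 when decoding the response because no charset was declared. Since ISO-8859-1 maps each of the 256 byte values to exactly one character, decoding and re-encoding with it returns the original bytes. The sketch below shows that round trip; the function name is hypothetical and GBK is used only as an example of a real file encoding.

```python
def latin1_round_trip(raw):
    """Decode bytes as ISO-8859-1 and encode them back, as the save step does."""
    as_text = raw.decode("iso-8859-1")   # what r.text holds under this assumption
    return as_text.encode("iso-8859-1")  # what ends up in the file


# Stand-in for downloaded file bytes; GBK is an assumed example encoding
raw = "第一章".encode("gbk")

restored = latin1_round_trip(raw)
print(restored == raw)         # True
print(restored.decode("gbk"))  # 第一章
```

If the response were decoded with any other codec, this trick would no longer hold; writing r.content to a file opened in "wb" mode stores the raw bytes directly and does not depend on the guessed encoding.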

Output: