Python in Practice: Batch-Downloading 1080P HD Wallpapers with Multithreading

2021-04-09 07:14:45 Anonymous

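Python is an object-oriented, interpreted programming language with clean, concise syntax. Its reference interpreter, CPython, is open source under the Python Software Foundation License, which is GPL-compatible. Learning to write crawlers gives you an important source of data for later work in big-data analysis, data mining, and machine learning.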

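Environment setup

requests: fetches pages over HTTP
lxml: a Python parsing library for HTML and XML that supports XPath and parses very efficiently
Beautiful Soup 4: extracts data from HTML or XML files

Run the following pip commands in a terminal to install them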

<code>python -m pip install beautifulsoup4
python -m pip install lxml
python -m pip install requests
/<code>

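Implementation approach

Use requests to fetch the full content of each page

Use Beautiful Soup with CSS selectors to parse the page and locate the target wallpapers

Use requests to download each wallpaper and save it to a specific folder

Point the system's wallpaper slideshow at that folder, and the desktop wallpaper can update automatically every day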

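Meaning of the Request Headers fields

User-Agent: carries information about the browser. If the server validates this field and rejects any User-Agent containing "Python" (requests identifies itself as python-requests/<version> by default), it works as an anti-crawler measure. This is one of the simplest and most common anti-crawler techniques.
Referer: indicates the page the request came from, and can also be used against crawlers. If the server checks it and rejects every request that did not come from the expected address, crawlers that omit or misstate it are blocked.
authorization: carries the credentials that authenticate the client, which is the authentication step of the AAA (authentication, authorization, accounting) model that gates access to a resource. When you visit the site in a browser, the server issues the visitor such an identity. If the server checks this field and rejects requests that lack it, that too blocks crawlers.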
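
The `screen` function later in this article calls `UserAgent.get_headers()`, whose definition is not shown. Below is a minimal sketch of what such a helper might look like, assuming it only needs to return a headers dict with a randomly chosen browser User-Agent and a Referer. The class body, the `referer` parameter, and the User-Agent strings are illustrative assumptions, not the author's code.

```python
import random


class UserAgent:
    """Hypothetical helper: builds request headers that resemble a browser's."""

    # Illustrative browser User-Agent strings (assumed, not from the article)
    AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) '
        'Gecko/20100101 Firefox/87.0',
    ]

    @staticmethod
    def get_headers(referer='http://www.netbian.com'):
        # Pick a User-Agent at random so requests do not all look identical
        return {
            'User-Agent': random.choice(UserAgent.AGENTS),
            'Referer': referer,
        }


headers = UserAgent.get_headers()
print(headers['Referer'])
```

The returned dict can be passed straight to `requests.get(url, headers=headers)`.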

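Extracting HTTP proxy IPs

Choose the extraction format, the number of IPs, the supported protocol, the port, and other parameters

Generate an API link; sending it an HTTP GET request returns the requested IPs. The API URL can also be assembled directly from these parameters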
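
The article does not show what the proxy API returns. As a rough sketch, assuming the response body is plain text with one `ip:port` pair per line, the result could be converted into the `proxies` dictionaries that `requests.get` accepts. The function name `parse_proxy_list` and the sample addresses (taken from the documentation-only IP ranges) are hypothetical.

```python
def parse_proxy_list(text):
    """Turn 'ip:port' lines into proxies dicts usable with requests."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines in the response
        address = 'http://' + line
        # Route both plain and TLS traffic through the same HTTP proxy
        proxies.append({'http': address, 'https': address})
    return proxies


# Made-up response body; a real one would come from the proxy provider's API
sample = '192.0.2.10:8080\n198.51.100.7:3128\n'
pool = parse_proxy_list(sample)
print(pool[0])
```

Each entry can then be supplied as `requests.get(url, proxies=pool[0])`.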

Code implementation

Set global variables

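<code># Imports used by the functions in this article
import os
import random
import time

import requests
from bs4 import BeautifulSoup

index = 'http://www.netbian.com'  # site root
interval = 10  # seconds to wait between downloads
firstDir = 'D:/zgh/Pictures/netbian'  # root folder for saved images
classificationDict = {}  # category name -> {'path': ..., 'url': ...}
/<code>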

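Get the filtered list of elements from a page

url: the URL of the page

select: a selector (it uses the same syntax as CSS selectors, which I like a lot, and locates the matching elements in the HTML)

Returns a list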

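<code>def screen(url, select):
    # UserAgent.get_headers() is a helper that supplies the request headers
    html = requests.get(url=url, headers=UserAgent.get_headers())
    html.encoding = 'gbk'  # decode the response body as GBK
    html = html.text
    soup = BeautifulSoup(html, 'lxml')
    # Return every element that matches the CSS selector
    return soup.select(select)/<code>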
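
`screen` sets `html.encoding = 'gbk'` before reading `html.text`, which tells requests which codec to use when decoding the body. The sketch below uses only the standard library and a made-up sample string to show why the codec matters. That the site's pages are GBK-encoded is inferred from that line of code.

```python
title = '风景壁纸'           # sample text, "landscape wallpapers"
raw = title.encode('gbk')    # bytes as a GBK-encoded page would carry them

# Decoding with the wrong codec does not fail, it silently yields mojibake
garbled = raw.decode('iso-8859-1')
# Decoding with the matching codec recovers the original text
correct = raw.decode('gbk')

print(garbled == title)
print(correct == title)
```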

Get the URLs of all categories

<code>def init_classification():
    url = index
    select = '#header > div.head > ul > li:nth-child(1) > div > a'
    classifications = screen(url, select)
    for c in classifications:
        href = c.get('href')  # relative link of the category
        text = c.string  # category name
        if text == '4k壁纸':  # skip the "4K wallpapers" category
            continue
        secondDir = firstDir + '/' + text  # folder for this category
        url = index + href  # absolute URL of the category
        # Item assignment mutates the module-level dict; no `global` needed
        classificationDict[text] = {
            'path': secondDir,
            'url': url
        }/<code>
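
The code builds absolute URLs with `index + href`, which works as long as every `href` is a relative path starting with `/`. A slightly more defensive sketch, assuming nothing about the site beyond that, uses `urllib.parse.urljoin`, which also leaves already-absolute links untouched. The helper name `category_entry` and the sample category name and path are made up for illustration.

```python
from urllib.parse import urljoin

index = 'http://www.netbian.com'
first_dir = 'D:/zgh/Pictures/netbian'


def category_entry(text, href):
    """Build one classificationDict value for a category name and its link."""
    return {
        'path': first_dir + '/' + text,
        # urljoin resolves relative hrefs and keeps absolute ones as they are
        'url': urljoin(index, href),
    }


# '动漫' and '/dongman/' are hypothetical examples of a category name and link
entry = category_entry('动漫', '/dongman/')
print(entry['url'])
```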

Locate the images and download them

<code>def handleImgs(links, path):
    for link in links:
        href = link.get('href')
        if href == 'http://pic.netbian.com/':  # skip the link to pic.netbian.com
            continue

        # Open the image's own page and look for the download link
        if 'http://' in href:  # already an absolute URL
            url = href
        else:
            url = index + href
        select = 'div#main div.endpage div.pic div.pic-down a'
        result = screen(url, select)
        if not result:
            print(url + ' image not found, scraping failed')
            continue
        href = result[0].get('href')

        # Follow the download link
        url = index + href

        # Read the image element on that page
        select = 'div#main table a img'
        result = screen(url, select)
        if not result:
            print(url + ' this image requires login, scraping failed')
            continue
        # Strip tabs and the characters Windows forbids in file names
        name = result[0].get('alt')
        for ch in '\t|:\\/*?"<>':
            name = name.replace(ch, '')
        print(name)
        src = result[0].get('src')
        if requests.get(src).status_code == 404:
            print(url + ' download link returned 404, scraping failed')
            print()
            continue
        print()
        download(src, name, path)
        time.sleep(interval)  # pause between downloads
/<code>
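
The chain of `replace` calls in `handleImgs` removes tabs and the characters Windows does not allow in file names. The same cleanup can be sketched with a single regular expression; the names `ILLEGAL` and `clean_name` and the sample title are hypothetical.

```python
import re

# Tab plus the characters Windows forbids in file names: \ / : * ? " < > |
ILLEGAL = re.compile(r'[\t\\/:*?"<>|]')


def clean_name(alt):
    """Remove characters that cannot appear in a Windows file name."""
    return ILLEGAL.sub('', alt)


print(clean_name('sunset: lake|4K?'))
```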

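<code># Download one image
def download(src, name, path):
    if isinstance(src, str):
        response = requests.get(src)
        path = path + '/' + name + '.jpg'
        while os.path.exists(path):  # the file name is already taken
            # Append a random number before the extension; splitext stays
            # correct even when the name itself contains a dot
            root, ext = os.path.splitext(path)
            path = root + str(random.randint(2, 17)) + ext
        with open(path, 'wb') as pic:
            # Write the body to disk in 128-byte chunks
            for chunk in response.iter_content(128):
                pic.write(chunk)/<code>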
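
The functions above run sequentially, one image at a time with `interval` seconds between downloads. One way the multithreading named in the title might be added is to hand each category to a worker thread through the standard library's `concurrent.futures`. In this sketch `crawl_category` and the sample dictionary are hypothetical stand-ins: a real worker would call `screen` and `handleImgs` for its category instead of returning a string.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_category(name, info):
    """Stand-in for the per-category work (e.g. screen() then handleImgs())."""
    return name + ' -> ' + info['path']


# Same shape as classificationDict; these entries are made up
categories = {
    'A': {'path': 'D:/zgh/Pictures/netbian/A', 'url': 'http://www.netbian.com/a/'},
    'B': {'path': 'D:/zgh/Pictures/netbian/B', 'url': 'http://www.netbian.com/b/'},
}

# Run the categories concurrently on at most 4 worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(crawl_category, n, i) for n, i in categories.items()]
    # result() waits for each task and re-raises any exception from the worker
    results = [f.result() for f in futures]

for line in results:
    print(line)
```

Because each category saves into its own folder, the workers would not compete for file names; the per-download `time.sleep(interval)` would still apply inside each thread.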