Scraping Douban movie titles and images with Python

It's been a while since I last wrote a crawler, and anti-scraping measures have clearly gotten better and better. As a newbie, I know better than to pick a fight with them.



Without further ado: of what I tried to scrape today, one part succeeded and the other did not.

Let's start with the part that worked: the weekly word-of-mouth chart.



import requests
from lxml import html

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Host': 'movie.douban.com',
    'Referer': 'https://movie.douban.com/',
}

cookies={'Cookie':'ll="118258"; bid=WjiGjLlFCwI; __utmc=30149280; __utmc=223695111; __yadk_uid=fy8M46j1iO3PUKWGE75dRPbCqUP9n7CW; _vwo_uuid_v2=D1C730BC2CB2CE6E0FCF3E10961B2672A|e7629eabee29d3cc274a32bb501a5d5b; __gads=ID=4d8dc7b338f043fc:T=1584598234:S=ALNI_MZwFk-uYpnSqY7uHX6hFY1-WQidmQ; __utmz=30149280.1584600178.3.2.utmcsr=blog.csdn.net|utmccn=(referral)|utmcmd=referral|utmcct=/csqazwsxedc/article/details/68498842; __utmz=223695111.1584600178.3.2.utmcsr=blog.csdn.net|utmccn=(referral)|utmcmd=referral|utmcct=/csqazwsxedc/article/details/68498842; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1584607034%2C%22https%3A%2F%2Fblog.csdn.net%2Fcsqazwsxedc%2Farticle%2Fdetails%2F68498842%22%5D; _pk_id.100001.4cf6=694d5c0e1a9117cb.1584593919.5.1584607034.1584602083.; _pk_ses.100001.4cf6=*; __utma=30149280.1550415036.1584593919.1584602084.1584607034.5; __utmb=30149280.0.10.1584607034; __utma=223695111.1269849409.1584593919.1584602084.1584607034.5; __utmb=223695111.0.10.1584607034'}

url = 'https://movie.douban.com/'  # the page to scrape

page = requests.Session().get(url, headers=headers, cookies=cookies)
tree = html.fromstring(page.text)
result = tree.xpath('//td[@class="title"]//a/text()')
for x, y in enumerate(result, start=1):
    print(x, ' ', y, '\n')
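The XPath step can be checked offline, without hitting Douban at all. The fragment below is a hypothetical stand-in for the page's markup (the assumption being that each chart entry sits in a `<td class="title">` containing an `<a>`), just to show what `//td[@class="title"]//a/text()` returns:

```python
from lxml import html

# Hypothetical fragment shaped like the chart markup (an assumption,
# not a copy of the live page).
snippet = '''
<table>
  <tr><td class="title"><a href="/subject/1/">Movie A</a></td></tr>
  <tr><td class="title"><a href="/subject/2/">Movie B</a></td></tr>
</table>
'''
tree = html.fromstring(snippet)
# Same XPath as the scraper: the text of every <a> under td.title
titles = tree.xpath('//td[@class="title"]//a/text()')
for rank, name in enumerate(titles, start=1):
    print(rank, name)
```

This prints `1 Movie A` and `2 Movie B`, confirming the selector before you point it at the live page.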


Next, we want to scrape all the images on the page.

So I thought of BeautifulSoup's find_all in bs4, or select with tag selectors, to get there. However, things did not go as hoped. Below is what I did with xpath and select.




After running the program, you will see sixteen images in the save directory. Since my script saves into the directory it runs from, which is the desktop, the result looks like this:



import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
           'Host': 'movie.douban.com', 'Referer': 'https://movie.douban.com/'}

url = 'https://movie.douban.com/'  # the page to scrape

pa = requests.get(url, headers=headers, cookies=cookies)  # cookies dict from the first script
soup = BeautifulSoup(pa.content, 'lxml')
img = soup.select('img')
# img = soup.find_all('img')
for i in img:
    src = i.get('src')
    if not src:
        continue
    if src.endswith('.jpg'):
        with open('{}.jpg'.format(src[-10:-5]), 'wb') as f:
            result = requests.get(src)
            f.write(result.content)
    if src.endswith('.png'):
        with open('{}.png'.format(src[-9:-5]), 'wb') as f:
            result = requests.get(src)
            f.write(result.content)
    if src.endswith('.gif'):
        with open('{}.gif'.format(src[-9:-5]), 'wb') as f:
            result = requests.get(src)
            f.write(result.content)


Clearly, these are not all of the images.

Now look at the following program.


import requests
from lxml import html

# tree is the parsed page from the first script above
ss = tree.xpath('//img')
for y in ss:
    for x in y.xpath('@src'):  # @src yields a list of URL strings
        if x.endswith('.jpg'):
            with open('{}.jpg'.format(x[-9:-5]), 'wb') as f:
                li = requests.get(x)
                f.write(li.content)
        if x.endswith('.gif'):
            with open('{}.gif'.format(x[-9:-5]), 'wb') as f:
                li = requests.get(x)
                f.write(li.content)
        if x.endswith('.png'):
            with open('{}.png'.format(x[-9:-5]), 'wb') as f:
                li = requests.get(x)
                f.write(li.content)
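Both scripts repeat the same save logic once per extension. A tidier way is to derive the extension from the URL itself; the sketch below uses the standard library for this (the filename scheme here is my own assumption, not the original post's):

```python
import os
from urllib.parse import urlparse


def save_name(url):
    """Derive a local filename from an image URL, or None if the
    extension is not one we save. A sketch, not the post's code."""
    path = urlparse(url).path            # strip query string, keep path
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() in ('.jpg', '.png', '.gif'):
        return stem + ext.lower()
    return None


print(save_name('https://img1.doubanio.com/view/photo/p254.jpg'))  # p254.jpg
print(save_name('https://movie.douban.com/style/main.css'))        # None
```

One `save_name` call then replaces the three `endswith` branches, and filenames no longer depend on fragile slices like `x[-9:-5]`.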


Running it gives the same result as above; this newbie is stumped.
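One plausible explanation, offered here as an assumption to verify against the live page: sites that lazy-load images often put a placeholder in `src` and keep the real URL in another attribute such as `data-original`, so reading only `src` misses most pictures. The sketch below, on a hypothetical fragment, shows how a fallback would look:

```python
from bs4 import BeautifulSoup

# Hypothetical lazy-loaded markup (an assumption about the page,
# not copied from Douban).
snippet = '''
<img class="lazy" src="placeholder.gif" data-original="https://img.example.com/real1.jpg">
<img src="https://img.example.com/plain2.png">
'''
soup = BeautifulSoup(snippet, 'html.parser')
# Prefer data-original when present, else fall back to src
urls = [tag.get('data-original') or tag.get('src')
        for tag in soup.find_all('img')]
print(urls)
```

Inspect the real page in the browser's developer tools to confirm which attribute actually carries the URL before relying on this.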

