Crawlers for Douban movies, books, groups, albums, products ("Dongxi"), and more.
Code download: send the private message "豆瓣爬蟲" and the system will automatically reply with the download link. Download links cannot be placed in the article itself, so this is the only way.
### Required Services
MongoDB
### Required Packages
```
pip install scrapy
pip install pybloom
pip install pymongo
```
### Running the Douban Movie Crawler
Enter the `douban/movie` directory and run `scrapy crawl movie`.
### Running the Douban Album Crawler
Enter the `douban/album` directory and run `scrapy crawl album`.
The key code is shown below.
The item definitions:

```python
# encoding: utf-8
from scrapy import Field, Item


class MovieItem(Item):
    subject_id = Field()
    name = Field()
    year = Field()
    directors = Field()
    actors = Field()
    languages = Field()
    genres = Field()
    runtime = Field()
    stars = Field()       # counts of 5/4/3/2/1-star ratings, in 5 4 3 2 1 order
    channel = Field()
    average = Field()     # average rating
    vote = Field()        # number of ratings
    tags = Field()
    watched = Field()     # "watched" count
    wish = Field()        # "wish to watch" count
    comment = Field()     # short-comment count
    question = Field()    # question count
    review = Field()      # full-review count
    discussion = Field()  # discussion count
    image = Field()       # image count
    countries = Field()   # production countries
    summary = Field()


# Sample MongoDB document format for a Douban album
# (renamed so it does not clash with the AlbumItem class below)
album_doc_example = dict(
    from_url = "http://www.douban.com/photos/album/135640217/",
    album_name = "少年聽雨歌樓上,壯年畫雨客舟中",
    author = dict(
        home_page = "http://www.douban.com/people/isotherm/",
        nickname = "等溫線",
        avatar = "http://img3.douban.com/icon/u2152074-7.jpg",
    ),
    photos = [
        dict(
            large_img_url = "http://img3.douban.com/view/photo/photo/public/p2192138220.jpg",
            like_count = 2,
            recommend_count = 22,
            desc = "李子噠粉蒸排骨!好吃!",
            comments = [
                dict(
                    avatar = "http://img3.douban.com/icon/u42419518-2.jpg",
                    nickname = "muse",
                    post_datetime = "2014-07-29 08:37:14",
                    content = "看得流口水了",
                    home_page = "http://www.douban.com/people/yijuns89/",
                ),
            ]
        ),
    ],
    tags = ["美女", "標籤", "時尚"],
    recommend_total = 67,
    like_total = 506,
    create_date = "2014-07-21",
    photo_count = 201,
    follow_count = 37,
    desc = "蛇蛇蛇 馬馬馬",
)


class AlbumItem(Item):
    album_name = Field()
    author = Field()
    photos = Field()
    recommend_total = Field()
    like_total = Field()
    create_date = Field()
    from_url = Field()
    photo_count = Field()
    follow_count = Field()
    desc = Field()
    tags = Field()


class PhotoItem(Item):
    large_img_url = Field()
    like_count = Field()
    recommend_count = Field()
    desc = Field()
```
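For context on the `stars` field above: since it holds the per-star vote counts in 5→1 order, a weighted mean over those buckets should roughly reproduce the `average` field (Douban averages are on a 10-point scale, so 5 stars maps to 10 points). A quick illustrative check, with made-up counts:

```python
def star_average(stars):
    """Approximate the 10-point average from 5/4/3/2/1-star vote counts."""
    points = [10, 8, 6, 4, 2]  # 5 stars = 10 points, ..., 1 star = 2 points
    return sum(p * n for p, n in zip(points, stars)) / sum(stars)

print(star_average([500, 300, 150, 40, 10]))  # → 8.48
```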
The album spider:

```python
# encoding: utf-8
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from misc.store import doubanDB
from parsers import *


class AlbumSpider(CrawlSpider):
    name = "album"
    allowed_domains = ["www.douban.com"]
    start_urls = [
        "http://www.douban.com/",
    ]

    rules = (
        # album detail pages
        Rule(LinkExtractor(allow=r"^http://www\.douban\.com/photos/album/\d+/($|\?start=\d+)"),
             callback="parse_album",
             follow=True
        ),

        # photo detail pages
        Rule(LinkExtractor(allow=r"^http://www\.douban\.com/photos/photo/\d+/$"),
             callback="parse_photo",
             follow=True
        ),

        # doulist collections
        # Rule(LinkExtractor(allow=r"^http://www\.douban\.com/photos/album/\d+/doulists$"),
        #      follow=True
        # ),

        # individual doulists
        Rule(LinkExtractor(allow=r"^http://www\.douban\.com/doulist/\d+/$"),
             follow=True
        ),
    )

    def parse_album(self, response):
        album_parser = AlbumParser(response)
        item = dict(album_parser.item)

        if album_parser.next_page:
            return None
        spec = dict(from_url=item["from_url"])
        doubanDB.album.update(spec, {"$set": item}, upsert=True)

    def parse_photo(self, response):
        single = SinglePhotoParser(response)
        from_url = single.from_url
        if from_url is None:
            return
        doc = doubanDB.album.find_one({"from_url": from_url}, {"from_url": True})

        item = dict(single.item)
        if not doc:
            # parent album not stored yet: create a stub document for it
            new_item = {"from_url": from_url, "photos": item}
            doubanDB.album.save(new_item)
        else:
            # only push the photo if this large_img_url has not been stored yet
            spec = {"from_url": from_url}
            doc = doubanDB.album.find_one({"photos.large_img_url": item["large_img_url"]})
            if not doc:
                doubanDB.album.update(spec, {"$push": {"photos": item}})

        cp = CommentParser(response)
        comments = cp.get_comments()
        if not comments:
            return
        large_img_url = item["large_img_url"]
        spec = {"photos.large_img_url": large_img_url}
        doubanDB.album.update(spec, {"$set": {"photos.$.comments": comments}}, upsert=True)
```
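The `allow` patterns in the crawl rules above are ordinary anchored regexes, so they can be checked against sample URLs with the standard `re` module to verify which pages each rule captures:

```python
import re

album_re = re.compile(r"^http://www\.douban\.com/photos/album/\d+/($|\?start=\d+)")
photo_re = re.compile(r"^http://www\.douban\.com/photos/photo/\d+/$")

# album rule matches both the first page and paginated pages
assert album_re.match("http://www.douban.com/photos/album/135640217/")
assert album_re.match("http://www.douban.com/photos/album/135640217/?start=18")

# photo rule matches only single-photo pages, not album pages
assert photo_re.match("http://www.douban.com/photos/photo/2192138220/")
assert not photo_re.match("http://www.douban.com/photos/album/135640217/")
```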
Read more articles from Python樂園.