Douban crawler suite: movies, books, groups, and albums (project source included)

Crawlers for Douban movies, books, groups, albums, the products ("東西") section, and more.

Source code: send the private message “豆瓣爬蟲” and the system will reply with the download link automatically. Download links cannot be placed in the article itself, so this is the only way.

### Required services

  1. MongoDB

### Required packages

  1. pip install scrapy

  2. pip install pybloom

  3. pip install pymongo
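pybloom is presumably pulled in for URL de-duplication: a Bloom filter answers "have I seen this URL before?" in constant memory, at the cost of rare false positives. The project itself would use pybloom's `BloomFilter(capacity, error_rate)`; the class below is a hypothetical stand-in built only on hashlib, just to illustrate the idea:

```python
import hashlib


class TinyBloomFilter(object):
    """A toy Bloom filter: k hash probes into a fixed-size bit array."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # derive k independent positions by salting the key
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, key)).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # all k bits set -> "probably seen"; any bit clear -> "definitely new"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


seen = TinyBloomFilter()
seen.add("http://www.douban.com/photos/album/135640217/")
```

A spider callback can then skip requests whose URL is already in `seen`; pybloom exposes the same `add`/`in` interface with a tunable error rate.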

### Running the Douban movie crawler

  1. Enter the douban/movie directory

  2. Run scrapy crawl movie

### Running the Douban album crawler

  1. Enter the douban/album directory

  2. Run scrapy crawl album


Main code:

```python
# encoding: utf-8
from scrapy import Field, Item


class MovieItem(Item):
    subject_id = Field()
    name = Field()
    year = Field()
    directors = Field()
    actors = Field()
    languages = Field()
    genres = Field()
    runtime = Field()
    stars = Field()         # counts of 5/4/3/2/1-star ratings, in that order
    channel = Field()
    average = Field()       # average rating
    vote = Field()          # number of raters
    tags = Field()
    watched = Field()       # "have watched" count
    wish = Field()          # "wish to watch" count
    comment = Field()       # number of short comments
    question = Field()      # number of questions
    review = Field()        # number of full reviews
    discussion = Field()    # number of discussions
    image = Field()         # number of images
    countries = Field()     # production countries
    summary = Field()


# Sample MongoDB document for a Douban album (renamed from AlbumItem so it
# does not collide with the AlbumItem class defined below)
ALBUM_DOC_SAMPLE = dict(
    from_url = "http://www.douban.com/photos/album/135640217/",
    album_name = "少年聽雨歌樓上,壯年畫雨客舟中",
    author = dict(
        home_page = "http://www.douban.com/people/isotherm/",
        nickname = "等溫線",
        avatar = "http://img3.douban.com/icon/u2152074-7.jpg",
    ),
    photos = [
        dict(
            large_img_url = "http://img3.douban.com/view/photo/photo/public/p2192138220.jpg",
            like_count = 2,
            recommend_count = 22,
            desc = "李子噠粉蒸排骨!好吃!",
            comments = [
                dict(
                    avatar = "http://img3.douban.com/icon/u42419518-2.jpg",
                    nickname = "muse",
                    post_datetime = "2014-07-29 08:37:14",
                    content = "看得流口水了",
                    home_page = "http://www.douban.com/people/yijuns89/",
                ),
            ],
        ),
    ],
    tags = ["美女", "標籤", "時尚"],
    recommend_total = 67,
    like_total = 506,
    create_date = "2014-07-21",
    photo_count = 201,
    follow_count = 37,
    desc = "蛇蛇蛇 馬馬馬",
)


class AlbumItem(Item):
    album_name = Field()
    author = Field()
    photos = Field()
    recommend_total = Field()
    like_total = Field()
    create_date = Field()
    from_url = Field()
    photo_count = Field()
    follow_count = Field()
    desc = Field()
    tags = Field()


class PhotoItem(Item):
    large_img_url = Field()
    like_count = Field()
    recommend_count = Field()
    desc = Field()
```
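Items only become useful once they are persisted. A minimal sketch of what a storage pipeline for MovieItem could look like (the class name `MongoUpsertPipeline` and the injected `collection` are assumptions for illustration, not part of the project; in production `collection` would be a pymongo collection such as `doubanDB.movie`):

```python
class MongoUpsertPipeline(object):
    """Hypothetical Scrapy item pipeline: upsert each movie by subject_id.

    `collection` is any object exposing pymongo's
    update_one(filter, update, upsert=...) interface.
    """

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        doc = dict(item)
        # subject_id is Douban's stable identifier, so re-crawling the
        # same movie overwrites the old document instead of duplicating it
        self.collection.update_one(
            {"subject_id": doc["subject_id"]},
            {"$set": doc},
            upsert=True,
        )
        return item
```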


```python
# encoding: utf-8
import scrapy
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

from misc.store import doubanDB
from parsers import *


class AlbumSpider(CrawlSpider):
    name = "album"
    allowed_domains = ["www.douban.com"]
    start_urls = [
        "http://www.douban.com/",
    ]

    rules = (
        # album detail pages (first page and ?start= pagination)
        Rule(LinkExtractor(allow=r"^http://www\.douban\.com/photos/album/\d+/($|\?start=\d+)"),
             callback="parse_album",
             follow=True
             ),

        # photo detail pages
        Rule(LinkExtractor(allow=r"^http://www\.douban\.com/photos/photo/\d+/$"),
             callback="parse_photo",
             follow=True
             ),

        # doulist listings of an album
        # Rule(LinkExtractor(allow=r"^http://www\.douban\.com/photos/album/\d+/doulists$"),
        #      follow=True
        #      ),

        # a single doulist
        Rule(LinkExtractor(allow=r"^http://www\.douban\.com/doulist/\d+/$"),
             follow=True
             ),
    )

    def parse_album(self, response):
        album_parser = AlbumParser(response)
        item = dict(album_parser.item)

        # intermediate pages are skipped; the album document is written
        # once, from the page that has no further pagination
        if album_parser.next_page:
            return None
        spec = dict(from_url=item["from_url"])
        doubanDB.album.update(spec, {"$set": item}, upsert=True)

    def parse_photo(self, response):
        single = SinglePhotoParser(response)
        from_url = single.from_url
        if from_url is None:
            return
        doc = doubanDB.album.find_one({"from_url": from_url}, {"from_url": True})

        item = dict(single.item)
        if not doc:
            # the photo arrived before its album: create a stub document;
            # photos must be a list so that $push below keeps working
            new_item = {"from_url": from_url, "photos": [item]}
            doubanDB.album.save(new_item)
        else:
            spec = {"from_url": from_url}
            # only push the photo if it is not already stored
            doc = doubanDB.album.find_one({"photos.large_img_url": item["large_img_url"]})
            if not doc:
                doubanDB.album.update(spec, {"$push": {"photos": item}})

        cp = CommentParser(response)
        comments = cp.get_comments()
        if not comments:
            return
        # the positional $ operator targets the matched photo in the array
        large_img_url = item["large_img_url"]
        spec = {"photos.large_img_url": large_img_url}
        doubanDB.album.update(spec, {"$set": {"photos.$.comments": comments}}, upsert=True)
```
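The LinkExtractor patterns decide which pages each callback ever sees, so it is worth sanity-checking them in isolation. The snippet below replays the two crawl-rule regexes from the spider against sample URLs using only the standard re module:

```python
import re

# same allow patterns as the album and photo Rules in the spider
ALBUM_RE = re.compile(r"^http://www\.douban\.com/photos/album/\d+/($|\?start=\d+)")
PHOTO_RE = re.compile(r"^http://www\.douban\.com/photos/photo/\d+/$")

# album pages match both the first page and paginated pages...
assert ALBUM_RE.match("http://www.douban.com/photos/album/135640217/")
assert ALBUM_RE.match("http://www.douban.com/photos/album/135640217/?start=18")
# ...but not the album's doulist listing, which the commented-out rule covered
assert not ALBUM_RE.match("http://www.douban.com/photos/album/135640217/doulists")

# photo URLs must end exactly after the numeric id
assert PHOTO_RE.match("http://www.douban.com/photos/photo/2192138220/")
assert not PHOTO_RE.match("http://www.douban.com/photos/photo/2192138220/comments")
```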


