Automatically Scraping Product Review Data from E-commerce Platforms with Python

The goal of this project is to scrape product review data from e-commerce platforms for user data analysis. In practice, that means writing a crawler. To do so:

First, we use Chrome's developer tools to analyze the page source and locate the key tags that hold the data we want. Pay attention to which parts of the markup change from page to page and which stay fixed; this is what makes automated traversal of deeper pages possible, so analyze it carefully.

Analyzing an Amazon product review page, the bold tags below mark the content we want to scrape. I will use one of our company's products as an example to walk through the steps:

1) rating-star: 4.0 out of 5 stars

2) review-title: A Significant Entry by Jabra

3) review-author: ;

4) review-date: on June 9, 2017

5) review-body: xxxxxxxxxxxxxxxxxxxxx
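The five fields above correspond to class names in the review markup. A minimal sketch of pulling them out with lxml and XPath follows; note that the sample HTML and class names here are illustrative assumptions for demonstration, not Amazon's exact live markup:

```python
from lxml import html

# A made-up snippet mimicking the structure of one review block
# (class names are assumptions, not Amazon's real markup).
sample = '''
<div class="review">
  <span class="review-rating">4.0 out of 5 stars</span>
  <a class="review-title">A Significant Entry by Jabra</a>
  <span class="review-date">on June 9, 2017</span>
  <span class="review-body">Great sound and a secure fit.</span>
</div>
'''

tree = html.fromstring(sample)
for review in tree.xpath('//div[@class="review"]'):
    # Each field is located relative to its review block.
    rating = review.xpath('.//span[@class="review-rating"]/text()')[0]
    title = review.xpath('.//a[@class="review-title"]/text()')[0]
    date = review.xpath('.//span[@class="review-date"]/text()')[0]
    body = review.xpath('.//span[@class="review-body"]/text()')[0]
    print(rating, title, date, body, sep=' | ')
```

The same XPath expressions, with the real class names found through Chrome's developer tools, apply to the downloaded pages.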

Page link information, using the JOBS review pages as an example:

1# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1

2# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2

3# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3

4# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_4?ie=UTF8&reviewerType=all_reviews&pageNumber=4

Comparing the review page links above shows how the page and the address are related: only the pageNumber (and the matching ref suffix) changes. The pages can therefore be traversed with code like the following.

# iterate over all pages
total_page = 4  # number of review pages (4 in the example above)
for i in range(1, total_page + 1):
    JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
# the draft code
import requests
from time import sleep

def AmazonParser(total_page):
    # Simulate a browser visit so the target site does not reject the scrape.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    pages = []
    for i in range(1, total_page + 1):
        sleep(3)  # pause between requests
        JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
        try:
            page = requests.get(JOBS_review_link, headers=headers)
            pages.append(page.text)
        except requests.RequestException:
            print("getHtml2 error")
    return pages

Now let's look at scraping review data from JD.com. As before, use Chrome's developer tools to find the URL of the productCommentPage request, which returns the raw data as JSON. Put the catalog of products you want to scrape into an Excel file, like this:

(screenshot: the product list saved in Excel)

Then read the product data from the Excel file through a helper function and fetch the pages as needed:
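The JD endpoint returns JSONP rather than bare JSON: the payload is wrapped in a callback whose name is stored in the fetchJson column. The code below sketches the stripping step that the crawler performs before parsing; the callback name and payload here are made-up examples in the shape the endpoint returns:

```python
import json

# A made-up JSONP response; the callback name is an illustrative assumption.
raw = 'fetchJSON_comment98vv123({"comments": [{"nickname": "j***y", "score": 5, "content": "Good earbuds"}]});'
fetch_json = 'fetchJSON_comment98vv123'

# Remove the callback wrapper, leaving plain JSON.
payload = raw.replace(fetch_json + '(', '').replace(');', '')
data = json.loads(payload)

for c in data['comments']:
    print(c['nickname'], c['score'], c['content'])
```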

# -*- coding: utf-8 -*-
"""
Created on Jan-22-2018
@author: jerry zhong
"""
import urllib.request
import json
import time
import random
from op_excel1 import excel_table_byindex

def crawlProductComment(url, product="", fetchJson=""):
    # Read the raw data (note the gbk encoding) and strip the extra characters.
    html = urllib.request.urlopen(url).read().decode('gbk')
    html = html.replace(fetchJson + '(', '')
    html = html.replace(');', '')

    # Parse the raw data as JSON into the data dict.
    data = json.loads(html)

    # Iterate over the product comment list.
    for i in data['comments']:
        nickName = i['nickname']
        Score = str(i['score'])
        userClientShow = i['userClientShow']
        productColor = i['productColor']
        isMobile = str(i['isMobile'])
        commentTime = i['creationTime']
        content = i['content']
        # Write the key fields of each comment to a TSV file.
        try:
            with open(product + '.tsv', 'a', encoding='utf-8') as fh:
                fh.write(nickName + "\t" + Score + "\t" + userClientShow + "\t" + productColor + "\t" + isMobile + "\t" + commentTime + "\t" + content + "\n")
        except IOError:
            print("Fail to write the data into file or the list index is out of range.")

if __name__ == '__main__':
    print("please input the product_list name: ")
    file = input() + '.xlsx'

    table = excel_table_byindex(file)
    # Input the row number of the product.
    print("Please input the row number of the product: ")
    product_row_num = int(input())
    product_row = table[product_row_num]
    # Get the number of pages.
    page_number = int(product_row["pages"])
    # Get the product name.
    product = product_row["product"]
    # Get the fetchJson callback name.
    fetchJson = product_row['fetchJson']
    # Get the two halves of the product URL; the page number goes between them.
    url_1 = product_row["url_1"]
    url_2 = product_row["url_2"]
    for i in range(0, page_number + 1):
        print("Downloading page {} of data...".format(i + 1))
        # JD comment link; here is the example of the comments for Jabra products.
        url = url_1 + str(i) + url_2
        crawlProductComment(url, product, fetchJson)
        # Sleep a random interval between requests.
        time.sleep(random.randint(10, 15))
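The op_excel1 helper imported above is not shown in the article. A plausible sketch of what excel_table_byindex does, written here with openpyxl as an assumption (the original may use a different Excel library), is:

```python
from openpyxl import load_workbook

def excel_table_byindex(file, sheet_index=0):
    """Read one worksheet into a list of dicts keyed by the header row.

    A hypothetical reconstruction of the op_excel1 helper; the real
    implementation is not shown in the article.
    """
    wb = load_workbook(file, read_only=True)
    ws = wb.worksheets[sheet_index]
    rows = ws.iter_rows(values_only=True)
    headers = next(rows)  # first row holds column names: product, pages, ...
    return [dict(zip(headers, row)) for row in rows]
```

Any function returning rows as dicts with the keys product, pages, fetchJson, url_1 and url_2 would slot into the main script the same way.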

After running, the program scrapes the data automatically:

(screenshot: the scraper downloading review data)

The scraped data is saved in TSV format in a file named after the product, which can easily be imported into Excel for the next step of analysis. Following this approach, you should have little trouble writing a crawler that fits your own needs.
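Besides Excel, the TSV files can be loaded back in Python for analysis. A minimal sketch with the standard csv module, where the column order follows the write order in crawlProductComment above and average_score is an illustrative example of a first analysis step:

```python
import csv

# Column order matches what crawlProductComment writes per line.
COLUMNS = ["nickName", "score", "userClientShow", "productColor",
           "isMobile", "commentTime", "content"]

def load_reviews(path):
    # Read the tab-separated review file written by crawlProductComment.
    with open(path, encoding='utf-8') as fh:
        reader = csv.reader(fh, delimiter='\t')
        return [dict(zip(COLUMNS, row)) for row in reader]

def average_score(reviews):
    # Scores were written as strings, so convert before averaging.
    scores = [int(r["score"]) for r in reviews]
    return sum(scores) / len(scores) if scores else 0.0
```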

