Automatically Scraping Product Review Data from E-commerce Platforms with Python

The goal of this project is to scrape product review data from e-commerce platforms for user data analysis. In practice, that means writing a crawler. To do so:

First, we use Chrome's developer tools to analyze the page source and locate the key tags that hold the data we want. Pay attention to which parts of the markup change from page to page and which stay fixed; this is what makes automated traversal of deeper pages possible, so analyze it carefully.

Analyzing an Amazon product review page, the bold tags below mark the content we want to scrape. I will use one of our company's products as an example to walk through the steps:

1) rating-star: 4.0 out of 5 stars

2) review-title: A Significant Entry by Jabra

3) review-author: ;

4) review-date: on June 9, 2017

5) review-body: xxxxxxxxxxxxxxxxxxxxx
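The five fields above correspond to class names in the review markup. A minimal sketch of pulling them out with lxml and XPath follows; note that the sample HTML and class names here are illustrative assumptions for demonstration, not Amazon's exact live markup:

```python
from lxml import html

# A made-up snippet mimicking the structure of one review block
# (class names are assumptions, not Amazon's real markup).
sample = '''
<div class="review">
  <span class="review-rating">4.0 out of 5 stars</span>
  <a class="review-title">A Significant Entry by Jabra</a>
  <span class="review-date">on June 9, 2017</span>
  <span class="review-body">Great sound and a secure fit.</span>
</div>
'''

tree = html.fromstring(sample)
for review in tree.xpath('//div[@class="review"]'):
    # Each field is located relative to its review block.
    rating = review.xpath('.//span[@class="review-rating"]/text()')[0]
    title = review.xpath('.//a[@class="review-title"]/text()')[0]
    date = review.xpath('.//span[@class="review-date"]/text()')[0]
    body = review.xpath('.//span[@class="review-body"]/text()')[0]
    print(rating, title, date, body, sep=' | ')
```

The same XPath expressions, with the real class names found through Chrome's developer tools, apply to the downloaded pages.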

Page link information, using the JOBS review pages as an example:

1# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1

2# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2

3# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3

4# page:

https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_4?ie=UTF8&reviewerType=all_reviews&pageNumber=4

Comparing the review page links above shows how the page and the address are related: only the pageNumber (and the matching ref suffix) changes. The pages can therefore be traversed with code like the following.

# iterate over all pages
total_page = 4  # number of review pages (4 in the example above)
for i in range(1, total_page + 1):
    JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
# the draft code
import requests
from time import sleep

def AmazonParser(total_page):
    # Simulate a browser visit so the target site does not reject the scrape.
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'}
    pages = []
    for i in range(1, total_page + 1):
        sleep(3)  # pause between requests
        JOBS_review_link = "https://www.amazon.com/Jabra-Elite-Sport-Wireless-Earbuds/product-reviews/B01N53RO3X/ref=cm_cr_getr_d_paging_btm_" + str(i) + "?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
        try:
            page = requests.get(JOBS_review_link, headers=headers)
            pages.append(page.text)
        except requests.RequestException:
            print("getHtml2 error")
    return pages

Now let's look at scraping review data from JD.com. As before, use Chrome's developer tools to find the URL of the productCommentPage request, which returns the raw data as JSON. Put the catalog of products you want to scrape into an Excel file, like this:

(screenshot: the product list saved in Excel)

Then read the product data from the Excel file through a helper function and fetch the pages as needed:
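The JD endpoint returns JSONP rather than bare JSON: the payload is wrapped in a callback whose name is stored in the fetchJson column. The code below sketches the stripping step that the crawler performs before parsing; the callback name and payload here are made-up examples in the shape the endpoint returns:

```python
import json

# A made-up JSONP response; the callback name is an illustrative assumption.
raw = 'fetchJSON_comment98vv123({"comments": [{"nickname": "j***y", "score": 5, "content": "Good earbuds"}]});'
fetch_json = 'fetchJSON_comment98vv123'

# Remove the callback wrapper, leaving plain JSON.
payload = raw.replace(fetch_json + '(', '').replace(');', '')
data = json.loads(payload)

for c in data['comments']:
    print(c['nickname'], c['score'], c['content'])
```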

# -*- coding: utf-8 -*-
"""
Created on Jan-22-2018
@author: jerry zhong
"""
import urllib.request
import json
import time
import random
from op_excel1 import excel_table_byindex

def crawlProductComment(url, product="", fetchJson=""):
    # Read the raw data (note the gbk encoding) and strip the extra characters.
    html = urllib.request.urlopen(url).read().decode('gbk')
    html = html.replace(fetchJson + '(', '')
    html = html.replace(');', '')

    # Parse the raw data as JSON into the data dict.
    data = json.loads(html)

    # Iterate over the product comment list.
    for i in data['comments']:
        nickName = i['nickname']
        Score = str(i['score'])
        userClientShow = i['userClientShow']
        productColor = i['productColor']
        isMobile = str(i['isMobile'])
        commentTime = i['creationTime']
        content = i['content']
        # Write the key fields of each comment to a TSV file.
        try:
            with open(product + '.tsv', 'a', encoding='utf-8') as fh:
                fh.write(nickName + "\t" + Score + "\t" + userClientShow + "\t" + productColor + "\t" + isMobile + "\t" + commentTime + "\t" + content + "\n")
        except IOError:
            print("Fail to write the data into file or the list index is out of range.")

if __name__ == '__main__':
    print("please input the product_list name: ")
    file = input() + '.xlsx'

    table = excel_table_byindex(file)
    # Input the row number of the product.
    print("Please input the row number of the product: ")
    product_row_num = int(input())
    product_row = table[product_row_num]
    # Get the number of pages.
    page_number = int(product_row["pages"])
    # Get the product name.
    product = product_row["product"]
    # Get the fetchJson callback name.
    fetchJson = product_row['fetchJson']
    # Get the two halves of the product URL; the page number goes between them.
    url_1 = product_row["url_1"]
    url_2 = product_row["url_2"]
    for i in range(0, page_number + 1):
        print("Downloading page {} of data...".format(i + 1))
        # JD comment link; here is the example of the comments for Jabra products.
        url = url_1 + str(i) + url_2
        crawlProductComment(url, product, fetchJson)
        # Sleep a random interval between requests.
        time.sleep(random.randint(10, 15))
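The op_excel1 helper imported above is not shown in the article. A plausible sketch of what excel_table_byindex does, written here with openpyxl as an assumption (the original may use a different Excel library), is:

```python
from openpyxl import load_workbook

def excel_table_byindex(file, sheet_index=0):
    """Read one worksheet into a list of dicts keyed by the header row.

    A hypothetical reconstruction of the op_excel1 helper; the real
    implementation is not shown in the article.
    """
    wb = load_workbook(file, read_only=True)
    ws = wb.worksheets[sheet_index]
    rows = ws.iter_rows(values_only=True)
    headers = next(rows)  # first row holds column names: product, pages, ...
    return [dict(zip(headers, row)) for row in rows]
```

Any function returning rows as dicts with the keys product, pages, fetchJson, url_1 and url_2 would slot into the main script the same way.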

After running, the program scrapes the data automatically:

(screenshot: the scraper downloading review data)

The scraped data is saved in TSV format in a file named after the product, which can easily be imported into Excel for the next step of analysis. Following this approach, you should have little trouble writing a crawler that fits your own needs.
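Besides Excel, the TSV files can be loaded back in Python for analysis. A minimal sketch with the standard csv module, where the column order follows the write order in crawlProductComment above and average_score is an illustrative example of a first analysis step:

```python
import csv

# Column order matches what crawlProductComment writes per line.
COLUMNS = ["nickName", "score", "userClientShow", "productColor",
           "isMobile", "commentTime", "content"]

def load_reviews(path):
    # Read the tab-separated review file written by crawlProductComment.
    with open(path, encoding='utf-8') as fh:
        reader = csv.reader(fh, delimiter='\t')
        return [dict(zip(COLUMNS, row)) for row in reader]

def average_score(reviews):
    # Scores were written as strings, so convert before averaging.
    scores = [int(r["score"]) for r in reviews]
    return sum(scores) / len(scores) if scores else 0.0
```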

