Preface
- What the code does, as the title says: given a Kuaishou user's id, crawl all of that user's public works, both image sets and videos.
- How it works: use the devtools built into any Chromium-based browser to inspect the page's network requests, find the one whose response contains the work URLs, replay that request in code to get the data, and then download each work from its URL. The same approach covers things like automated sign-up, login, and other site interactions; anyone who has written a crawler will recognize it.
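As a minimal sketch of that idea: the request captured in devtools can be rebuilt and replayed with `requests`. The endpoint URL and headers below are placeholders (the real values come from your own devtools session), and the GraphQL query body is elided; only the payload shape mirrors the crawler further down.

```python
import json

# Hypothetical values: in practice, copy the endpoint and headers from the
# request you found in the devtools Network tab.
DATA_URL = "https://example.com/m_graphql"
HEADERS = {"User-Agent": "Mozilla/5.0", "Content-Type": "application/json"}

def build_payload(uid, count=999):
    """Rebuild the JSON body the page itself sends to fetch a user's feed."""
    return {
        "operationName": "privateFeedsQuery",
        "variables": {"principalId": uid, "pcursor": "", "count": count},
        "query": "query privateFeedsQuery(...) { ... }",  # full query omitted
    }

payload = build_payload("3xabcdef")
# requests.post(DATA_URL, headers=HEADERS, json=payload) would then return
# the same JSON the browser receives.
print(json.dumps(payload["variables"]))
```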
Core code
- Without further ado, here is the core code:
<code>
def __crawl_user(self, uid):
    if uid.isdigit():
        uid = self.__switch_id(uid)
    payload = {
        "operationName": "privateFeedsQuery",
        "variables": {"principalId": uid, "pcursor": "", "count": 999},
        "query": "query privateFeedsQuery($principalId: String, $pcursor: String, $count: Int) {\n privateFeeds(principalId: $principalId, pcursor: $pcursor, count: $count) {\n pcursor\n list {\n id\n thumbnailUrl\n poster\n workType\n type\n useVideoPlayer\n imgUrls\n imgSizes\n magicFace\n musicName\n caption\n location\n liked\n onlyFollowerCanComment\n relativeHeight\n timestamp\n width\n height\n counts {\n displayView\n displayLike\n displayComment\n __typename\n }\n user {\n id\n eid\n name\n avatar\n __typename\n }\n expTag\n __typename\n }\n __typename\n }\n}\n"
    }
    res = requests.post(self.__data_url, headers=self.__headers, json=payload)
    works = json.loads(res.content.decode(encoding='utf-8', errors='strict'))['data']['privateFeeds']['list']
    if not os.path.exists("../data"):
        os.makedirs("../data")
    # These two lines dump the response to a JSON file for analysis:
    # with open("data/" + uid + ".json", "w") as fp:
    #     fp.write(json.dumps(works, indent=2))
    # If the user is live-streaming, the first "work" is the live stream and
    # its fields are None, so drop it
    if works[0]['id'] is None:
        works.pop(0)
    name = re.sub(r'[\\/:*?"<>|\r\n]+', "", works[0]['user']['name'])
    dir = "data/" + name + "(" + uid + ")/"
    if not os.path.exists(dir):
        os.makedirs(dir)
    print("Crawling user " + name + ", saving to " + dir)
    print("Found " + str(len(works)) + " works in total")
    for j in range(len(works)):
        self.__crawl_work(uid, dir, works[j], j + 1)
        time.sleep(1)
    print("User " + name + " crawled!")
    time.sleep(1)
</code>
Kuaishou works come in five types, exposed as the workType attribute:
- Two image-set types: vertical and multiple, i.e. stitched long images and multi-image posts; all image links are in imgUrls
- A single image: single, with its link likewise in imgUrls
- Karaoke: ksong, same image links; crawling the audio is not attempted...
- Video: video, which requires parsing the share page's HTML to get the video link
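The crawler below routes on workType with a plain if/elif chain; the same routing can be sketched as a dispatch table. The handler names and the returned strings here are purely illustrative, not part of the crawler:

```python
# Illustrative dispatch: the four image-like types share one handler since
# they all expose imgUrls, while video gets its own handler.
def save_images(work):
    return "images:" + str(len(work["imgUrls"]))

def save_video(work):
    return "video:" + work["id"]

HANDLERS = {
    "vertical": save_images,
    "multiple": save_images,
    "single": save_images,
    "ksong": save_images,   # K-song works also expose imgUrls
    "video": save_video,
}

def crawl_work(work):
    handler = HANDLERS.get(work["workType"])
    if handler is None:
        return "unknown type"
    return handler(work)

print(crawl_work({"workType": "single", "imgUrls": ["u1", "u2"]}))  # images:2
print(crawl_work({"workType": "video", "id": "abc"}))               # video:abc
```

A dict keeps the type-to-handler mapping in one place, so adding a sixth workType means one new entry rather than another elif branch.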
<code>
def __crawl_work(self, uid, dir, work, wdx):
    w_type = work['workType']
    w_caption = re.sub(r"\s+", " ", work['caption'])
    w_name = re.sub(r'[\/:*?"<>|\r\n]+', "", w_caption)[0:24]
    w_time = time.strftime('%Y-%m-%d', time.localtime(work['timestamp'] / 1000))
    if w_type == 'vertical' or w_type == 'multiple' or w_type == 'single' or w_type == 'ksong':
        w_urls = work['imgUrls']
        l = len(w_urls)
        print("  " + str(wdx) + ") Image work: " + w_caption + ", " + str(l) + " images in total")
        for i in range(l):
            p_name = w_time + "_" + w_name + "_" + str(i + 1) + ".jpg"
            pic = dir + p_name
            if not os.path.exists(pic):
                r = requests.get(w_urls[i])
                r.raise_for_status()
                with open(pic, "wb") as f:
                    f.write(r.content)
                print("    " + str(i + 1) + "/" + str(l) + " image " + p_name + " downloaded √")
            else:
                print("    " + str(i + 1) + "/" + str(l) + " image " + p_name + " already exists √")
    elif w_type == 'video':
        w_url = self.__work_url + work['id']
        res = requests.get(w_url, headers=self.__headers_mobile,
                           params={"fid": 1841409882, "cc": "share_copylink", "shareId": "143108986354"})
        html = res.text
        # The share page embeds the unwatermarked video URL in a "srcNoMark" field
        waitreplace = work['id'] + '".*?"srcNoMark":"(.*?)"'
        v_url = re.findall(waitreplace, html)
        try:
            print("  " + str(wdx) + ") Video work: " + w_caption)
        except:
            print("  Something went slightly wrong here, skipped")
        v_name = w_time + "_" + w_name + ".mp4"
        video = dir + v_name
        if v_url:
            if not os.path.exists(video):
                r = requests.get(v_url[0])
                r.raise_for_status()
                with open(video, "wb") as f:
                    f.write(r.content)
                print("    Video " + v_name + " downloaded √")
            else:
                print("    Video " + v_name + " already exists √")
        else:
            print("Video URL not found")
    else:
        print("Unexpected work type")
</code>
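The `srcNoMark` extraction above can be exercised on a fabricated page fragment. Only the regex shape comes from the crawler; the work id, URL, and HTML snippet below are made up:

```python
import re

# Fabricated fragment shaped like the JSON the share page embeds; in the
# crawler, `html` is the full response body of the mobile share URL.
work_id = "5212345678901234567"
html = ('{"id":"' + work_id + '","caption":"demo",'
        '"srcNoMark":"https://example.com/video.mp4"}')

# Same pattern the crawler builds: skip everything between the work id and
# the first srcNoMark field, then capture the unwatermarked video URL.
pattern = work_id + '".*?"srcNoMark":"(.*?)"'
v_url = re.findall(pattern, html)
print(v_url[0])  # https://example.com/video.mp4
```

Note the non-greedy `.*?`: with a greedy `.*`, the match could skip past the first `srcNoMark` to a later one belonging to a different work on the same page.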
Notes:
- Batch download from a selectable list of users is not planned
- Reasonable feature requests can be filed as issues; I will consider them when I see them
- If you have your own requirements, feel free to take the code and modify it yourself; if you like it, a star and a follow are appreciated
- This code is for learning purposes only. Do not crawl videos in violation of the law, and do not re-upload or pass off downloaded videos as your own; you bear the consequences
Project source: https://github.com/oGsLP/kuaishou-crawler