Python crawler tutorial: scrape all of a Kuaishou user's public works, image sets and videos alike!

Foreword

  • What the code does, as the title says: given a Kuaishou user's id, crawl all of that user's public works, both image sets and videos.
  • How it works: open the DevTools of any Chromium-based browser, sift through the requests until you find the one that returns the work links, replicate that request in code to fetch the data, then download each work from its URL. The same trick covers automated sign-up, login, and other site interactions; anyone who has written a crawler will recognize it.
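As a minimal sketch of that idea: replay a request captured in DevTools with the requests library. The endpoint, headers, and payload below are placeholders, not Kuaishou's real values.

```python
import requests

# Placeholder values standing in for what you would copy out of DevTools:
url = "https://example.com/graphql"
headers = {"User-Agent": "Mozilla/5.0", "Content-Type": "application/json"}
payload = {"operationName": "someQuery", "variables": {"principalId": "123", "count": 10}}

def replay(url, headers, payload):
    """Replay a captured POST request and return the parsed JSON body."""
    res = requests.post(url, headers=headers, json=payload)
    res.raise_for_status()
    return res.json()
```

The crawler below does exactly this, just with the real endpoint and a real GraphQL payload.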

Core code

  • Without further ado, here is the core code.
<code>def __crawl_user(self, uid):
    if uid.isdigit():
        uid = self.__switch_id(uid)
    payload = {
        "operationName": "privateFeedsQuery",
        "variables": {
            "principalId": uid,
            "pcursor": "",
            "count": 999
        },
        "query": "query privateFeedsQuery($principalId: String, $pcursor: String, $count: Int) {\n  privateFeeds(principalId: $principalId, pcursor: $pcursor, count: $count) {\n    pcursor\n    list {\n      id\n      thumbnailUrl\n      poster\n      workType\n      type\n      useVideoPlayer\n      imgUrls\n      imgSizes\n      magicFace\n      musicName\n      caption\n      location\n      liked\n      onlyFollowerCanComment\n      relativeHeight\n      timestamp\n      width\n      height\n      counts {\n        displayView\n        displayLike\n        displayComment\n        __typename\n      }\n      user {\n        id\n        eid\n        name\n        avatar\n        __typename\n      }\n      expTag\n      __typename\n    }\n    __typename\n  }\n}\n"
    }
    res = requests.post(self.__data_url, headers=self.__headers, json=payload)
    works = json.loads(res.content.decode(encoding='utf-8', errors='strict'))['data']['privateFeeds']['list']
    if not os.path.exists("../data"):
        os.makedirs("../data")
    # These two lines dump the raw response to a JSON file for analysis:
    # with open("data/" + uid + ".json", "w") as fp:
    #     fp.write(json.dumps(works, indent=2))

    # If the user is live streaming, the first "work" is the live stream
    # and its fields come back as NoneType, so drop it
    if works[0]['id'] is None:
        works.pop(0)
    name = re.sub(r'[\\/:*?"<>|\r\n]+', "", works[0]['user']['name'])
    dir = "data/" + name + "(" + uid + ")/"
    # print(len(works))
    if not os.path.exists(dir):
        os.makedirs(dir)
    # if not os.path.exists(dir + ".list"):
    #     print("")
    print("Start crawling user " + name + ", saving to directory " + dir)
    print("  " + str(len(works)) + " works in total")
    for j in range(len(works)):
        self.__crawl_work(uid, dir, works[j], j + 1)
        time.sleep(1)
    print("User " + name + " crawled!")
    print()
    time.sleep(1)</code>

Kuaishou works come in five types, exposed on each work as the workType attribute:

  • Two image-set types: vertical and multiple, i.e. stitched long images and multi-image posts; all image links are in imgUrls
  • Single image: single — the image link is also in imgUrls
  • Karaoke: ksong — image links work the same way; crawling the audio is not considered...
  • Video: video — the video link has to be scraped out of the HTML
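The branching these five types imply can be summed up in a tiny dispatch helper (the function name is mine, not from the project):

```python
# The four image-like types all carry their links in imgUrls;
# only video requires scraping the share-page HTML.
IMAGE_TYPES = {"vertical", "multiple", "single", "ksong"}

def classify(work_type):
    if work_type in IMAGE_TYPES:
        return "images"   # download every URL in imgUrls
    if work_type == "video":
        return "video"    # extract the URL from the HTML first
    return "unknown"

print(classify("multiple"))  # images
print(classify("video"))     # video
```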




<code>def __crawl_work(self, uid, dir, work, wdx):
    w_type = work['workType']
    w_caption = re.sub(r"\s+", " ", work['caption'])
    w_name = re.sub(r'[\/:*?"<>|\r\n]+', "", w_caption)[0:24]
    w_time = time.strftime('%Y-%m-%d', time.localtime(work['timestamp'] / 1000))
    if w_type == 'vertical' or w_type == 'multiple' or w_type == "single" or w_type == 'ksong':
        w_urls = work['imgUrls']
        l = len(w_urls)
        print("  " + str(wdx) + ") Image-set work: " + w_caption + ", " + str(l) + " images in total")
        for i in range(l):
            p_name = w_time + "_" + w_name + "_" + str(i + 1) + ".jpg"
            pic = dir + p_name
            if not os.path.exists(pic):
                r = requests.get(w_urls[i])
                r.raise_for_status()
                with open(pic, "wb") as f:
                    f.write(r.content)
                print("    " + str(i + 1) + "/" + str(l) + " image " + p_name + " downloaded √")
            else:
                print("    " + str(i + 1) + "/" + str(l) + " image " + p_name + " already exists √")
    elif w_type == 'video':
        w_url = self.__work_url + work['id']
        res = requests.get(w_url, headers=self.__headers_mobile,
                           params={"fid": 1841409882, "cc": "share_copylink", "shareId": "143108986354"})
        html = res.text
        waitreplace = work['id'] + '".*?"srcNoMark":"(.*?)"'
        v_url = re.findall(waitreplace, html)
        try:
            print("  " + str(wdx) + ") Video work: " + w_caption)
        except:
            print("  Something minor went wrong here, skipped")
        v_name = w_time + "_" + w_name + ".mp4"
        video = dir + v_name
        if v_url:
            if not os.path.exists(video):
                r = requests.get(v_url[0])
                r.raise_for_status()
                with open(video, "wb") as f:
                    f.write(r.content)
                print("    Video " + v_name + " downloaded √")
            else:
                print("    Video " + v_name + " already exists √")
        else:
            print("  Video URL not found")
    else:
        print("  Unknown work type")</code>
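The srcNoMark extraction in the video branch can be tried out on a hand-made snippet of the share page's embedded JSON (the id and URL below are invented for illustration):

```python
import re

# Invented fragment mimicking the JSON embedded in the mobile share page:
html = '{"photoId":"3xabc123","srcNoMark":"https://example.com/video.mp4"}'

work_id = "3xabc123"
# Same pattern shape as in __crawl_work: anchor on the work id,
# then capture the srcNoMark value up to the next quote.
pattern = work_id + '".*?"srcNoMark":"(.*?)"'
v_url = re.findall(pattern, html)
print(v_url)  # ['https://example.com/video.mp4']
```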
  • payload is the POST body; you can find it under the request in DevTools
  • The rest is just JSON parsing: the response holds the image URLs and the video id. The two commented-out lines can save the full JSON; uncomment them to inspect and analyze the saved response
  • For everything else, read the source — it is not hard to follow
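As a concrete illustration of that parsing path, here is a hand-made response shaped like the privateFeeds payload (all values invented):

```python
import json

# Minimal sample mimicking the structure the crawler walks:
sample = json.dumps({
    "data": {"privateFeeds": {"pcursor": "", "list": [
        {"id": "3xabc", "workType": "single",
         "imgUrls": ["https://example.com/a.jpg"],
         "caption": "demo", "timestamp": 1577836800000,
         "user": {"id": "u1", "name": "demo_user"}}
    ]}}
})

works = json.loads(sample)["data"]["privateFeeds"]["list"]
print(len(works))              # 1
print(works[0]["workType"])    # single
print(works[0]["imgUrls"][0])  # https://example.com/a.jpg
```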



    Notes:

    • Batch downloading from a selectable list is not planned
    • Reasonable feature requests can be filed as issues; I will consider them when I see them
    • If you have custom needs, feel free to take the code and modify it; if you like it, leave a star and a follow
    • This code is for learning purposes only; do not crawl videos in violation of the law or repost them without permission — you bear the consequences

    Project source: https://github.com/oGsLP/kuaishou-crawler



