Python 3: quickly scraping housing listings and storing them in MySQL, in detail

I wanted to build a fun project, so first let me lay out the plan: how to scrape the key information quickly, and how to implement automatic pagination.


After some thought, I went with the most conventional approach: requests plus re regular expressions, with BeautifulSoup for batch extraction.

<code>import requests
import re
from bs4 import BeautifulSoup
import pymysql</code>

Next, the URLs. Note that the site has anti-crawler checks: the first page must be https://tianjin.anjuke.com/sale/, while every later page must be 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page; otherwise the requests are flagged as a crawler and scraping fails. This is also what implements pagination.
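This paging rule is easy to get wrong, so it can be pulled into a small helper and checked in isolation (the name `build_url` is my own, not from the original code):

```python
def build_url(page):
    """Return the Anjuke listing URL for a page, following the site's rule:
    page 1 uses the bare /sale/ URL, later pages use /sale/p<n>/#filtersort."""
    if page == 1:
        return 'https://tianjin.anjuke.com/sale/'
    return 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page

print(build_url(1))  # https://tianjin.anjuke.com/sale/
print(build_url(5))  # https://tianjin.anjuke.com/sale/p5/#filtersort
```

The matching `referer` header can then be set from the same value, keeping the two in sync.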

<code>while page < 11:
    # brower.get("https://tianjin.anjuke.com/sale/p%d/#filtersort" % page)
    # time.sleep(1)
    print("This is page " + str(page))
    # proxy = requests.get(pool_url).text
    # proxies = {
    #     'http': 'http://' + proxy
    # }
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    # html = requests.get(url, allow_redirects=False, headers=headers, proxies=proxies)
    html = requests.get(url, headers=headers)
    # ... parsing and storage for this page go here ...
    page = page + 1
</code>

The second step, naturally, is analyzing the page and working out the pagination. First, locate the image element:


[Screenshot: the listing image element in the page source]

Time to fire up the regular expressions!

<code># image URL pattern
myjpg = r''

jpg = re.findall(myjpg, html.text)</code>
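The actual pattern did not survive the page formatting (`myjpg` appears empty above), so purely as an illustration: a pattern of this shape pulls `src` attributes out of `img` tags. The sample HTML below is invented.

```python
import re

# invented sample of what a listing's image markup might look like
sample = '<img class="thumb" src="https://pic1.ajkimg.com/a.jpg" width="180">'

# non-greedy capture of the src attribute value
myjpg = r'<img[^>]*src="(.*?)"'
jpg = re.findall(myjpg, sample)
print(jpg)  # ['https://pic1.ajkimg.com/a.jpg']
```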

With the photo URLs scraped, we can follow the same recipe and quickly grab the other fields.

<code># description pattern
mytail = r''</code>

Next, use BeautifulSoup to extract values from the relevant tags. I use the lxml parser here because it is fast; html.parser works too.

<code>soup = BeautifulSoup(html.content, 'lxml')</code>


As the screenshot shows, the markup is full of line breaks and the span tags carry no class names, so it's time for our guest star, bs4.


[Screenshot: the .details-item span markup in the page source]

A loop is used here because everything is scraped in one pass: each page yields 300 span values but only 60 listings (images), so the values are grouped five at a time, one group per listing. re.sub replaces the whitespace characters with nothing so the text can be stored cleanly in the database.

<code># collect the listing details
itemdetail = soup.select(".details-item span")
# print(len(itemdetail))
you = []
my = []
for i in itemdetail:
    # print(i.get_text())
    you.append(i.get_text())
k = 0
while k < 60:
    my.append([you[5 * k], you[5 * k + 1], you[5 * k + 2], you[5 * k + 3],
               re.sub(r'\s', "", you[5 * k + 4])])
    k = k + 1
# print(my)
# print(len(my))
</code>
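The grouping step can be rehearsed on its own with dummy data; this sketch groups a flat list of span texts five at a time and whitespace-strips the fifth field, mirroring the loop above (the sample values are made up):

```python
import re

# made-up span texts: two listings, five fields each
you = ['3室2廳', '120m²', '高層', '2005年', ' 河西 梅江 ',
       '2室1廳', '89m²', '低層', '2010年', ' 南開 華苑 ']

my = []
for k in range(len(you) // 5):
    row = you[5 * k: 5 * k + 5]
    row[4] = re.sub(r'\s', '', row[4])  # strip whitespace for clean storage
    my.append(row)

print(my[0])  # ['3室2廳', '120m²', '高層', '2005年', '河西梅江']
```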

Now write it all to the database!

<code>db = pymysql.connect(host="localhost", user="root", password="", database="anjuke")
cursor = db.cursor()
print(len(jpg))
# tail, scripts, mytotal and simple come from earlier regex extractions,
# produced the same way as jpg (their patterns are not shown above)
for i in range(0, len(tail)):
    jpgs = jpg[i]
    localroom = my[i][0]
    localarea = my[i][1]
    localhigh = my[i][2]
    localtimes = my[i][3]
    local = my[i][4]
    total = mytotal[i]
    oneprice = simple[i]
    # parameterized query, so quotes in the data cannot break the SQL
    sql = "insert into shanghai_admin values(%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    cursor.execute(sql, (jpgs, scripts, local, total, oneprice,
                         localroom, localarea, localhigh, localtimes))
db.commit()
db.close()
</code>
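pymysql follows the Python DB-API, so the insert pattern can be rehearsed without a MySQL server using the standard library's sqlite3 (the table and columns below are illustrative stand-ins for shanghai_admin, and sqlite3 writes placeholders as `?` where pymysql uses `%s`):

```python
import sqlite3

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute("create table listings (img text, descr text, total text)")

# a quote inside the data would break a string-formatted query,
# but a parameterized one handles it safely
row = ('https://pic1.ajkimg.com/a.jpg', "3室2廳 '頂樓'", '120萬')
cursor.execute("insert into listings values (?, ?, ?)", row)
db.commit()

cursor.execute("select descr from listings")
print(cursor.fetchone()[0])  # 3室2廳 '頂樓'
```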

All done! Let's look at the result.


[Screenshot: the scraped rows stored in the MySQL table]

Here is the complete code:

<code># from selenium import webdriver
import requests
import re
from bs4 import BeautifulSoup
import pymysql
# import time
# chrome_driver = r"C:\Users\秦QQ\AppData\Local\Programs\Python\Python38-32\Lib\site-packages\selenium-3.141.0-py3.8.egg\selenium\webdriver\chrome\chromedriver.exe"
# brower = webdriver.Chrome(executable_path=chrome_driver)
# pool_url = 'http://localhost:5555/random'
page = 1
while page < 11:
    # brower.get("https://tianjin.anjuke.com/sale/p%d/#filtersort" % page)
    # time.sleep(1)
    print("This is page " + str(page))
    # proxy = requests.get(pool_url).text
    # proxies = {
    #     'http': 'http://' + proxy
    # }
    if page == 1:
        url = 'https://tianjin.anjuke.com/sale/'
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    else:
        url = 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page
        headers = {
            'referer': 'https://tianjin.anjuke.com/sale/p%d/#filtersort' % page,
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
        }
    # html = requests.get(url, allow_redirects=False, headers=headers, proxies=proxies)
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.content, 'lxml')
    # image URL pattern
    myjpg = r''
    jpg = re.findall(myjpg, html.text)
    # description pattern
    mytail = r'</code>

