Python Web Scraping and Data Visualization Practice: Is 'Golden March, Silver April' Real?

2020-03-23 16:09:59 地表嘴強程序員

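A year's plan starts in spring, as the saying goes, but the spring of 2020 may have upended many people's plans because of the epidemic. March and April are traditionally one of the peak hiring seasons, and plenty of young men who visited their future mothers-in-law over the Lunar New Year came home under pressure to buy a flat. Both the job market and the housing market are said to have a 'Golden March, Silver April' (金三銀四). But is that actually true?

I recently picked up Python again (why 'again'? Because I forget it as soon as I learn it...), and figured I could run a quick check. After all, data doesn't lie.

Main steps:

Pick the housing market as the subject of the analysis, since it is somewhat related to my current company's business.
Fetch the publicly available transaction statistics for newly built commercial housing from the website of the 武漢市住房保障和房屋管理局 (Wuhan Housing Security and Housing Administration Bureau).
Read the data back, visualize it, and draw preliminary conclusions from the chart.

Here is the final chart first: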

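Step 1: Fetching the data

First, requests ('HTTP for Humans') fetches the HTML pages that carry the published transaction statistics from the bureau's website. The figures are published in two forms: daily and monthly. Then lxml, a library for processing HTML and XML, parses each page, and a suitable XPath expression extracts the required values.

My original idea was to read the daily figures and compute each month's total from them. After scraping everything, I noticed that the monthly statistics sit right below the daily ones on the index page (facepalm.jpg). However, the monthly data had only been published up to November 2019, which falls short of two full years. So I used the daily data (published up to 23 January 2020) to compute the December 2019 figure. No road in life is walked in vain after all :)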
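The idea of deriving a monthly figure from daily records can be sketched in plain Python. This is a minimal illustration, not the code used in this post: the field names `year_month_day` and `commercial_house` follow the database code later in the article, the `'YYYY-MM-DD'` date format is an assumption, and the sample values are placeholders rather than real figures.

```python
from collections import defaultdict


def monthly_totals(daily_records):
    """Sum daily counts into per-month totals keyed by 'YYYY-MM'."""
    totals = defaultdict(int)
    for record in daily_records:
        # Assumes dates look like 'YYYY-MM-DD', so the first 7 chars are 'YYYY-MM'
        month_key = record['year_month_day'][:7]
        totals[month_key] += record['commercial_house']
    return dict(totals)


# Placeholder values for illustration only, not real transaction figures
sample = [
    {'year_month_day': '2019-12-01', 'commercial_house': 10},
    {'year_month_day': '2019-12-02', 'commercial_house': 20},
    {'year_month_day': '2020-01-02', 'commercial_house': 5},
]
print(monthly_totals(sample))
```

Grouping by the date prefix keeps the sketch independent of any database; the article's own version does the same filtering with a MongoDB regex query.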

<code>import requests
import lxml.html
import html
import time

import db_operator


def get_all_monthly_datas():
    """Fetch all monthly transaction data."""
    # Index page (monthly sales statistics for commercial housing)
    index_url = 'http://fgj.wuhan.gov.cn/spzfxysydjcjb/index.jhtml'
    max_page = get_max_page(index_url)
    if max_page > 0:
        print('{} pages in total, fetching monthly data..\n'.format(max_page))

        for index in range(1, max_page + 1):
            # Page 1 is index.jhtml; later pages are index_<n>.jhtml
            if index >= 2:
                index_url = 'http://fgj.wuhan.gov.cn/spzfxysydjcjb/index_' + str(index) + '.jhtml'
            detail_urls = get_detail_urls(index, index_url)
            for detail_url in detail_urls:
                print('Fetching monthly detail page: ' + detail_url)
                monthly_detail_html_str = request_html(detail_url)
                if monthly_detail_html_str:
                    # find_and_insert_monthly_datas is not shown in this excerpt
                    find_and_insert_monthly_datas(monthly_detail_html_str)
    else:
        print('Total page count is 0.')


def request_html(target_url):
    """Request the page at the given url and return its HTML as a string."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Mobile Safari/537.36',
    }
    # A timeout keeps the crawl from hanging on an unresponsive server
    html_response = requests.get(target_url, headers=headers, timeout=10)
    html_bytes = html_response.content
    # bytes.decode() defaults to UTF-8
    html_str = html_bytes.decode()
    return html_str


def get_max_page(index_url) -> int:
    """Get the total page count from the index page."""
    print('Getting total page count..')
    index_html_str = request_html(index_url)
    selector = lxml.html.fromstring(index_html_str)
    max_page_xpath = '//div[@class="whj_padding whj_color pages"]/text()'
    result = selector.xpath(max_page_xpath)
    if result:
        # Strip line breaks and tabs from the first text node
        index_str = result[0].replace('\r', '').replace('\n', '').replace('\t', '')
        # Keep the part before the first non-breaking space, e.g. 'current/total'
        max_page = index_str.split('\xa0')[0]
        # The total is the number after the slash
        max_page = max_page.split('/')[1]
        return int(max_page)
    return 0


def get_detail_urls(index, index_url):
    """Get the list of detail-page urls from one index page."""
    print('Fetching list page: ' + index_url + '\n')
    index_html_str = request_html(index_url)

    selector = lxml.html.fromstring(index_html_str)
    # Extract the url list.
    # Open question: '//div[@class="fr hers"]/ul/li/a/@href' should give a
    # more precise match here, but it returns an empty result
    detail_urls_xpath = '//div/ul/li/a/@href'
    detail_urls = selector.xpath(detail_urls_xpath)
    return detail_urls


</code>
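The page-count parsing and the index URL pattern used above can be tried out without touching the network. The sketch below is an alternative formulation, not the code this post ran: it swaps the chained `replace`/`split` calls for a regular expression, and the sample pager text is hypothetical (the real page's wording may differ). The helper names `parse_max_page` and `index_page_urls` are invented for illustration; only the URL pattern comes from the code above.

```python
import re

BASE = 'http://fgj.wuhan.gov.cn/spzfxysydjcjb/'


def parse_max_page(pager_text):
    """Return the total from text containing 'current/total', or 0 if absent."""
    # \s also matches non-breaking spaces (\xa0) in Python 3 str patterns
    match = re.search(r'(\d+)\s*/\s*(\d+)', pager_text)
    return int(match.group(2)) if match else 0


def index_page_urls(max_page):
    """Page 1 is index.jhtml; page n >= 2 is index_<n>.jhtml."""
    urls = []
    for page in range(1, max_page + 1):
        name = 'index.jhtml' if page == 1 else 'index_{}.jhtml'.format(page)
        urls.append(BASE + name)
    return urls


# Hypothetical pager text; the real page may be worded differently
sample_text = '\r\n\t1/3\xa0\xa0\r\n'
print(parse_max_page(sample_text))
print(index_page_urls(parse_max_page(sample_text)))
```

Returning 0 when no match is found mirrors the fallback in `get_max_page`, so the caller's `max_page > 0` check keeps working.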

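Step 2: Saving the data

Once fetched, the data needs to be saved for later processing and incremental updates. Here it is stored in MongoDB, a document database that gets along very well with Python.

Pitfall: many of the MongoDB installation guides for macOS found online are out of date. Follow the instructions in mongodb/homebrew-brew instead.

With the service running, the data can be written: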

<code>from pymongo import MongoClient
from pymongo.collection import Collection
from pymongo.database import Database

client: MongoClient = MongoClient()
db_name: str = 'housing_deal_data'
col_daily_name: str = 'wuhan_daily'
col_monthly_name: str = 'wuhan_monthly'
# Named db rather than database so the pymongo.database module is not shadowed
db: Database = client[db_name]
col_daily: Collection = db[col_daily_name]
col_monthly: Collection = db[col_monthly_name]


def insert_monthly_data(year_month, monthly_commercial_house):
    """Write one month's statistics."""
    query = {'year_month': year_month}

    existed_row = col_monthly.find_one(query)
    try:
        monthly_commercial_house_value = int(monthly_commercial_house)
    except (TypeError, ValueError):
        # Not a valid number: remove any dirty row stored by an earlier run
        if existed_row:
            print('Monthly record already exists =>')
            col_monthly.delete_one(query)
            print('Deleted: monthly transaction count is not a valid number.\n')
        else:
            print('Skipped: monthly transaction count is not a valid number.\n')
    else:
        print(str({year_month: monthly_commercial_house_value}))
        item = {'year_month': year_month,
                'commercial_house': monthly_commercial_house_value}
        if existed_row:
            print('Monthly record already exists =>')
            new_values = {'$set': item}
            result = col_monthly.update_one(query, new_values)
            print('Updated: ' + str(item) + '\n' + 'result: ' + str(result) + '\n')
        else:
            result = col_monthly.insert_one(item)
            print('Inserted: ' + str(item) + '\n' + 'result: ' + str(result) + '\n')


</code>

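Because the extraction rules were not strict enough early on, some dirty data was written in the first runs. That is why, besides the normal insert and update, there is also a try-except that cleans up dirty data.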
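One way to avoid the separate insert and update branches is to validate the value first and then hand MongoDB a single upsert. The sketch below is an alternative under stated assumptions, not the code this post ran: `build_monthly_upsert` is a hypothetical helper, and the counts shown are placeholders. It only builds the filter and update documents, so it runs without a database.

```python
def build_monthly_upsert(year_month, raw_count):
    """Return (filter, update) for an upsert, or None if raw_count is not an integer."""
    try:
        count = int(raw_count)
    except (TypeError, ValueError):
        # Reject dirty values before they ever reach the database
        return None
    query = {'year_month': year_month}
    update = {'$set': {'year_month': year_month, 'commercial_house': count}}
    return query, update


print(build_monthly_upsert('2019-11', '1234'))  # placeholder count
print(build_monthly_upsert('2019-11', 'n/a'))
```

With pymongo, the returned pair could be passed as `col_monthly.update_one(query, update, upsert=True)`, which inserts the document when no match exists and updates it otherwise. Rows that are already dirty would still need the delete branch shown above.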

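Step 3: Reading the data

Once fetching and saving had finished, I inspected the result with Robo 3T, a MongoDB GUI tool, and confirmed that the data was complete overall and broadly as expected.

Next, read the data back from the database: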

<code>def read_all_monthly_datas():
    """Read all monthly statistics from the database."""
    return {'2018': read_monthly_datas('2018'),
            '2019': read_monthly_datas('2019')}


def read_monthly_datas(year: str) -> dict:
    """Read the monthly statistics for one year, keyed by zero-padded month."""
    query = {'year_month': {'$regex': '^' + year}}
    result = col_monthly.find(query).limit(12).sort('year_month')

    monthly_datas = {}
    for data in result:
        year_month = data['year_month']
        commercial_house = data['commercial_house']
        if commercial_house > 0:
            month_key = year_month.split('-')[1]
            monthly_datas[month_key] = commercial_house

    # Fewer than 12 results means some months are missing; try to compute
    # those months from the daily data instead
    if len(monthly_datas) < 12:
        for month in range(1, 13):
            month_key = "{:0>2d}".format(month)
            if month_key not in monthly_datas:
                print('{}-{} is missing..'.format(year, month_key))
                commercial_house = get_month_data_from_daily_datas(year, month_key)
                if commercial_house > 0:
                    monthly_datas[month_key] = commercial_house
    return monthly_datas


def get_month_data_from_daily_datas(year: str, month: str) -> int:
    """Compute one month's total from the daily data."""
    print('Computing {}-{} from daily data..'.format(year, month))
    query = {'year_month_day': {'$regex': '^({}-{})'.format(year, month)}}
    result = col_daily.find(query).limit(31)
    total = 0  # named total to avoid shadowing the built-in sum()
    for daily_data in result:
        total += daily_data['commercial_house']
    print('{}-{} total: {}'.format(year, month, total))
    return total
 
 
</code>

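As shown, the method that reads the monthly data also checks whether the data is complete and, if a month is missing, falls back to computing it from the daily data.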
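The completeness check on its own is small enough to isolate. This is a minimal sketch, assuming the same zero-padded month keys (`'01'` to `'12'`) that the reading code uses; `missing_months` is a hypothetical helper name and the counts are placeholders.

```python
def missing_months(monthly_datas):
    """Return the zero-padded month keys ('01'..'12') absent from monthly_datas."""
    expected = ['{:02d}'.format(m) for m in range(1, 13)]
    return [key for key in expected if key not in monthly_datas]


# Placeholder values for illustration only: January to November present
partial = {'{:02d}'.format(m): 100 for m in range(1, 12)}
print(missing_months(partial))
```

Each key returned here is a month that would have to be rebuilt from the daily collection.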

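Step 4: Visualizing the data

Since this is only an exercise to get a rough view of the overall trend, I did not attempt any elaborate charts and simply drew a basic line chart with the plotting library matplotlib: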

<code>import matplotlib.pyplot as plt
import html_spider
import db_operator


def generate_plot(all_monthly_datas):
    """Generate the chart."""
    # Use a font with CJK glyphs so Chinese text in labels renders correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['font.family'] = 'sans-serif'

    # Draw one line per year
    fig, ax = plt.subplots()
    ax.set_title('Commercial Housing Transactions (Wuhan)', fontsize=20)
    ax.set_ylabel('Transactions', fontsize=14)
    ax.set_xlabel('Month', fontsize=14)
    for year, monthly_datas in all_monthly_datas.items():
        # Sort by month key: months backfilled from daily data are appended last
        months = sorted(monthly_datas)
        ax.plot(months, [monthly_datas[m] for m in months], label=year)
    ax.legend()
    plt.show()


if __name__ == '__main__':
    # Scrape the pages (and write to the database)
    # html_spider.get_all_daily_datas()
    html_spider.get_all_monthly_datas()
    # Read the data back and draw the chart
    generate_plot(db_operator.read_all_monthly_datas())


</code>

Running this produces the chart shown at the beginning.


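Step 5: Brief analysis

The curves for the past two years show the pattern directly: in both years, transactions rise through the first half. By mid-year the mother-in-law pressure has eased (those who had to buy have bought, and those who haven't are in no hurry), so the numbers fall back, then rise again in the second half as the next wave of buyers preparing to meet their mothers-in-law arrives. In detail, February is the lowest month of the year (presumably because of the Lunar New Year holiday). Volume then climbs steadily until around August, dips in September, and rises again. The exception is a marked dip in July 2018; it is worth checking whether a policy change affecting mortgage lending or similar took effect at that time.

Looking specifically at March and April: both fall in the rising phase, but the yearly peaks actually fall at year-end and mid-year respectively. So if 'Golden March, Silver April' is taken to mean a rebound, the saying has some basis; taken to mean the peak season, it does not quite hold.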
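The difference between "a rising month" and "the peak month" can also be checked mechanically rather than by eye. The sketch below is illustrative only: `peak_month` and `is_rising` are hypothetical helpers, and the numbers are made up to have a similar shape to the description above. They are not the Wuhan figures.

```python
def peak_month(monthly_datas):
    """Return the month key with the highest count."""
    return max(monthly_datas, key=monthly_datas.get)


def is_rising(monthly_datas, month_key):
    """True if the month's count exceeds the previous month's ('02'..'12' only)."""
    previous_key = '{:02d}'.format(int(month_key) - 1)
    return monthly_datas[month_key] > monthly_datas[previous_key]


# Made-up numbers for illustration; NOT the actual transaction data
example = {'01': 50, '02': 30, '03': 60, '04': 70, '05': 80, '06': 90}
print(peak_month(example))
print(is_rising(example, '03'), is_rising(example, '04'))
```

Run against the dictionaries returned by `read_monthly_datas`, the same two checks would separate the "rebound" reading of the saying from the "peak season" reading.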

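In the end I did not reach a firm YES or NO. Perhaps many things really have no clear-cut answer :)

One more thing

Overall, 2019 was clearly higher than 2018, so there is no need to worry too much about the housing market sliding (worrying wouldn't help anyway, doge.jpg).

This exercise was originally meant to be titled "Is 'Golden September, Silver October' Real?", but I dragged my feet until it became 'Golden March, Silver April'. Sigh, procrastination is a bad habit.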
