Learning Python, Web Crawler Practice 2: A Simple Crawler with Beautiful Soup

This section covers the syntax of the web page parser Beautiful Soup and works through a practical example to get familiar with it.

To crawl an HTML page:

1. First, create a Beautiful Soup object.

2. Then search for DOM nodes with find() and find_all().

3. Finally, access the matched nodes by name, attributes, and text.
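The three steps can be sketched on a tiny document. This is a minimal illustration, not the full example: the one-line HTML string and the variable names are assumptions made up for the sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for illustration.
html = '<p class="intro">Hello, <a href="http://example.com/page">a link</a></p>'

# Step 1: create the Beautiful Soup object with Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Step 2: search the DOM. find() returns the first match, find_all() a list.
first_link = soup.find("a")
all_links = soup.find_all("a")

# Step 3: access the node's name, an attribute, and its text.
print(first_link.name)        # tag name
print(first_link["href"])     # attribute value
print(first_link.get_text())  # text inside the tag
```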

Code that has been debugged and run:

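from bs4 import BeautifulSoup
import re

# Sample document: the "three sisters" snippet from the Beautiful Soup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# html_doc is a str, so no from_encoding argument is needed
soup = BeautifulSoup(html_doc, 'html.parser')

print("1. Get all links")
links = soup.find_all("a")
for link in links:
    print(link.name, link['href'], link.get_text())

print("2. Get a specific link")
link_code = soup.find("a", href="http://example.com/tillie")
print(link_code.name, link_code['href'], link_code.get_text())

print("3. Match with a regular expression")
link_code = soup.find("a", href=re.compile(r"ill"))
print(link_code.name, link_code['href'], link_code.get_text())

print("4. Get the p tag with class title")
p_code = soup.find("p", class_="title")
print(p_code.name, p_code.get_text())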

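The listing above indexes link['href'] directly and assumes that find() always matches something. On real pages neither is guaranteed, so the sketch below shows two defensive habits, plus a search by text, which the listing does not cover. The HTML snippet is a made-up variation of the sample document, and the variable names are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister">Elsie</a>
<a href="http://example.com/lacie" class="sister">Lacie</a>
<a class="sister">No link</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by text: string= matches a tag whose only content is that string.
lacie = soup.find("a", string="Lacie")
print(lacie["href"])

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("a", href=re.compile(r"tillie"))
if missing is None:
    print("no matching link")

# tag.get() returns None for a missing attribute instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a", class_="sister")]
print(hrefs)
```

Checking for None and using get() keeps a crawler from crashing on pages whose markup differs from what was expected.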
