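This section covers the syntax of the web page parser Beautiful Soup and works through a practical example to get familiar with it.
To scrape an HTML page:
1. First, create a BeautifulSoup object.
2. Then search for DOM nodes with find() and find_all().
3. Finally, access the nodes. A node can be accessed by its name, its attributes, or its text.
Code that has been debugged and run: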
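from bs4 import BeautifulSoup
import re

# The "three sisters" sample document from the Beautiful Soup documentation.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# html_doc is already a str, so no from_encoding argument is needed.
soup = BeautifulSoup(html_doc, 'html.parser')

print("1. Get all links")
links = soup.find_all("a")
for link in links:
    # Access each node by name, attribute, and text.
    print(link.name, link['href'], link.get_text())

print("2. Get a specific link")
link_code = soup.find("a", href="http://example.com/tillie")
print(link_code.name, link_code['href'], link_code.get_text())

print("3. Match with a regular expression")
# find() returns the first <a> whose href contains "ill".
link_code = soup.find("a", href=re.compile(r"ill"))
print(link_code.name, link_code['href'], link_code.get_text())

print("4. Get a p node by class")
# class is a Python keyword, so Beautiful Soup uses class_ instead.
p_code = soup.find("p", class_="title")
print(p_code.name, p_code.get_text())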
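The lookups in the code above assume that every search finds a node and that every link has an href. On real pages that is often not the case: find() returns None when nothing matches, and indexing a missing attribute with tag['href'] raises a KeyError. The following is a minimal sketch of a more defensive way to access nodes. The HTML snippet and the helper name describe_link are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration: one normal link and one without href.
html = '<p class="title"><b>Demo</b></p><a href="http://example.com/a">A</a><a>no href</a>'
soup = BeautifulSoup(html, "html.parser")


def describe_link(tag):
    """Return (name, href, text); href is None when the attribute is absent."""
    # tag.get() returns None for a missing attribute instead of raising KeyError.
    return tag.name, tag.get("href"), tag.get_text()


# find() returns None when no node matches, so check before using the result.
missing = soup.find("a", href="http://example.com/zzz")
missing_info = describe_link(missing) if missing is not None else None

# find_all() returns an empty list when nothing matches, so iterating is always safe.
all_links = [describe_link(a) for a in soup.find_all("a")]

print(missing_info)
print(all_links)
```

Using tag.get("href") rather than tag["href"] is a small design choice: it trades an exception for a None value, which is usually easier to handle when scraping pages whose markup is not under your control.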