How to Use BeautifulSoup, a Python Web-Scraping Library

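This article explains how to use BeautifulSoup, a Python library for parsing scraped web pages. It walks through the library's main features with examples and is meant as a practical reference.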

1. Introduction

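BeautifulSoup is a flexible, convenient, and efficient library for parsing web pages, and it supports several parsers. It lets you extract information from a page without writing regular expressions.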

Parsers supported by BeautifulSoup

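Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires installing lxml, which depends on C libraries
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires installing lxml, which depends on C libraries
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses the document the way a browser does; produces valid HTML5	Slow; requires installing the separate html5lib package (pure Python, no C extension)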
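
The parser is chosen by the second argument to BeautifulSoup. As a minimal sketch (the helper name make_soup is made up for illustration), the code below prefers lxml and falls back to the built-in html.parser when lxml is not installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; fall back to the parser bundled with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>soup</b>")
print(soup.b.string)
print(soup.p.get_text())
```

Different parsers can build slightly different trees from malformed markup, so it is usually safer to name one parser explicitly in production code.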

2. Quick Start

Given an HTML document, create a BeautifulSoup object:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

Print the whole document, formatted:

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" rel="external nofollow" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" rel="external nofollow" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Navigating the parsed data

print(soup.title)  # the <title> tag and its contents
print(soup.title.name)  # the name of the <title> tag
print(soup.title.string)  # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)  # the first <p> tag
print(soup.p['class'])  # the class of the first <p> tag
print(soup.a)  # the first <a> tag
print(soup.find_all('a'))  # all <a> tags
print(soup.find(id="link3"))  # the first tag with id='link3'

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" rel="external nofollow" id="link3">Tillie</a>

Extract the links from all <a> tags

for link in soup.find_all('a'):
  print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
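
Real pages often contain <a> tags without an href. The sketch below uses made-up markup (and html.parser, so that it runs without lxml) to show how get() behaves for such tags and how href=True filters them out:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> is a named anchor without an href.
html = """
<a href="http://example.com/elsie">Elsie</a>
<a name="top">Top</a>
<a href="http://example.com/lacie">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# .get() returns None for a missing attribute instead of raising KeyError.
all_hrefs = [a.get("href") for a in soup.find_all("a")]
print(all_hrefs)

# href=True keeps only the tags that actually carry an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```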

Get all the text content

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
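
get_text() also accepts a separator and a strip flag, which helps when adjacent elements would otherwise run together. A small sketch on made-up markup, parsed with html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p><p>again</p>", "html.parser")

# Default: strings are concatenated exactly as they appear in the document.
raw = soup.get_text()

# Strip each string and join the pieces with a single space.
clean = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```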

Auto-completing tags and formatting the output

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.prettify())  # format the markup; missing closing tags are filled in
print(soup.title.string)  # the text inside the <title> tag
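
To see the auto-completion in isolation, here is a minimal sketch with a deliberately broken fragment. It uses html.parser, which closes the open tags but does not add any wrapper elements; lxml would additionally wrap the result in <html> and <body>.

```python
from bs4 import BeautifulSoup

broken = "<p>one<b>two"  # neither <p> nor <b> is closed

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)
print(repaired)
```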

Tag selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.title)  # select the <title> tag
print(type(soup.title))  # check its type
print(soup.head)

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.title.name)

Getting tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.p.attrs['name'])  # value of the name attribute of the first <p> tag
print(soup.p['name'])  # a more direct way to write the same lookup
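
Indexing a tag with a missing attribute raises KeyError, while get() returns None or a default. The sketch below, on made-up markup parsed with html.parser, also shows that class comes back as a list because it can hold several values:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">Hi</p>', "html.parser")
p = soup.p

print(p.attrs)              # all attributes as a dict
print(p["class"])           # multi-valued attribute, returned as a list
print(p.get("id"))          # missing attribute: .get() returns None
print(p.get("id", "n/a"))   # or a default of your choice

try:
    p["id"]                 # indexing a missing attribute raises KeyError
    has_id = True
except KeyError:
    has_id = False
print(has_id)
```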

Getting the tag content

print(soup.p.string)
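
.string only returns a value when the tag has a single child (or a chain of single children); otherwise it is None. A small sketch on made-up markup, parsed with html.parser, that also covers the comment case from the sample document:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup(
    '<p id="one"><b>bold</b></p>'
    '<p id="two">plain <b>bold</b></p>'
    "<a><!-- Elsie --></a>",
    "html.parser",
)
one = soup.find(id="one")
two = soup.find(id="two")

print(one.string)      # single chain of children: the inner string
print(two.string)      # several children: None
print(two.get_text())  # get_text() joins all the strings instead

# The only child of <a> is a comment, so .string is a Comment object.
is_comment = isinstance(soup.a.string, Comment)
print(is_comment)
```

When a tag may contain mixed content, get_text() is usually the safer choice.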

Nested tag selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.head.title.string)

Children and descendants

html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a>
      and
      <a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.p.contents)  # the tag's direct children, as a list

Another way is children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.p.children)  # an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

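The nodes printed are the same as above, with an index added. Note that children returns an iterator rather than a list, so printing it directly only shows the iterator object; loop over it (or pass it to list()) to see the child nodes.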

Getting descendants:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.p.descendants)  # an iterator over all of the tag's descendants
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)
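
The difference between children and descendants is easiest to see by counting. In this sketch (made-up markup, html.parser), whitespace between tags counts as a string node, so filtering on Tag is a common way to keep only the elements:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    "<p>\n<a><span>Elsie</span></a>\n<a>Lacie</a>\n</p>", "html.parser"
)
p = soup.p

children = list(p.children)        # direct children only
descendants = list(p.descendants)  # children, grandchildren, and so on

# Keep only element nodes, dropping the whitespace strings between them.
child_tags = [node for node in p.children if isinstance(node, Tag)]

print(len(children), len(descendants))
print([tag.name for tag in child_tags])
```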

Parent and ancestors

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(soup.a.parent)  # the parent of the first <a> tag

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(list(enumerate(soup.a.parents)))  # all ancestors of the first <a> tag
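
Printing whole ancestor nodes is verbose, so a compact way to inspect the chain is to print only the names. A sketch on made-up markup, parsed with html.parser; the last entry is the BeautifulSoup object itself, whose name is '[document]':

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a>Elsie</a></p></body></html>", "html.parser"
)

# Walk upward from the first <a> tag to the root of the tree.
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)
```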

Siblings

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # parser to use: lxml
print(list(enumerate(soup.a.next_siblings)))  # siblings after the first <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the first <a> tag
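
Siblings include text nodes, so the next sibling of a tag is often the text or whitespace that follows it rather than the next tag. A short sketch on made-up markup (html.parser) showing how find_next_sibling() skips ahead to the next matching tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>', "html.parser"
)
first = soup.a

print(repr(first.next_sibling))   # the text between the two tags
print(first.previous_sibling)     # nothing comes before the first <a>

# Skip over text nodes and go straight to the next <a> tag.
next_link = first.find_next_sibling("a")
print(next_link["id"])
```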

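Standard selectors

find_all( name , attrs , recursive , text , **kwargs )

find_all() searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later the text argument is also available under the name string.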

name

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # find all <ul> tags (returned with their contents)
print(type(soup.find_all('ul')[0]))  # check the type of a result

Results are tags, so you can search inside them too. The following example finds the <li> tags under each <ul> tag:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))
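
The name argument is not limited to a single string. As a sketch on a trimmed-down version of the sample markup (html.parser), it can also be a list or a compiled regular expression, and the limit and recursive arguments narrow the search further:

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="panel">
  <h5>Hello</h5>
  <ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the given tag names.
by_list = [t.name for t in soup.find_all(["h5", "ul"])]

# A compiled regex is matched against each tag name.
by_regex = [t.name for t in soup.find_all(re.compile("^(u|l)"))]

# limit stops the search after the given number of matches.
first_two = [t.string for t in soup.find_all("li", limit=2)]

# recursive=False looks only at direct children of the tag.
direct = [t.name for t in soup.div.find_all(True, recursive=False)]

print(by_list, by_regex, first_two, direct)
```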

attrs (attributes)

Find elements by their attributes:

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))

Both calls return the same element, because the two attributes belong to the same tag.

Searching with keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
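
Two details are worth a quick sketch: class_ matches any one of a tag's classes, and attribute names that are not valid Python identifiers have to go through attrs. The data-kind attribute below is made up for illustration, and html.parser is used so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1" data-kind="main"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class, so both lists are found.
both = [ul["id"] for ul in soup.find_all(class_="list")]

# data-kind is not a valid Python identifier, so it cannot be a keyword
# argument; pass it through the attrs dictionary instead.
main = [ul["id"] for ul in soup.find_all(attrs={"data-kind": "main"})]

# Several filters in one call must all match.
small = [ul["id"] for ul in soup.find_all("ul", class_="list-small", id="list-2")]

print(both, main, small)
```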

text

Select by text content:

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # find strings equal to 'Foo'; the results are strings, not tags

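So text is handy for checking whether certain text appears in the document, but less convenient for locating elements, because it returns the matching strings rather than the tags that contain them.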
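
If the enclosing tags are what you need, there are two common routes, sketched below on made-up markup with html.parser: step from each matched string to its parent, or combine a tag name with the text filter. The sketch assumes Beautiful Soup 4.4.0 or later, where the filter can be spelled string.

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# Text-only search: the results are NavigableString objects.
strings = soup.find_all(string="Foo")

# Route 1: go from each string to the tag that contains it.
tags_via_parent = [s.parent for s in strings]

# Route 2: give a tag name as well, and the results are tags.
tags_direct = soup.find_all("li", string="Foo")

print(strings, tags_via_parent, tags_direct)
```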

Methods

find

find() takes the same arguments as find_all(), but returns only the first matching element, or None if nothing matches.
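
The different return values matter when nothing matches. A minimal sketch on made-up markup (html.parser) contrasting the two methods and guarding against a missing result:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first = soup.find("li")               # first match, a Tag
missing = soup.find("table")          # no match: None
all_missing = soup.find_all("table")  # no match: an empty list

# Check for None before calling methods on the result of find().
table_text = missing.get_text() if missing is not None else ""

print(first.string, missing, all_missing, repr(table_text))
```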

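find_parents(), find_parent()

find_parents() returns all matching ancestors; find_parent() returns the nearest matching ancestor (with no filter, the direct parent).

find_next_siblings(), find_next_sibling()

find_next_siblings() returns all matching siblings that come after the node; find_next_sibling() returns the first of them.

find_previous_siblings(), find_previous_sibling()

find_previous_siblings() returns all matching siblings that come before the node; find_previous_sibling() returns the nearest of them.

find_all_next(), find_next()

find_all_next() returns all matching nodes that come after the node in the document; find_next() returns the first of them.

find_all_previous(), find_previous()

find_all_previous() returns all matching nodes that come before the node in the document; find_previous() returns the nearest of them.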
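
A short sketch of these methods, starting from the first "Bar" item in a trimmed-down version of the sample markup. It uses html.parser and the string filter (Beautiful Soup 4.4.0 or later) to pick the starting node:

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul id="list-2"><li>Foo</li><li>Bar</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # the first "Bar" item, inside list-1

parent_id = bar.find_parent("ul")["id"]
ancestors = [t.name for t in bar.find_parents(["ul", "div"])]
later = [li.string for li in bar.find_next_siblings("li")]
earlier = bar.find_previous_sibling("li").string
next_ul_id = bar.find_next("ul")["id"]
before = [li.string for li in bar.find_all_previous("li")]

print(parent_id, ancestors, later, earlier, next_ul_id, before)
```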

CSS selectors

Pass a CSS selector string directly to select() to make a selection.

html='''
<div class="panel">
  <div class="panel-heading">
    <h5>Hello</h5>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # '.' means class; a space separates ancestor and descendant
print(soup.select('ul li'))  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))  # '#' means id: elements with class "element" inside the tag with id "list-2"
print(type(soup.select('ul')[0]))  # print the type of a result
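
A few more selector forms, sketched on a trimmed-down version of the sample markup with html.parser: select_one() for a single result, the child combinator, chained classes, and an attribute prefix match.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match, or None when nothing matches.
first_item = soup.select_one("#list-1 > li")
nothing = soup.select_one("table")

# Chained classes must all be present on the same tag.
small_ids = [ul["id"] for ul in soup.select("ul.list.list-small")]

# Attribute selector: id values that start with "list-".
by_attr = [ul["id"] for ul in soup.select('ul[id^="list-"]')]

print(first_item.string, nothing, small_ids, by_attr)
```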

Selections can also be nested, level by level:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # use [] to read an attribute
    print(ul.attrs['id'])  # another way to write it

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

The get_text() method returns the text content of each element.

That covers the basics of using BeautifulSoup, a Python web-scraping library. Thanks for reading; we hope it helps.

Article title: How to Use BeautifulSoup, a Python Web-Scraping Library
Article link: http://bm7419.com/article32/ddhpsc.html
