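requests can fetch the contents of a web page, but what it returns is raw HTML markup rather than data we can use directly, so it has to be converted into a more convenient form first.
BeautifulSoup parses that HTML and lets us strip away the tags, leaving just the data and text content of the page.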
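As a minimal sketch of the tag-stripping step, the snippet below parses a small hard-coded HTML string and pulls out only its text with get_text(). The html_doc string is an assumption that stands in for a page body fetched with requests; it is not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a fetched page body (e.g. response.text).
html_doc = '<html><body><h1>HelloWorld</h1><p>Some <b>bold</b> text</p></body></html>'

soup = BeautifulSoup(html_doc, 'html.parser')

# get_text() drops the markup and keeps only the text nodes.
# separator=' ' joins the pieces with a space; strip=True trims each piece.
page_text = soup.get_text(separator=' ', strip=True)
print(page_text)
```

The separator and strip arguments are optional; without them the text nodes are simply concatenated as they appear in the markup.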
1. Extracting the content of a specific tag
A single tag
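from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')
# select() always returns a list, so take the first match
header = soup.select('h1')
print(header[0].text)

Multiple tags of the same kind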
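from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')
# select('a') returns every <a> element; loop over them to print each text
header = soup.select('a')
for alink in header:
    print(alink.text)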
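When only the first match is needed, select_one() can replace the select(...)[0] pattern. The following is a small sketch using the same kind of sample markup as above (the href values are placeholders); it also shows that select_one() returns None when nothing matches, whereas indexing an empty select() result would raise an IndexError.

```python
from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

# select_one() returns the first matching element, not a list.
first_link = soup.select_one('a')
print(first_link.text)

# select() returns all matches; a list comprehension collects their text.
link_texts = [alink.text for alink in soup.select('a')]
print(link_texts)

# No <table> exists in the sample, so select_one() returns None.
missing = soup.select_one('table')
print(missing)
```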
2. Selecting elements by a specific CSS attribute
For an id, prefix the name with #
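from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')
# '#title' matches the element whose id is "title"
header = soup.select('#title')
# prints the list of matching elements, tags included
print(header)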
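Printing the result of select('#title') shows a list that still contains the tag itself. As a hedged illustration of how to get from that list to the pieces you usually want, the sketch below reads the tag name, the id attribute, and the text of the matched element. The sample markup mirrors the one used above, with placeholder href values.

```python
from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

matches = soup.select('#title')   # a list, even for a unique id
title = matches[0]                # the matched Tag object

tag_name = title.name             # name of the tag that carries the id
title_id = title['id']            # attribute lookup works like a dict
title_text = title.text           # text content without the markup
print(tag_name, title_id, title_text)
```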
For a class, prefix the name with .
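from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')
# '.link' matches every element whose class list contains "link"
header = soup.select('.link')
for alink in header:
    print(alink.text)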
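Tag, class, and id selectors can also be combined in one CSS expression. The sketch below is an illustration under the same assumed sample markup (placeholder href values): 'a.link' restricts the class match to a tags, and 'body > a.link' additionally requires the link to be a direct child of body.

```python
from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title" class="link">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

# '.link' alone matches the <h1> too, because it also has class "link".
all_link_class = [el.name for el in soup.select('.link')]

# 'a.link' keeps only <a> elements with that class.
anchor_texts = [el.text for el in soup.select('a.link')]

# 'body > a.link' keeps only those that are direct children of <body>.
direct_children = soup.select('body > a.link')

print(all_link_class)
print(anchor_texts)
print(len(direct_children))
```

Note that the h1 in this sketch was given class "link" on purpose, to show the difference between '.link' and 'a.link'.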
3. Getting the link target (href) of a tags
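from bs4 import BeautifulSoup

# Sample markup; the href values are placeholders.
html_sample = (
    '<html><body>'
    '<h1 id="title">HelloWorld</h1>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')
header = soup.select('a')
for alink in header:
    # attributes are read with dictionary-style indexing
    print(alink['href'])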
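Indexing with alink['href'] raises a KeyError if an a tag has no href attribute. A slightly more defensive sketch uses .get(), which returns None for a missing attribute, and .attrs, which exposes all attributes as a dictionary. The markup below is a hypothetical variant of the sample with one link that lacks href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third <a> deliberately has no href attribute.
html_sample = (
    '<html><body>'
    '<a href="#" class="link">This is link1</a>'
    '<a href="#link2" class="link">This is link2</a>'
    '<a class="link">No target</a>'
    '</body></html>'
)
soup = BeautifulSoup(html_sample, 'html.parser')

hrefs = []
for alink in soup.select('a'):
    href = alink.get('href')      # None when the attribute is absent
    if href is not None:
        hrefs.append(href)
print(hrefs)

# .attrs holds every attribute of the tag; class is stored as a list
# because an element can carry several classes.
first_attrs = soup.select_one('a').attrs
print(first_attrs)
```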