Garbled text with BeautifulSoup in Python (actually the requests response is garbled)
import re

import bs4
import requests as rq

i = 0  # counter for pages that contain a table (initial value assumed)
for url in urls:  # urls: the list of pages to scrape, defined elsewhere
    resp = rq.get(url)
    # print(resp.content)
    bs = bs4.BeautifulSoup(resp.text, "html.parser")
    h1 = bs.find_all("h1")
    pattern = re.compile("^2019年(.+)招生计划$")
    pattern.match(h1[0].text)
    print(h1[0].text)  # garbled here; tried .encode("utf8") / .decode("utf8")
    # res = bs.find_all(is_entry_class)
    res = bs.select("div.entry table")
    if res:  # select() returns a list (possibly empty), never None
        i = i + 1
        print(i)
        for child in res[0].tbody.children:
            row = []
            for son in child.children:
                row.append(son.text)
            print("\t".join(row))
        print()
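Debugging shows that requests decoded the response as ISO-8859-1, while the charset declared in the page's own head is utf-8. This is most likely because the server's HTTP Content-Type header carries no charset parameter: in that case requests falls back to ISO-8859-1 for text/* responses and does not look at the meta tag inside the HTML.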
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>2019年...</title>
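
To make the mismatch concrete, here is a minimal sketch that reproduces the garbling without any network access. The heading text is a made-up sample, not taken from the real site; the point is only that UTF-8 bytes decoded as ISO-8859-1 turn into mojibake, and that the damage is reversible because ISO-8859-1 maps every byte to exactly one character.

```python
# Sample page fragment (hypothetical content), encoded the way the server sends it.
raw = "<h1>2019年招生计划</h1>".encode("utf-8")

# What resp.text effectively does when resp.encoding is ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# What we actually want: decode with the charset the page declares.
correct = raw.decode("utf-8")

print(garbled)   # unreadable mojibake
print(correct)   # readable heading

# Round trip: re-encode the mojibake as ISO-8859-1, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == correct)
```

The round trip in the last lines is only a diagnostic trick. In a scraper it is simpler to fix the encoding before reading resp.text.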

Fix:
Set the encoding of the response directly, before reading resp.text
for url in urls:
    resp = rq.get(url)
    # print(resp.content)
    resp.encoding = "utf8"  # override the guessed encoding; resp.text now decodes as UTF-8
    ...
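
Hard-coding "utf8" works for this site, but a scraper that visits several sites may prefer to read the charset from the page itself. The following is a rough sketch under that assumption. The helper name sniff_meta_charset and the sample bytes are hypothetical, and the regular expression only handles the common meta tag forms, so treat it as an illustration rather than a complete detector.

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # The declaration normally sits near the top of the document.
    match = _META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default


# Stand-in for resp.content (hypothetical page, modelled on the head shown earlier).
sample = (
    b"<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b"</head><body><h1>" + "2019年招生计划".encode("utf-8") + b"</h1></body></html>"
)

encoding = sniff_meta_charset(sample)
text = sample.decode(encoding)
print(encoding)
print(text)
```

In the loop above this would be used as resp.encoding = sniff_meta_charset(resp.content). Two built-in alternatives are also worth trying: requests exposes resp.apparent_encoding, a guess based on the body bytes, and BeautifulSoup can be given resp.content (bytes) instead of resp.text so that it does its own encoding detection.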

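This post explains how to fix garbled text when processing web pages with BeautifulSoup in Python. The root cause is that the encoding requests assumes for the response differs from the encoding the site actually uses. Setting the response encoding to utf8 directly resolves the problem.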