Python Web Scraping Explained: Principles, Common Libraries, and Hands-On Examples (Worth Bookmarking)

Latest recommended article published on 2025-06-16 12:01:43

This content is licensed under CC BY-SA 4.0.

I. What Exactly Is a Crawler? (How It Works)

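Let's start with a pointed question: if your browser can read a web page, why couldn't a Python crawler? At bottom, a browser and a crawler both send network requests; they differ only in what they do with the response!

For example, when you search for "mechanical keyboard" on a shopping site such as Taobao:

The browser sends an HTTP request to the server
The server returns HTML data
The browser renders it into a visible page

A crawler's workflow, by contrast, looks like this (pay attention!):

Send request -> Get response -> Parse data -> Store data

(Here is the key point) The biggest difference: a browser renders the page, while a crawler extracts structured data from it!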
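
The four steps can be sketched with nothing but the standard library. This is a minimal illustration, not a production crawler: the names `fetch`, `LinkParser`, `parse`, and `store` are made up for this example, and the "structured data" here is simply the text and `href` of every link on the page.

```python
import json
import urllib.request
from html.parser import HTMLParser


def fetch(url: str, timeout: float = 10.0) -> str:
    """Steps 1-2: send the request and return the response body as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 3: collect the text and href of every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(
                {"text": "".join(self._text).strip(), "href": self._href}
            )
            self._href = None


def parse(html: str) -> list:
    """Step 3 wrapper: turn raw HTML into a list of dicts."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def store(rows: list, path: str) -> None:
    """Step 4: save the structured rows as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


# Offline demo of the parse step on an in-memory page
sample = '<p><a href="/a">First</a> and <a href="/b"> Second </a></p>'
print(parse(sample))
```

Against a real site the steps chain together as `store(parse(fetch(url)), "links.json")`; the demo at the bottom only exercises the parse step so it runs without network access.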

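II. Five Essential Libraries for Crawling

1. Requests (the go-to HTTP library)

import requests
response = requests.get('https://example.com', timeout=10)  # a timeout keeps the call from hanging
print(response.text)  # the page source as text

(Tested and working) Its API is much simpler than urllib's, and it handles cookies and sessions conveniently!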

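2. BeautifulSoup4 (the HTML parsing specialist)

from bs4 import BeautifulSoup
html_doc = '<html><body><h1>Hello</h1></body></html>'  # sample input
soup = BeautifulSoup(html_doc, 'lxml')  # the 'lxml' parser requires the lxml package
title = soup.find('h1').text  # extract the heading text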

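3. Scrapy (an industrial-strength framework)

A good fit for large projects. It comes with:

Request scheduling
Item pipelines
Middleware
Asynchronous processing
(A must-have for real projects!)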

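4. Selenium (for dynamic pages)

For pages rendered by JavaScript, go straight to browser automation:

from selenium import webdriver
driver = webdriver.Chrome()  # needs a local Chrome installation
driver.get('https://dynamic-site.com')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()  # close the browser when done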

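5. PyQuery (jQuery-style parsing)

If you already know jQuery syntax:

from pyquery import PyQuery as pq
html = '<div><span class="product-price">¥99</span></div>'  # sample input
doc = pq(html)
price = doc('.product-price').text()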

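III. Three Hands-On Examples for Beginners

Example 1: E-commerce price monitoring (requests + bs4)

import requests
from bs4 import BeautifulSoup

url = 'https://某电商.com/手机'  # placeholder URL
headers = {'User-Agent': 'Mozilla/5.0'}  # present a browser-like User-Agent

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'lxml')

# Extract product information
products = []
for item in soup.select('.product-item'):
    title_tag = item.select_one('.title')
    price_tag = item.select_one('.price')
    if title_tag is None or price_tag is None:
        continue  # skip items missing a title or a price
    name = title_tag.text.strip()
    # Drop the currency sign and any thousands separators
    price = price_tag.text.strip().lstrip('¥￥').replace(',', '')
    products.append({'name': name, 'price': float(price)})

print(f"Scraped prices for {len(products)} phones!")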
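
Price strings on real pages vary (different currency signs, thousands separators, "no price yet" placeholders), so it can help to isolate the parsing in a small helper and to compare snapshots to spot changes. The sketch below is one way to do that; `parse_price` and `price_changes` are hypothetical names, and `Decimal` is used instead of `float` to avoid rounding surprises with money.

```python
import re
from decimal import Decimal
from typing import Optional

# First run of digits, optionally with thousands separators and a decimal part
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text: str) -> Optional[Decimal]:
    """Return the first number found in text, or None if there is none."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def price_changes(previous: dict, current: dict) -> dict:
    """Return {name: (old, new)} for items whose price differs between snapshots."""
    return {
        name: (previous[name], price)
        for name, price in current.items()
        if name in previous and previous[name] != price
    }


yesterday = {"Phone A": parse_price("¥1,999.00"), "Phone B": parse_price("¥2,999")}
today = {"Phone A": parse_price("¥1,899.00"), "Phone B": parse_price("¥2,999")}
print(price_changes(yesterday, today))
```

Items that appear in only one snapshot are ignored here; whether new or delisted products should be reported is a design choice left open.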

Example 2: Dynamically loaded data (Selenium in practice)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://某社交网站.com')  # placeholder URL

# Wait up to 10 seconds for the login form to load
wait = WebDriverWait(driver, 10)
login_form = wait.until(EC.presence_of_element_located((By.ID, 'loginForm')))

# Fill in and submit the form
driver.find_element(By.NAME, 'username').send_keys('your_account')
driver.find_element(By.NAME, 'password').send_keys('your_password')
driver.find_element(By.XPATH, '//button[@type="submit"]').click()

# Wait for the dynamically loaded posts to appear, then read them
posts = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'post-content')))
print(f"The current page has {len(posts)} posts")

driver.quit()  # close the browser when done

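Example 3: Dealing with anti-scraping measures (proxy IP + random UA)

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # pick a random browser User-Agent

# Placeholder proxy addresses; replace them with proxies you are allowed to use
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}

try:
    response = requests.get(
        'https://target-site.com',
        headers=headers,
        proxies=proxies,
        timeout=5
    )
    response.raise_for_status()  # treat 4xx/5xx responses as failures
    print("Request succeeded! Status code:", response.status_code)
except requests.RequestException as e:
    print("Request failed:", e)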
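
A single try-except reports the failure but gives up immediately. For transient errors (timeouts, dropped connections) a common complement is to retry with exponential backoff. The helper below is a sketch under assumptions: `with_retries` and its parameters are hypothetical names, and it takes any zero-argument callable so it does not depend on a particular HTTP library.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action(); on a listed exception, wait and try again.

    The wait doubles after each failure (base_delay, 2x, 4x, ...) plus a
    random jitter of up to base_delay. The last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            sleep(delay)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With Requests this could be used as `with_retries(lambda: requests.get(url, timeout=5))`; the default `retry_on=(OSError,)` covers it because `requests.RequestException` derives from `IOError`. Note that this retries on exceptions only, not on HTTP error status codes, unless the callable raises for them.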

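IV. Pitfalls to Avoid (Lessons Learned the Hard Way!)

Respect the robots protocol: append /robots.txt to the domain to see the rules
Use a reasonable delay: time.sleep(random.uniform(1, 3)) between requests avoids hammering the server
Always handle errors: wrap the key code in try-except
Mind the legal risks: do not scrape sensitive data or personal information
Rotate the User-Agent: do not send the HTTP library's default Python User-Agent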
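
The first two rules can be sketched with the standard library: `urllib.robotparser` answers whether a URL may be fetched, and a small helper adds the random pause. This is an illustrative sketch; `USER_AGENT`, `load_robots`, `allowed`, and `polite_delay` are hypothetical names chosen for this example.

```python
import random
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-learning-crawler/0.1"  # hypothetical crawler name


def load_robots(site_url: str) -> RobotFileParser:
    """Download and parse <scheme>://<host>/robots.txt for the given URL."""
    parts = urlsplit(site_url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # performs a network request
    return robots


def allowed(robots: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """True if the parsed robots.txt rules permit user_agent to fetch url."""
    return robots.can_fetch(user_agent, url)


def polite_delay(low: float = 1.0, high: float = 3.0, sleep=time.sleep) -> float:
    """Sleep for a random interval in [low, high] seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Offline demo: parse rules from text instead of downloading them
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
print(allowed(rules, "https://example.com/index.html"))
print(allowed(rules, "https://example.com/private/data"))
```

In a crawl loop, the idea would be to call `load_robots` once per site, skip any URL for which `allowed` returns False, and call `polite_delay()` between requests.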

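(Very important) When starting out, practice on sites that allow crawling, for example:

Douban Movie (open API; confirm it is still available and check its terms before relying on it)
Government open-data portals
Wikipedia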

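V. Choosing a Data Storage Option

Storage	Best for	Recommended library
CSV	Small datasets	csv
Excel	Data that will be viewed or charted in a spreadsheet	openpyxl
MySQL	Structured data	pymysql
MongoDB	Unstructured data	pymongo
JSON	Temporary storage	json
SQLite	Lightweight, single-machine storage	sqlite3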
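
Two of the options above need no extra installation, so they are easy to try first. The sketch below assumes scraped rows are dicts with `name` and `price` keys (the shape used in Example 1); `save_csv`, `save_sqlite`, and the `products` table are hypothetical names for this illustration.

```python
import csv
import sqlite3


def save_csv(rows: list, path: str) -> None:
    """Write rows (dicts with 'name' and 'price') to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)


def save_sqlite(rows: list, db_path: str) -> None:
    """Insert rows into a 'products' table, replacing rows with the same name."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS products "
                "(name TEXT PRIMARY KEY, price REAL)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO products (name, price) "
                "VALUES (:name, :price)",
                rows,
            )
    finally:
        conn.close()
```

CSV is the simpler choice for a one-off export, while the SQLite version keeps one row per product name, so re-running the scraper updates prices instead of appending duplicates.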