Python Concurrency: A Deep Dive

Latest recommended article published on 2026-06-19 17:00:49

This content is licensed under CC 4.0 BY-SA.

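This article is for developers with a Java background who are learning Python. Using familiar terms as analogies, it gives a systematic tour of Python concurrency: from the low-level constraint of the GIL, through the three paths of threading, multiprocessing, and asyncio, to choosing between them, common pitfalls, and troubleshooting in production.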

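Preface

While learning Python, I found that its concurrency model differs from Java's at a fundamental level: the "thread pool + synchronized + JUC" toolkit that Java developers rely on mostly fails to carry over. The GIL stands like a wall that splits Python concurrency into three separate paths.

In one sentence: Python concurrency is a landscape that the GIL has split three ways. Threads are held back by the lock, processes are heavyweight, and coroutines demand a different programming paradigm, so the right choice depends on whether your bottleneck is CPU or IO.

This article works from the low-level constraint up to practice, covering 9 core topics:

What is the GIL? Why does it exist? How does it affect concurrency?
threading: threads constrained by the GIL, and when they are still useful
multiprocessing: the costs and benefits of bypassing the GIL
asyncio: single-threaded coroutines, a different philosophy of concurrency
concurrent.futures: a unified abstraction layer
A decision framework: "which one should I use?"
Common pitfalls and patterns
Troubleshooting in production
The future: free-threading

This article is a companion to "Python Memory Management: A Deep Dive" and keeps the same target audience and level of depth.

The sections below cover each topic in turn.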

1. The GIL: Python Concurrency's "Original Sin"

1.1 What is the GIL?

The GIL (Global Interpreter Lock) is a mutex inside the CPython interpreter. Its rule is simple:

At any moment, only one thread can execute Python bytecode.

┌────────────────────────────────────────────────────────────┐
│                CPython interpreter process                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │ Thread 1 │  │ Thread 2 │  │ Thread 3 │                  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                  │
│       │             │             │                        │
│       └─────────────┼─────────────┘                        │
│                     │                                      │
│             ┌───────▼───────┐                              │
│             │      GIL      │  ← one thread at a time      │
│             └───────┬───────┘                              │
│                     │                                      │
│             ┌───────▼───────┐                              │
│             │  Run Python   │                              │
│             │   bytecode    │                              │
│             └───────────────┘                              │
└────────────────────────────────────────────────────────────┘

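This means that even if you start 8 threads on an 8-core CPU, only 1 thread is executing Python code at any given moment. Multithreading does not speed up CPU-bound computation.

1.2 Why does the GIL exist?

The GIL is rooted in CPython's memory management design. Recall the key facts from the companion article on Python memory management:

Every Python object has an ob_refcnt field that stores its reference count
Py_INCREF and Py_DECREF are C macros that apply ++ and -- directly to ob_refcnt
Neither operation takes a lock, because the GIL guarantees that only one thread modifies reference counts at a time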

Without the GIL:
  Thread A: obj.refcnt++  (reads 5, writes 6)
  Thread B: obj.refcnt--  (reads 5, writes 4)  ← race! The result should be 5 but ends up 4 or 6

With the GIL:
  Thread A holds the GIL: obj.refcnt++  (5→6)
  Thread A releases the GIL
  Thread B acquires the GIL: obj.refcnt--  (6→5)  ← correct

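Without the GIL, CPython would need an atomic operation or a lock at every Py_INCREF/Py_DECREF. Done naively, the performance cost can be severe, because reference count updates are among the most frequent operations in Python (every assignment, argument pass, and container operation triggers them).

Compared with Java: Java uses a tracing GC (reachability analysis), so it does not maintain reference counts on every assignment. The JVM provides thread safety through the JMM (Java Memory Model) and mechanisms such as synchronized/volatile, and it has no global bottleneck like the GIL.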
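
To make the "every assignment touches the reference count" point concrete, here is a minimal sketch that watches a count change. The `Payload` class and the variable names are only illustrative; `sys.getrefcount` is the standard-library call, and it always reports one extra reference for its own argument.

```python
import sys

class Payload:
    """A throwaway object whose reference count we can observe."""

payload = Payload()

# getrefcount reports one extra reference: the temporary argument itself.
before = sys.getrefcount(payload)

alias = payload       # plain assignment: one more reference
holder = [payload]    # storing in a container: one more reference
after = sys.getrefcount(payload)

print(f"before={before}, after={after}")
```

Each of those two innocent-looking lines wrote to the object's header, which is exactly the kind of write the GIL protects.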

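1.3 The GIL acquire/release cycle

The GIL is not held by a single thread forever. The thread that holds it gives it up at well-defined points, either voluntarily when it blocks or because the interpreter asks it to once another thread has waited long enough, so other threads get a chance to run: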

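Timeline ────────────────────────────────────────────────────▶

Thread 1: [holds GIL, runs] ──┬── [waits] ── [holds GIL, runs] ──
                              │
Thread 2:    [waits] ─────────┴── [holds GIL, runs] ── [waits] ──

When the GIL is released:
  1. Another thread is waiting and the running thread has used up the interval set by sys.setswitchinterval (default 5 ms)
  2. The thread performs blocking IO (file reads and writes, network requests, and so on)
  3. The thread calls time.sleep()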

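The key knob is sys.setswitchinterval:

import sys

print(sys.getswitchinterval())  # 0.005 (5 ms by default)

# Larger: less thread-switching overhead, but lower responsiveness
sys.setswitchinterval(0.01)  # 10 ms

# Smaller: better responsiveness, but more switching overhead
sys.setswitchinterval(0.001)  # 1 ms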

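1.4 CPU-bound vs IO-bound

The GIL affects the two kinds of work very differently:

CPU-bound (mostly computing):
  Thread 1: [████████████████]  ← holds the GIL the whole time
  Thread 2:                    [████████████████]
  Result: effectively serial; threads can even be slower because of switching overhead

IO-bound (mostly waiting):
  Thread 1: [██] [waiting on IO...] [██] [waiting on IO...]
  Thread 2:      [██] [waiting on IO...] [██] [waiting on IO]
  Result: the GIL is released while waiting on IO, so other threads can run and threading helps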

import time
import threading

# CPU-bound: threads do not speed this up
def cpu_bound():
    total = 0
    for i in range(50_000_000):
        total += i
    return total

# IO-bound: threads help
def io_bound():
    time.sleep(1)  # the GIL is released during sleep
    return "done"

# Single thread
start = time.time()
cpu_bound()
cpu_bound()
print(f"CPU single-threaded: {time.time() - start:.2f}s")

# Multiple threads (CPU-bound): no faster
start = time.time()
t1 = threading.Thread(target=cpu_bound)
t2 = threading.Thread(target=cpu_bound)
t1.start(); t2.start()
t1.join(); t2.join()
print(f"CPU multithreaded: {time.time() - start:.2f}s")  # ≈ the single-threaded time (often slightly worse), not half of it

# Multiple threads (IO-bound): faster
start = time.time()
t1 = threading.Thread(target=io_bound)
t2 = threading.Thread(target=io_bound)
t1.start(); t2.start()
t1.join(); t2.join()
print(f"IO multithreaded: {time.time() - start:.2f}s")  # ≈ 1s rather than 2s

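1.5 Comparison with Java

Concept	Java	Python (CPython)
Basis of thread safety	JMM + happens-before	GIL (in practice makes only individual bytecode operations atomic)
Locks	`synchronized`, `ReentrantLock`	`threading.Lock`, `threading.RLock`
Visibility	Guaranteed by `volatile`	No `volatile` concept; implied by the GIL
Multithreaded parallelism	✅ True parallelism	❌ Only one thread executes Python bytecode at a time
Memory model	Fully specified JMM	No formal memory model; relies on the GIL

Java developers tend to assume that "multithreading = parallel speedup". That equation does not hold in Python: Python's multithreading is concurrent, not parallel.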

2. threading: Threads Constrained by the GIL

2.1 Creating and using a Thread

Python's threading.Thread is used much like Java's Thread:

import threading

# Option 1: pass a target function
def worker(name, count):
    for i in range(count):
        print(f"{name}: {i}")

t = threading.Thread(target=worker, args=("Thread-1", 5))
t.start()
t.join()  # wait for the thread to finish, like Java's thread.join()

# Option 2: subclass Thread
class MyThread(threading.Thread):
    def __init__(self, name):
        super().__init__(name=name)

    def run(self):
        print(f"Running in {self.name}")

t = MyThread("Custom")
t.start()
t.join()

2.2 ThreadPoolExecutor: same name, different fate

Both Python and Java have a ThreadPoolExecutor, but they behave very differently:

from concurrent.futures import ThreadPoolExecutor
import time

def fetch_url(url):
    time.sleep(1)  # simulate IO
    return f"Response from {url}"

urls = [f"https://api.example.com/{i}" for i in range(10)]

# Python ThreadPoolExecutor: suited to IO-bound work
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch_url, urls))
    # 10 tasks on 5 threads finish in about 2 seconds (rather than 10)

Feature	Java ThreadPoolExecutor	Python ThreadPoolExecutor
Parallelism	✅ True multi-core parallelism	❌ Limited by the GIL; concurrency only
Suitable for	CPU-bound + IO-bound	IO-bound only
Core parameters	corePoolSize, maxPoolSize, keepAliveTime	max_workers
Rejection policy	AbortPolicy, CallerRunsPolicy, etc.	No built-in rejection policy
Queue	BlockingQueue (bounded or unbounded)	Internal unbounded queue
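
Because the Python pool has an unbounded queue and no rejection policy, a fast producer can pile up work without limit. One common workaround, sketched below, is to gate `submit` with a semaphore so the caller blocks once too many tasks are in flight. `BoundedExecutor` and `slow_square` are hypothetical names for this sketch, not part of the standard library.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class BoundedExecutor:
    """Hypothetical wrapper: submit() blocks once `bound` tasks are in flight."""

    def __init__(self, max_workers, bound):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(bound)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # blocks the caller instead of growing the queue
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the task finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._executor.shutdown(wait=wait)

def slow_square(x):
    time.sleep(0.01)  # stand-in for an IO call
    return x * x

pool = BoundedExecutor(max_workers=2, bound=4)
futures = [pool.submit(slow_square, i) for i in range(10)]
squares = [f.result() for f in futures]
pool.shutdown()
print(squares)
```

The effect is roughly what a bounded queue that blocks the caller gives you in Java: backpressure on the producer rather than unbounded memory growth.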

2.3 Synchronization primitives

Although the GIL makes individual bytecode operations atomic in practice, compound operations still need a lock:

import threading
import time

# Lock: a mutex. Unlike Java's ReentrantLock, it is NOT reentrant
lock = threading.Lock()
counter = 0

def increment():
    global counter
    for _ in range(100000):
        with lock:  # like Java's synchronized(lock) { }
            counter += 1  # compound read-modify-write, so it needs the lock

# RLock: a reentrant lock; the same thread may acquire it repeatedly
rlock = threading.RLock()

def recursive_func(n):
    with rlock:
        if n > 0:
            recursive_func(n - 1)  # same thread acquires again, no deadlock

# Condition: a condition variable, like Java's Condition
condition = threading.Condition()
items = []

def producer():
    with condition:
        items.append("item")
        condition.notify()  # like Java's condition.signal()

def consumer():
    with condition:
        while not items:
            condition.wait()  # like Java's condition.await()
        item = items.pop()

# Semaphore: limits how many threads run a section at once
semaphore = threading.Semaphore(3)  # at most 3 threads at a time

def do_work():
    pass  # placeholder for the real work

def limited_access():
    with semaphore:
        # at most 3 threads execute this section at the same time
        do_work()

# Event: signalling between threads
event = threading.Event()

def waiter():
    print("Waiting for the event...")
    event.wait()  # blocks until event.set()
    print("Event fired!")

def setter():
    time.sleep(2)
    event.set()  # wakes every waiting thread

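2.4 Synchronization primitives compared

Primitive	Python	Java
Mutex	`threading.Lock` (non-reentrant)	`ReentrantLock` (closest match, but reentrant)
Reentrant lock	`threading.RLock`	`ReentrantLock` (reentrant by default)
Condition variable	`threading.Condition`	`Condition` (created from a Lock)
Semaphore	`threading.Semaphore`	`Semaphore`
Countdown latch	None built in; `threading.Barrier` covers some uses, or build one from `threading.Condition`	`CountDownLatch`
Cyclic barrier	`threading.Barrier`	`CyclicBarrier`
Read-write lock	None built in (third-party `readerwriterlock`)	`ReentrantReadWriteLock`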
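
Since Python has no built-in countdown latch, here is one way it could be put together from `threading.Condition`. This is a minimal sketch that mirrors only the core of Java's `CountDownLatch`; the class and its method names are assumptions made for illustration.

```python
import threading

class CountDownLatch:
    """Hypothetical helper mirroring the core of Java's CountDownLatch."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def count_down(self):
        with self._cond:
            if self._count > 0:
                self._count -= 1
                if self._count == 0:
                    self._cond.notify_all()  # release every waiter at once

    def wait(self, timeout=None):
        with self._cond:
            # wait_for re-checks the predicate after every wakeup.
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)

latch = CountDownLatch(3)
finished = []

def worker(n):
    finished.append(n)
    latch.count_down()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

released = latch.wait(timeout=5)
for t in threads:
    t.join()
print(released, sorted(finished))
```

Unlike `threading.Barrier`, the threads that count down here never block; only the waiter does, which is the semantic difference between a latch and a barrier.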

2.5 Thread-local storage

import threading

# Similar to Java's ThreadLocal
thread_local = threading.local()

def process():
    thread_local.data = threading.current_thread().name
    print(thread_local.data)  # each thread sees its own value
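
To see the isolation in action, the following sketch runs three threads against one `threading.local()` object and records what each of them reads back. The `seen` dict and the worker labels are illustrative only.

```python
import threading

thread_local = threading.local()
seen = {}

def process(label):
    thread_local.data = label          # written on this thread only
    seen[label] = thread_local.data    # read back this thread's own copy

threads = [threading.Thread(target=process, args=(f"worker-{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread never set .data, so the attribute does not exist here.
main_has_data = hasattr(thread_local, "data")
print(seen, main_has_data)
```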

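3. multiprocessing: The Price of Bypassing the GIL

3.1 Why multiple processes?

threading cannot speed up CPU-bound work. multiprocessing bypasses the GIL by creating separate processes: each process has its own Python interpreter and its own GIL, so they can run truly in parallel.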

┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│    Process 1     │  │    Process 2     │  │    Process 3     │
│ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │
│ │ Interpreter  │ │  │ │ Interpreter  │ │  │ │ Interpreter  │ │
│ │  (own GIL)   │ │  │ │  (own GIL)   │ │  │ │  (own GIL)   │ │
│ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │
│ ┌──────────────┐ │  │ ┌──────────────┐ │  │ ┌──────────────┐ │
│ │  Own memory  │ │  │ │  Own memory  │ │  │ │  Own memory  │ │
│ └──────────────┘ │  │ └──────────────┘ │  │ └──────────────┘ │
└──────────────────┘  └──────────────────┘  └──────────────────┘
         ↕ IPC                 ↕ IPC                 ↕ IPC
    ┌──────────────────────────────────────────────────────┐
    │      Inter-process communication (Queue / Pipe)      │
    └──────────────────────────────────────────────────────┘

3.2 Creating a Process

from multiprocessing import Process
import os

def cpu_heavy(n):
    """CPU-bound task: multiple processes can run it truly in parallel"""
    total = 0
    for i in range(n):
        total += i
    print(f"PID {os.getpid()}: result = {total}")
    return total  # note: Process discards the target's return value

if __name__ == "__main__":
    p1 = Process(target=cpu_heavy, args=(50_000_000,))
    p2 = Process(target=cpu_heavy, args=(50_000_000,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    # The two processes run in parallel, so total time ≈ the time of one task

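3.3 fork and COW: Python's awkward spot

Many sources say that "fork copies the parent's entire memory space". That is an oversimplification; what really happens is more subtle: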

What Linux fork() really does (copy-on-write):

  Physical memory view:
  ┌──────────────────────────────────────────────────────────┐
  │  Parent process pages                                    │
  │  ┌───┬───┬───┬───┬───┬───┬───┬───┐                       │
  │  │ A │ B │ C │ D │ E │ F │ G │ H │                       │
  │  └───┴───┴───┴───┴───┴───┴───┴───┘                       │
  │    ↕ COW: parent and child share pages, marked read-only │
  │  ┌───┬───┬───┬───┬───┬───┬───┬───┐                       │
  │  │ A │ B │ C │ D │ E │ F │ G │ H │                       │
  │  └───┴───┴───┴───┴───┴───┴───┴───┘                       │
  │  Child process pages (separate virtual addresses,        │
  │  shared physical pages)                                  │
  │                                                          │
  │  A page is copied only when someone writes to it         │
  └──────────────────────────────────────────────────────────┘

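fork itself is fast: it copies only the page tables (the mapping from virtual to physical addresses), not the data. But Python runs into an awkward problem:

# After fork, even just "reading" the parent's data in the child triggers memory copies
# Why: reading a Python object → may do ob_refcnt++ → that is a write → COW is defeated

In other words, if the child needs to access many Python objects that live in the parent, fork will almost certainly lead to heavy memory copying. That is not fork's fault; it is a side effect of Python's reference counting.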

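Scenario comparison:
┌──────────────────────────────────────────────────────────────────────────────┐
│ Scenario A: parent holds 10 GB; the child only computes (never touches it)   │
│   → COW holds, almost nothing is copied, fork is fast                        │
│                                                                              │
│ Scenario B: parent holds 10 GB; the child iterates over that data            │
│   → each read bumps a refcount → the page is written → COW is defeated       │
│   → fork itself stays fast, but memory use can approach double               │
│                                                                              │
│ Scenario C: use the spawn start method                                       │
│   → nothing is copied; a brand-new interpreter starts                        │
│   → slow startup (modules are re-imported), but memory is clean              │
│   → data must be passed via pickle serialization                             │
└──────────────────────────────────────────────────────────────────────────────┘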

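An alternative for large-memory workloads: Python 3.8+ provides multiprocessing.shared_memory, which lets multiple processes share one block of physical memory without copying:

from multiprocessing import shared_memory
import numpy as np

# Create a shared memory block
a = np.array([1, 2, 3, 4, 5])
shm = shared_memory.SharedMemory(create=True, size=a.nbytes)

# Write the data into shared memory
b = np.ndarray(a.shape, dtype=a.dtype, buffer=shm.buf)
b[:] = a[:]

# Other processes can attach to the same block by name
# existing_shm = shared_memory.SharedMemory(name=shm.name)

# When finished: drop array views, close every handle, then unlink once in the creator
del b
shm.close()
shm.unlink()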
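
The attach-by-name step can be tried without NumPy or a second process. The sketch below opens a second handle in the same process purely to show the mechanics; in real use the `peer` handle would live in another process that received `shm.name`. The payload bytes are made up for the example.

```python
from multiprocessing import shared_memory

payload = b"hello from the creator"

# Creator side: allocate a block and copy bytes in.
shm = shared_memory.SharedMemory(create=True, size=len(payload))
try:
    shm.buf[:len(payload)] = payload

    # Consumer side (normally in another process): attach by name.
    peer = shared_memory.SharedMemory(name=shm.name)
    try:
        received = bytes(peer.buf[:len(payload)])
    finally:
        peer.close()   # every handle closes its own mapping
finally:
    shm.close()
    shm.unlink()       # only the creator removes the block

print(received)
```

The block may be rounded up to a page size on some platforms, which is why the sketch slices by `len(payload)` rather than reading the whole buffer.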

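3.4 fork vs spawn

multiprocessing has three start methods, which determine how processes are created:

fork (the default on Linux through Python 3.13):
  The parent calls fork() → the child shares the parent's physical pages (COW)
  Pros: fast; the child can read the parent's variables directly
  Cons: Python reference counting easily defeats COW; fork is dangerous in multithreaded programs

spawn (the default on Windows and macOS):
  The parent starts a fresh Python interpreter → the child imports modules from scratch
  Pros: safe; inherits none of the parent's memory state; no COW issue
  Cons: slow, because modules are re-imported; data must be passed via pickle

forkserver (the default on Linux and other non-macOS POSIX platforms since Python 3.14):
  A server process is started up front and children are forked from it on demand
  A middle ground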

import multiprocessing as mp

print(mp.get_start_method())  # 'spawn' on macOS/Windows; on Linux 'fork' (through 3.13) or 'forkserver' (3.14+)

# List the available methods
print(mp.get_all_start_methods())  # e.g. ['fork', 'spawn', 'forkserver'] on Linux; only ['spawn'] on Windows
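
Since the default differs by platform and Python version, code that cares about the start method can request one explicitly. A minimal sketch using `multiprocessing.get_context`, which returns a context bound to one method without changing the process-wide default:

```python
import multiprocessing as mp

# A context is a module-like object bound to one start method; it leaves
# the process-wide default untouched, unlike set_start_method().
ctx = mp.get_context("spawn")   # "spawn" exists on every platform

method = ctx.get_start_method()
available = mp.get_all_start_methods()
print(method, available)

# Usage inside a real program (left commented so this sketch spawns nothing):
# if __name__ == "__main__":
#     with ctx.Pool(2) as pool:
#         print(pool.map(abs, [-1, -2]))
```

Choosing the method per context keeps library code from fighting with the application over a global setting.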

3.5 Inter-process communication (IPC)

Because every process has its own memory space, sharing data requires an IPC mechanism:

from multiprocessing import Process, Queue, Pipe, Manager

# Queue: a process-safe queue, similar to Java's BlockingQueue
def producer(q):
    for i in range(5):
        q.put(f"item-{i}")

def consumer(q):
    while True:
        item = q.get()
        if item is None:  # sentinel value
            break
        print(f"Consumed: {item}")

# Pipe: a two-way communication channel
def pipe_worker(conn):
    conn.send("Hello from child")
    print(conn.recv())  # "Hello from parent"

# The guard is required under spawn/forkserver, where children re-import this module
if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start(); p2.start()
    p1.join()
    q.put(None)  # send the stop signal
    p2.join()

    parent_conn, child_conn = Pipe()
    p = Process(target=pipe_worker, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # "Hello from child"
    parent_conn.send("Hello from parent")
    p.join()

    # Manager: shared state (backed by proxies, so every access pays a serialization cost)
    with Manager() as manager:
        shared_dict = manager.dict()  # a dict shared across processes
        shared_list = manager.list()  # a list shared across processes
3.6 ProcessPoolExecutor

from concurrent.futures import ProcessPoolExecutor
import math

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

numbers = [10_000_000 + i for i in range(100)]

# ProcessPoolExecutor: the first choice for CPU-bound work
if __name__ == "__main__":  # required under spawn/forkserver
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(is_prime, numbers))
        # 4 processes compute truly in parallel

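3.7 The costs of multiprocessing

Cost	Description
Memory copying	fork uses COW and copies nothing up front, but Python reference counting easily defeats COW (see section 3.3). For large-memory workloads, consider spawn or `shared_memory`
Serialization overhead	Data sent through pools, queues, and pipes is always pickled, and under spawn/forkserver so are the arguments to Process; with large payloads this becomes the bottleneck
Startup cost	Under spawn, every child starts a new Python interpreter and re-imports modules
IPC overhead	Inter-process communication requires serialization and deserialization, far slower than sharing memory between threads
Debugging difficulty	Exceptions and logs are harder to trace across processes than across threads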

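# Serialization pitfall example
from multiprocessing import Process

def worker(fn, data):
    return fn(data)

# ❌ A lambda cannot be pickled, so this fails when p.start() runs under spawn/forkserver
# (under fork the arguments are inherited rather than pickled, so it happens to work)
# p = Process(target=worker, args=(lambda x: x * 2, 10))

# ✅ Use a regular module-level function
def double(x):
    return x * 2

p = Process(target=worker, args=(double, 10))  # OK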
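
A cheap way to catch this class of bug early is to try pickling the object yourself before handing it to a process. The helper below is a hypothetical sketch (`can_pickle` is not a standard-library function); it uses `math.sqrt` as an example of a function that pickle can find by its importable name.

```python
import math
import pickle

def can_pickle(obj):
    """Return True if obj survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

checks = {
    "importable function (math.sqrt)": can_pickle(math.sqrt),
    "lambda": can_pickle(lambda x: x * 2),
    "plain data": can_pickle({"n": 10, "items": [1, 2, 3]}),
}
for label, ok in checks.items():
    print(f"{label}: {ok}")
```

Functions are pickled by reference (module plus qualified name), which is why an anonymous lambda fails while a named, importable function works.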

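3.8 Comparison with Java

Concept	Java	Python
Multiprocess framework	None built in (third-party, or manual Runtime.exec)	`multiprocessing` in the standard library
Process pool	ForkJoinPool (a thread pool, not a process pool)	`ProcessPoolExecutor`
Shared memory	Heap is shared naturally between threads	Explicit IPC required (Queue/Pipe/Manager)
Serialization	Java serialization / Kryo / Protobuf	pickle (objects must be picklable)
Typical use	Rarely needs multiple processes (JVM threads already run in parallel)	The workhorse for CPU-bound tasks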

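4. asyncio: The Philosophy of Single-Threaded Coroutines

4.1 A different approach to concurrency

threading and multiprocessing both make progress on several tasks by handing multiple threads or processes to the operating system's scheduler. asyncio takes another route: a single thread plus cooperative scheduling. By default all coroutines share one thread, and the event loop schedules them on it.

Multithreaded model (preemptive):
  Thread 1: ──┬──────┬──────┬──────▶
  Thread 2: ──┴──┬───┴──────┴──────▶
  Thread 3: ─────┴──┬──────────────▶
  Scheduler: the operating system, which can preempt at any time

asyncio model (cooperative, single thread):
  Task 1: ──┬──────┬──────┬──────▶
  Task 2: ──┴──┬───┴──────┴──────▶
  Task 3: ─────┴──┬──────────────▶
  Scheduler: the event loop (same thread), which switches only at an await

The core idea: while you wait on IO, yield control and let other tasks run instead of blocking the thread.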

4.2 The event loop

The event loop is the heart of asyncio. It is an endless loop that keeps checking which tasks can make progress:

┌───────────────────────────────────────────────────────────┐
│            Event loop (single thread)                     │
│                                                           │
│   while True:                                             │
│     1. Check which IO operations have completed           │
│     2. Wake the coroutines waiting on that IO             │
│     3. Run each ready coroutine until its next await      │
│     4. If nothing is ready, wait for IO events            │
│                                                           │
│  ┌───────┐  ┌───────┐  ┌───────┐                          │
│  │Task A │  │Task B │  │Task C │  ...                     │
│  │(wait) │  │(ready)│  │(wait) │                          │
│  └───────┘  └───┬───┘  └───────┘                          │
│                 │ run                                     │
│                 ▼                                         │
│          await some_io()                                  │
│                 │                                         │
│         ┌───────▼───────┐                                 │
│         │ Yield control │ ──▶ back to the event loop      │
│         └───────────────┘                                 │
└───────────────────────────────────────────────────────────┘

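4.3 Coroutines and async/await

import asyncio

# Regular function: runs as soon as it is called, until it returns
def normal():
    return "done"

# Coroutine function: calling it returns a coroutine object and runs nothing yet
async def coroutine():
    await asyncio.sleep(1)  # yields control without blocking the thread
    return "done"

# Calling a coroutine function does not execute it
coro = coroutine()
print(type(coro))  # <class 'coroutine'>

# It has to be driven by an event loop
result = asyncio.run(coro)  # the entry point since Python 3.7
print(result)  # "done"

The essence of await: await is not a blocking wait but a way to yield control. It tells the event loop "I need a result here; go handle other tasks and wake me when it is ready".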

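import asyncio

async def http_get(url):
    # Placeholder for a real async HTTP client call (for example via aiohttp)
    await asyncio.sleep(1)
    return f"response from {url}"

async def fetch_data(url):
    print(f"Request started: {url}")
    # Control is yielded at the await, so the event loop can run other coroutines
    response = await http_get(url)
    print(f"Request finished: {url}")
    return response

# Run several coroutines concurrently
async def main():
    # The three requests are in flight "at the same time" (interleaved on one thread)
    results = await asyncio.gather(
        fetch_data("url1"),
        fetch_data("url2"),
        fetch_data("url3"),
    )
    # Total time ≈ the slowest request, not the sum of all three
    return results

asyncio.run(main())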

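4.4 Task: the unit of scheduling

A Task wraps a coroutine and submits it to the event loop for scheduling:

async def main():
    # Creating a Task schedules the coroutine on the event loop right away
    task1 = asyncio.create_task(fetch_data("url1"))
    task2 = asyncio.create_task(fetch_data("url2"))

    # Both tasks are now scheduled; they start running the next time main() yields at an await

    # Do other things...
    print("Both requests are scheduled; I am doing something else")

    # Wait for the results
    result1 = await task1
    result2 = await task2

Lifecycle of a task created with create_task:

  coroutine ──▶ Task ──▶ scheduled by event loop ──▶ runs until an await ──▶ suspended
                  │                                                              │
                  │              ┌───────────────────────────────────────────────┘
                  │              │ (IO completes, the event loop wakes the task)
                  ▼              ▼
               finished ←── resumes ←── back in the ready queue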
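
The scheduling order can be observed directly. In this small sketch (the `events` list and coroutine names are illustrative), the task created with `create_task` does not run a single line until `main` itself reaches an `await` and hands control back to the event loop.

```python
import asyncio

events = []

async def child(name):
    events.append(f"{name} started")
    await asyncio.sleep(0)        # yield once to the event loop
    events.append(f"{name} finished")

async def main():
    task = asyncio.create_task(child("task"))
    events.append("main: task created")   # the child has not run yet
    await asyncio.sleep(0)                 # main yields; the loop runs the child
    events.append("main: resumed")
    await task

asyncio.run(main())
print(events)
```

This is the cooperative model in miniature: nothing is preempted, and every switch happens at an explicit `await`.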

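4.5 Choosing between asyncio and threading

Dimension	asyncio	threading
Concurrency model	Cooperative (switches at await)	Preemptive (scheduled by the OS)
Switching cost	Tiny (on the order of a function call)	Higher (context switch)
Memory cost	Tiny (about KBs per coroutine)	Higher (about MBs per thread)
Scale	Tens of thousands of coroutines	Tens to hundreds of threads
Blocking calls	Must use async versions	May block (but not recommended)
CPU-bound work	❌ Blocks the event loop	❌ Limited by the GIL
Learning curve	Steeper (async/await is contagious)	Gentler
Debugging	Harder (stack traces are less intuitive)	Easier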

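4.6 Comparison with Java

Concept	Java	Python asyncio
Async programming model	`CompletableFuture` + virtual threads (Java 21+)	`async`/`await` + event loop
Future	`CompletableFuture<T>` (composable)	`asyncio.Future` (similar but lower level)
Task scheduling	`ForkJoinPool.commonPool()`	Event loop (single thread)
Lightweight concurrency unit	Java 21 virtual threads (ordinary blocking code; the JVM parks them when they block)	asyncio coroutines (cooperative; switch only at `await`)
Ecosystem	Spring WebFlux, Project Reactor	aiohttp, FastAPI, asyncpg

Java 21's virtual threads and Python's asyncio coroutines aim at the same goal (high-concurrency IO) but follow different philosophies. A virtual thread runs ordinary blocking code: when it blocks, the JVM unmounts it from its carrier thread, so other virtual threads keep running. An asyncio coroutine is cooperative and switches only at an await, so a blocking call stalls the entire event loop.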

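4.7 More on asyncio

asyncio is a huge topic: gather/create_task/wait/as_completed, rate limiting with Semaphore, producer-consumer with Queue, synchronization primitives, bridging to synchronous code, third-party event loops such as uvloop, and more. Those will be covered in a later article dedicated to asyncio. The goal here is to give you the core mental model of asyncio, enough to support a selection decision.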

5. concurrent.futures: A Unified Abstraction Layer

5.1 The Future concept

The concurrent.futures module provides one interface for thread pools and process pools. Its central concept is the Future, an object that represents "a computation that will finish later":

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, Future

# Both pools use exactly the same API
def compute(x):
    return x * x

# Thread pool
with ThreadPoolExecutor(max_workers=4) as executor:
    future = executor.submit(compute, 10)  # returns a Future
    print(type(future))  # <class 'concurrent.futures._base.Future'>
    result = future.result()  # blocks until the result is ready
    print(result)  # 100

# Process pool: identical API
with ProcessPoolExecutor(max_workers=4) as executor:
    future = executor.submit(compute, 10)
    result = future.result()
    print(result)  # 100

5.2 submit vs map

from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import random

def fetch(url):
    time.sleep(random.uniform(0.5, 2.0))
    return f"{url} -> done"

urls = [f"url-{i}" for i in range(10)]

# submit: submit one by one, collect one by one
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):  # yields in completion order
        print(future.result())

# map: submit in bulk, keep the input order
with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(fetch, urls)  # returns an iterator
    for result in results:  # yields in submission order
        print(result)

5.3 The value of a unified interface

# Switching between a thread pool and a process pool takes a one-line change
def run_with_executor(executor_class):
    with executor_class(max_workers=4) as executor:
        return list(executor.map(compute, range(100)))

# IO-bound → thread pool
results = run_with_executor(ThreadPoolExecutor)

# CPU-bound → process pool
results = run_with_executor(ProcessPoolExecutor)

Method	Description	Java analogue
`submit(fn, *args)`	Submit a task, get a Future	`executor.submit(callable)`
`map(fn, *iterables)`	Submit in bulk, keep order	`executor.invokeAll(tasks)`
`as_completed(fs)`	Iterate in completion order	`CompletionService.take()`
`future.result()`	Block until the result is ready	`Future.get()`
`future.done()`	Whether it has finished	`Future.isDone()`
`future.cancel()`	Cancel the task	`Future.cancel()`
`executor.shutdown()`	Shut the pool down (called automatically by with)	`executor.shutdown()`
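
One detail the table does not show is how failures travel. In Python, `future.result()` re-raises the original exception, and `future.exception()` returns it without raising. The sketch below uses a made-up `parse_port` task to collect successes and failures separately.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_port(text):
    port = int(text)              # raises ValueError on bad input
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port

inputs = ["80", "not-a-port", "443", "70000"]
ok, failed = {}, {}

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = {executor.submit(parse_port, text): text for text in inputs}
    for future in as_completed(futures):
        text = futures[future]
        error = future.exception()        # None when the call succeeded
        if error is None:
            ok[text] = future.result()
        else:
            failed[text] = type(error).__name__

print(ok, failed)
```

Java developers may notice there is no wrapper like `ExecutionException` here; the worker's own exception type comes straight through.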

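6. A Decision Framework

6.1 Decision flow

What is your task's bottleneck?
│
├── CPU-bound (computation, image processing, data transformation)
│   │
│   └──▶ multiprocessing (ProcessPoolExecutor)
│         ├─ Few tasks (< number of CPU cores): use Process directly
│         └─ Many tasks: ProcessPoolExecutor
│
├── IO-bound (network requests, file reads and writes, database queries)
│   │
│   ├── High concurrency (thousands of connections or more)
│   │   └──▶ asyncio (aiohttp, FastAPI, asyncpg)
│   │
│   ├── Moderate concurrency (tens to hundreds)
│   │   ├── Existing synchronous codebase → threading (ThreadPoolExecutor)
│   │   └── New project, team comfortable with async → asyncio
│   │
│   └── Just a few IO tasks to run side by side
│       └──▶ threading (ThreadPoolExecutor)
│
└── Mixed (both CPU work and IO waits)
    │
    └──▶ asyncio + run_in_executor
          ├─ The main flow handles IO with asyncio
          └─ CPU-heavy parts go to a process pool via run_in_executor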

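6.2 Decision table

Scenario	Recommended approach	Reason
Web crawler (many concurrent requests)	asyncio + aiohttp	Tens of thousands of connections, low overhead
Image/video processing	multiprocessing	CPU-bound, needs true parallelism
Bulk file reads and writes	threading	IO-bound, synchronous code is simple
REST API service	FastAPI (asyncio)	High-concurrency IO, mature ecosystem
Data science / ML inference	multiprocessing	CPU/GPU-bound
Bulk database operations	threading or asyncio	Depends on the driver and the concurrency level
Message queue consumer	threading or asyncio	Depends on throughput requirements
Simple background task	threading.Thread	Simplest option, good enough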

6.3 The hybrid approach: asyncio + run_in_executor

import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_intensive(n):
    """CPU-bound computation"""
    total = 0
    for i in range(n):
        total += i
    return total

async def io_task(url):
    """IO-bound operation"""
    await asyncio.sleep(1)  # simulate a network request
    return f"fetched {url}"

async def main():
    loop = asyncio.get_running_loop()

    # IO part: concurrent via asyncio
    io_tasks = [io_task(f"url-{i}") for i in range(10)]

    # CPU part: hand off to a process pool so the event loop is not blocked
    with ProcessPoolExecutor() as pool:
        cpu_tasks = [
            loop.run_in_executor(pool, cpu_intensive, 10_000_000)
            for _ in range(4)
        ]
        all_results = await asyncio.gather(*io_tasks, *cpu_tasks)

    return all_results

if __name__ == "__main__":  # required because the pool may start workers with spawn/forkserver
    asyncio.run(main())

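7. Common Pitfalls and Patterns

7.1 The GIL does not make your code atomic

This is the trap Java developers fall into most often. The GIL makes individual bytecode operations atomic; it does not make Python statements atomic:

import threading

counter = 0

def increment():
    global counter
    for _ in range(1_000_000):
        counter += 1  # looks like one step, but compiles to several bytecode operations

threads = [threading.Thread(target=increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # Expected 10,000,000; the result can be lower (e.g. 6,xxx,xxx). Recent CPython versions switch threads at fewer points and often print the right number, but nothing guarantees it

Bytecode breakdown of counter += 1 (opcode names vary by CPython version):

LOAD_GLOBAL    counter    # read counter
LOAD_CONST     1          # load 1
BINARY_OP      +=         # add (INPLACE_ADD before Python 3.11)
STORE_GLOBAL   counter    # write back to counter

If a thread switch lands between the load and the store, an update is lost. The fix: use threading.Lock.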
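
Both halves of this section can be checked on your own interpreter. The sketch below first lists the opcodes that `counter += 1` compiles to (printed rather than assumed, since names change between versions), then shows a lock-protected counter. The `Counter` class is a hypothetical helper for this example.

```python
import dis
import threading

def bump():
    global counter
    counter += 1

# Opcode names differ across CPython versions, so print them instead of assuming.
opnames = [ins.opname for ins in dis.get_instructions(bump)]
print(opnames)

class Counter:
    """Hypothetical lock-protected counter."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:      # makes the load/add/store sequence indivisible
            self.value += 1

shared = Counter()

def worker(times):
    for _ in range(times):
        shared.increment()

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.value)
```

With the lock in place the final value is deterministic, regardless of where the interpreter chooses to switch threads.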

7.2 Race conditions on shared state

import threading

class BankAccount:
    def __init__(self):
        self.balance = 0
        self.lock = threading.Lock()

    def deposit(self, amount):
        with self.lock:  # without the lock the balance can end up wrong
            current = self.balance
            current += amount
            self.balance = current

# ❌ Even with the GIL, compound operations are unsafe without a lock
# ✅ Every read-modify-write sequence needs lock protection

7.3 Blocking calls inside asyncio

import asyncio
import time

async def bad_handler():
    # ❌ time.sleep blocks synchronously and stalls the whole event loop!
    time.sleep(5)
    return "done"

async def good_handler():
    # ✅ asyncio.sleep is asynchronous and yields control
    await asyncio.sleep(5)
    return "done"

async def main():
    # Start 3 bad_handler coroutines together
    start = time.time()
    await asyncio.gather(bad_handler(), bad_handler(), bad_handler())
    print(f"Bad: {time.time() - start:.1f}s")  # ~15s (serial!)

    # Start 3 good_handler coroutines together
    start = time.time()
    await asyncio.gather(good_handler(), good_handler(), good_handler())
    print(f"Good: {time.time() - start:.1f}s")  # ~5s (concurrent)

asyncio.run(main())

Common blocking traps: time.sleep(), requests.get(), open() on large files, and CPU-heavy computation. The fix: use the async equivalents (asyncio.sleep, aiohttp, aiofiles), or hand the blocking call to a thread pool with run_in_executor.
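
When no async equivalent is available, the thread-pool route can be as short as one call. Here is a minimal sketch using `asyncio.to_thread` (available since Python 3.9), which is a convenience wrapper over running a function in the loop's default thread pool. `blocking_call` is a stand-in for a synchronous library function.

```python
import asyncio
import time

def blocking_call(delay):
    """Stand-in for a synchronous library call such as requests.get."""
    time.sleep(delay)
    return delay

async def main():
    start = time.perf_counter()
    # Each blocking call runs on a worker thread; the event loop stays free.
    results = await asyncio.gather(
        asyncio.to_thread(blocking_call, 0.2),
        asyncio.to_thread(blocking_call, 0.2),
        asyncio.to_thread(blocking_call, 0.2),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The three calls overlap because `time.sleep` releases the GIL; a CPU-heavy function would still need a process pool instead.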

7.4 Serialization pitfalls with multiple processes

from multiprocessing import Process, Queue

# ❌ A lambda cannot be pickled (fails at start() under spawn/forkserver)
# Process(target=lambda x: x * 2, args=(10,)).start()

# ❌ A local class cannot be pickled (under spawn)
def create_worker():
    class LocalWorker:  # defined inside a function
        def __call__(self, x):
            return x * 2
    # Process(target=LocalWorker(), args=(10,)).start()  # pickling fails

# ❌ Putting an unpicklable object on a Queue
q = Queue()
# q.put(lambda x: x)  # pickling fails in the queue's background feeder thread

# ✅ Use top-level functions and picklable data
def worker(x):
    return x * 2

Process(target=worker, args=(10,))

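7.5 Deadlock scenarios

import threading
import time

lock_a = threading.Lock()
lock_b = threading.Lock()

def thread1():
    with lock_a:
        time.sleep(0.1)  # makes the deadlock more likely
        with lock_b:
            print("Thread 1 done")

def thread2():
    with lock_b:
        time.sleep(0.1)
        with lock_a:  # acquired in the opposite order from thread1 → deadlock
            print("Thread 2 done")

# Each thread waits for the lock the other one holds → deadlock

Rules for avoiding deadlock:

Acquire locks in a fixed order
Use RLock to avoid self-deadlock when the same thread re-enters
Set a timeout with threading.Lock.acquire(timeout=...)
Keep the locked region as small as possible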
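
The first rule, a fixed acquisition order, can be enforced mechanically. The sketch below sorts the locks by `id()` before taking them, so two threads that name the locks in opposite order still acquire them in the same order. `in_fixed_order` and `transfer` are hypothetical helpers, and sorting by `id()` is just one possible ordering key.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
done = []

def in_fixed_order(*locks):
    """Return the locks sorted by id() so every thread uses one global order."""
    return sorted(locks, key=id)

def transfer(name, first, second):
    ordered = in_fixed_order(first, second)
    with ordered[0]:
        with ordered[1]:
            done.append(name)

# The two threads name the locks in opposite order, as in the deadlock example.
t1 = threading.Thread(target=transfer, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=transfer, args=("t2", lock_b, lock_a))
t1.start(); t2.start()
t1.join(timeout=5); t2.join(timeout=5)
print(sorted(done))
```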

8. Troubleshooting in Production

8.1 Detecting deadlocks

import threading
import sys
import traceback

def dump_threads():
    """Print the stack of every thread, similar to Java's jstack"""
    for thread in threading.enumerate():
        print(f"\n=== {thread.name} (id={thread.ident}) ===")
        frame = sys._current_frames().get(thread.ident)
        if frame:
            traceback.print_stack(frame)

# Call periodically, or when a deadlock is suspected
dump_threads()
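
The standard library also ships a ready-made stack dumper, `faulthandler`, which writes at the C level and so keeps working even when the interpreter is badly stuck. A minimal sketch that dumps all thread stacks into a temporary file (the idle worker thread exists only to give the dump something to show):

```python
import faulthandler
import tempfile
import threading

stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="idle-worker")
worker.start()

# faulthandler writes at the C level, so it needs a real file descriptor.
with tempfile.TemporaryFile(mode="w+") as report:
    faulthandler.dump_traceback(file=report, all_threads=True)
    report.seek(0)
    dump = report.read()

stop.set()
worker.join()
print(dump)
```

On Unix-like systems `faulthandler.register` can additionally bind the dump to a signal, which gets closer to the "send a signal, get a thread dump" workflow of jstack; that call is not available on Windows.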

8.2 Monitoring thread/process state

import threading
import time

def monitor_threads(interval=5):
    """Print thread status periodically"""
    while True:
        threads = threading.enumerate()
        print(f"[{time.strftime('%H:%M:%S')}] Active threads: {len(threads)}")
        for t in threads:
            print(f"  - {t.name}: alive={t.is_alive()}, daemon={t.daemon}")
        time.sleep(interval)

# Start the monitoring thread
monitor = threading.Thread(target=monitor_threads, daemon=True)
monitor.start()

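8.3 asyncio debug mode

import asyncio
import logging

# Option 1: environment variable
# PYTHONASYNCIODEBUG=1 python script.py

# Option 2: enable it in code
asyncio.run(main(), debug=True)

# Option 3: configure logging
logging.basicConfig(level=logging.DEBUG)
# makes the event loop's detailed debug messages visible

# Debug mode detects:
# - Coroutines that were never awaited ("coroutine was never awaited" warning)
# - Callbacks that run too long (100 ms by default)
# - Exceptions that were raised but never retrieved (reported with more context)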

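8.4 Locating common performance bottlenecks

Symptom	Likely cause	What to do
CPU-bound work gets slower with threads	GIL contention + context-switch overhead	Switch to multiprocessing
asyncio application responds slowly	A synchronous blocking call inside a coroutine	Look for `time.sleep`, `requests`, and the like
Memory balloons with multiple processes	Too many processes + data copying	Reduce the process count; use shared memory
Thread count keeps growing	Misconfigured thread pool or leaked threads	Check ThreadPoolExecutor's max_workers
Intermittent stalls	The GIL is held for too long	Check for long-running C extensions that do not release the GIL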

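9. The Future: free-threading

9.1 What PEP 703 changed

Python 3.13 introduced an experimental free-threading mode (built with the --disable-gil configure option and typically shipped as the python3.13t binary), the biggest change in the history of Python concurrency.

Traditional CPython (with GIL):     free-threading (no GIL):
┌──────────────────────────┐        ┌──────────────────────────┐
│ Thread 1 ──┐             │        │ Thread 1 ──────────────▶ │
│            ├─ GIL ──▶    │        │ Thread 2 ──────────────▶ │
│ Thread 2 ──┘             │        │ Thread 3 ──────────────▶ │
│ One thread runs at a time│        │ True parallel threads!   │
└──────────────────────────┘        └──────────────────────────┘

Core changes:

True multithreaded parallelism: multiple threads can execute Python bytecode at the same time
Thread-safe reference counting: instead of relying on the GIL, ob_refcnt updates use biased reference counting, which keeps a fast path for the owning thread and falls back to atomic operations for other threads
A thread-safe allocator: pymalloc, which depends on the GIL, is replaced by mimalloc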

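9.2 What has not changed yet

Limitation	Description
Experimental status	Marked experimental in 3.13; APIs and behavior may change
C extension compatibility	Many C extensions rely on the GIL for thread safety and need to be adapted
Performance overhead	Thread-safe reference counting and fine-grained locking slow down single-threaded code (the 3.13 documentation cites roughly 40% on the pyperformance suite, with a stated goal of 10% or less in later releases)
Ecosystem maturity	Support in core libraries such as numpy and pandas is still maturing
Memory overhead	Fine-grained locks and extra per-object state increase memory usage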
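
Before relying on any of this, it helps to know which kind of interpreter you are actually running. The sketch below checks both the build flag and the runtime state; `gil_status` is a hypothetical helper, `Py_GIL_DISABLED` is the build-time config variable, and `sys._is_gil_enabled()` is a private CPython function that only exists on 3.13+, so the code guards for its absence.

```python
import sys
import sysconfig

def gil_status():
    """Report how this interpreter was built and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; assume "enabled" elsewhere.
    probe = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = probe() if probe is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}

status = gil_status()
print(status)
```

The two values can differ: a free-threaded build may still run with the GIL turned back on, for example when an incompatible C extension is imported.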

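9.3 Impact on the decision framework

Today (3.12 and earlier, and default GIL builds):
  CPU-bound → multiprocessing
  IO-bound  → threading or asyncio

Once free-threading matures (perhaps 3.16+):
  CPU-bound → threading (finally possible!)
  IO-bound  → threading or asyncio (unchanged)

  But multiprocessing will not disappear:
  - Cases that need process isolation (security boundaries)
  - Cases that need separate memory spaces
  - Cases that need to scale across machines

Current advice: keep using the decision framework in this article. free-threading is worth watching, but until the ecosystem matures (at least 2 to 3 years), do not depend on it in production.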

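Closing Thoughts

Python's concurrency world looks "peculiar" because of one historical design decision: the GIL. Once you understand the constraint the GIL imposes, you understand why Python needs three different concurrency paths. Each one is a different response to the GIL.

Why is Python concurrency harder to learn than Java's?

If you finished this article thinking "why is there so much to learn", you are not imagining it.

A Java developer needs only one mental model of concurrency: ThreadPoolExecutor covers both CPU-bound and IO-bound work, synchronized + JUC covers synchronization, and Java 21's virtual threads make IO-bound work simpler. The learning path is linear (Thread → synchronized → JUC → thread pool → CompletableFuture), and every step builds naturally on the previous one.

A Python developer needs three entirely different models: threading is limited by the GIL and only helps with IO, multiprocessing comes with the pickle and fork/spawn traps, and asyncio requires a different programming paradigm. Each path is its own set of concepts, and they do not mix freely.

Java learning path (linear):                       Python learning path (forked):
                                                   ┌─ threading (oh, the GIL: no good for CPU-bound work)
Thread → synchronized → JUC → thread pool → CF     │
─────────────────────────────────────────────▶    ─┼─ multiprocessing (oh, pickle, plus the fork/spawn traps)
Each step builds naturally on the previous one     │
                                                   └─ asyncio (oh, a whole new paradigm, and async is contagious)
                                                   ──────────────────────────────────────────▶
                                                   Every fork is a brand-new set of concepts

More important still is the "choice tax": a Java developer rarely has to make this kind of selection decision, because one ThreadPoolExecutor covers most scenarios. The very existence of section 6, "A Decision Framework", is evidence of the burden in Python: you must decide whether the bottleneck is CPU or IO before you start, and choosing wrong is costly.

These extra concepts are not Python "features"; they are the "debt" left by the GIL. Once free-threading matures, Python concurrency may return to "one model for everything", but until then the three paths in this article are the map you cannot avoid.

Path	Strategy	Best for
threading	Accept the GIL and use the gaps while waiting on IO	IO-bound work
multiprocessing	Bypass the GIL with parallel processes	CPU-bound work
asyncio	Change paradigm: single-threaded cooperative scheduling	High-concurrency IO

Remember one sentence: the choice depends on whether your bottleneck is CPU or IO. If it is CPU, use multiple processes; if it is IO, use asyncio or a thread pool; if it is both, combine them.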