Understanding DeepSeekMath-V2, Part 2: Method, Verification of Mathematical Proofs

This content is licensed under the CC 4.0 BY-SA license.

Contents

I. Preface
II. DeepSeekMath-V2
2. Method

I. Preface

For reference only; not verified experimentally.

II. DeepSeekMath-V2

Paper title: DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Authors: Zhihong Shao et al.
Institution: DeepSeek-AI
Published: November 27, 2025
GitHub: https://github.com/deepseek-ai/DeepSeek-Math-V2
Paper: https://arxiv.org/pdf/2511.22570

2. Method

2.1. Proof Verification

2.1.1. Training a Verifier to Identify Issues and Score Proofs

We developed high-level rubrics $\mathcal{I}_v$ for proof evaluation (see Appendix A.2) with the goal of training a verifier to evaluate proofs according to these rubrics, mirroring mathematical experts’ assessment process. Specifically, given a problem $X$ and a proof $Y$, the verifier $\pi_\varphi(\cdot|X, Y, \mathcal{I}_v)$ is designed to produce a proof analysis that first summarizes identified issues (if any) and then assigns a score based on three levels: 1 for complete and rigorous proofs with all logical steps clearly justified; 0.5 for proofs with sound overall logic but minor errors or omitted details; and 0 for fundamentally flawed proofs containing fatal logical errors or critical gaps.

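In plainer terms:

We designed a scoring rubric for proofs (see Appendix A.2) so that the verifier grades proofs the way a mathematical expert would. Concretely, given a problem $X$ and a proof $Y$, the verifier first writes an analysis listing any issues it found, and then assigns one of three scores:

1: the proof is complete and rigorous, with every logical step clearly justified;
0.5: the overall approach is sound, but there are minor errors or omitted details;
0: the proof is fundamentally flawed, with a fatal logical error or a critical gap.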

Curating Cold Start RL Data. We constructed our initial training data through the following process:

We crawled problems from Art of Problem Solving (AoPS) contests², prioritizing math olympiads, team selection tests, and post-2010 problems explicitly requiring proofs, totaling 17,503 problems. This problem set is denoted as $\mathcal{D}_p$.
We generated candidate proofs using a variant of DeepSeek-V3.2-Exp-Thinking. As this model was not optimized for theorem proving and tended to produce concise but error-prone outputs, we prompted it to iteratively refine its proofs over multiple rounds to improve comprehensiveness and rigor.
We randomly sampled proofs across diverse problem types (e.g., algebra and number theory) and had mathematical experts score each proof according to the evaluation rubrics described above.

This process yielded an initial RL dataset $\mathcal{D}_v = \{(X_i, Y_i, s_i)\}$, where each item consists of a problem $X_i$, a proof $Y_i$, and an overall proof score $s_i \in \{0, 0.5, 1\}$.
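
To make the shape of $\mathcal{D}_v$ concrete, here is a minimal sketch of one record. The class name, field names, and toy entries are hypothetical; the paper only specifies that each item is a triple $(X_i, Y_i, s_i)$ with $s_i \in \{0, 0.5, 1\}$.

```python
from dataclasses import dataclass

# The three score levels allowed by the rubric described in the text.
VALID_SCORES = (0.0, 0.5, 1.0)


@dataclass(frozen=True)
class VerificationExample:
    """One item (X_i, Y_i, s_i) of D_v. Field names are hypothetical."""

    problem: str  # X_i: the problem statement
    proof: str    # Y_i: a candidate proof
    score: float  # s_i: the expert-annotated overall score

    def __post_init__(self) -> None:
        # Reject anything outside the three rubric levels.
        if self.score not in VALID_SCORES:
            raise ValueError(f"score must be one of {VALID_SCORES}, got {self.score}")


# Toy placeholder entries, not taken from the paper's data.
d_v = [
    VerificationExample("Problem text A", "Candidate proof A", 1.0),
    VerificationExample("Problem text B", "Candidate proof B", 0.5),
]
```

Validating the score at construction time is a design choice of this sketch, so that a mislabeled item fails early instead of silently skewing the score reward.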

RL Objective. Building on a version of DeepSeek-V3.2-Exp-SFT which was supervised fine-tuned on reasoning data related to mathematics and code, we trained the model with reinforcement learning to produce proof analyses using two reward components:

Format reward $R_{\text{format}}$: An indicator function that enforces the model to generate both a summary of identified issues and a proof score, by checking whether the final response contains the key phrase “Here is my evaluation of the solution:” as well as a score within \boxed{} following “Based on my evaluation, the final overall score should be:”.
Score reward $R_{\text{score}}$: Rewards based on proximity between predicted score $s'_i$ and annotated score $s_i$:
$R_{\text{score}}(s'_i, s_i) = 1 - |s'_i - s_i| \quad (1)$
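
The two reward components could be sketched as follows. The two key phrases and equation (1) come from the text; the regular expression, the function names, and the decision to accept only 0, 0.5, or 1 inside \boxed{} are assumptions of this sketch, not the authors' implementation.

```python
import re
from typing import Optional

EVAL_PHRASE = "Here is my evaluation of the solution:"
SCORE_PHRASE = "Based on my evaluation, the final overall score should be:"
ALLOWED_SCORES = (0.0, 0.5, 1.0)

# Assumption: the boxed score follows the score phrase, separated only by whitespace.
_BOXED_SCORE = re.compile(
    re.escape(SCORE_PHRASE) + r"\s*\\boxed\{\s*(\d+(?:\.\d+)?)\s*\}"
)


def extract_score(response: str) -> Optional[float]:
    """Return the boxed score s' if present and valid, else None."""
    match = _BOXED_SCORE.search(response)
    if match is None:
        return None
    value = float(match.group(1))
    return value if value in ALLOWED_SCORES else None


def format_reward(response: str) -> float:
    """Indicator R_format: 1.0 if both required parts are present, else 0.0."""
    has_summary = EVAL_PHRASE in response
    has_score = extract_score(response) is not None
    return 1.0 if has_summary and has_score else 0.0


def score_reward(predicted: float, annotated: float) -> float:
    """Equation (1): R_score = 1 - |s' - s|."""
    return 1.0 - abs(predicted - annotated)
```

With three score levels, `score_reward` can only take the values 1, 0.5, or 0, so a verifier that is one level off still receives partial credit.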

The RL objective for training the verifier is:
$\max_{\pi_\varphi} \mathbb{E}_{(X_i, Y_i, s_i) \sim \mathcal{D}_v, (V'_i, s'_i) \sim \pi_\varphi(\cdot | X_i, Y_i)} [R_{\text{format}}(V'_i) \cdot R_{\text{score}}(s'_i, s_i)] \quad (2)$

where $V'_i$ denotes the verifier’s final response and $s'_i$ is the proof score extracted from it.
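
The expectation in equation (2) can be approximated by averaging the per-sample product of the two rewards over sampled verifier responses. The sketch below assumes each response has already been parsed into a format flag and a predicted score; the tuple layout and function names are hypothetical, and the actual RL algorithm used to maximize this quantity is not described in this section.

```python
from typing import Iterable, Optional, Tuple

# (format_ok, predicted_score, annotated_score); predicted is None if no score was parsed.
Sample = Tuple[bool, Optional[float], float]


def sample_reward(format_ok: bool, predicted: Optional[float], annotated: float) -> float:
    """R_format * R_score for a single sampled response."""
    if not format_ok or predicted is None:
        return 0.0  # R_format = 0 zeroes out the whole product
    return 1.0 - abs(predicted - annotated)


def estimate_objective(samples: Iterable[Sample]) -> float:
    """Monte Carlo estimate of the expectation in equation (2)."""
    rewards = [sample_reward(*s) for s in samples]
    return sum(rewards) / len(rewards) if rewards else 0.0


# Toy samples, all for proofs annotated with s = 1.0.
toy = [(True, 1.0, 1.0), (True, 0.5, 1.0), (False, 1.0, 1.0), (True, 0.0, 1.0)]
print(estimate_objective(toy))  # rewards are 1, 0.5, 0, 0, so the mean is 0.375
```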

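In plainer terms:

This process gives us an initial RL dataset in which every sample contains a problem, a proof, and the overall score assigned by an expert (0, 0.5, or 1).

What is the training objective? We start from a model that has already been fine-tuned on math and code reasoning data (DeepSeek-V3.2-Exp-SFT), then keep training it with reinforcement learning to act as a "referee". The reward has two parts:

Format reward: the model must answer in a fixed format, first summarizing the issues it found and then giving a score. The system checks whether the response contains the phrase “Here is my evaluation of the solution:” and whether a score wrapped in \boxed{} follows “Based on my evaluation, the final overall score should be:”. The reward is granted only when the format is correct.
Score reward: the closer the model's predicted score $s'_i$ is to the expert-annotated score $s_i$, the higher the reward. It is computed as $1 - |s'_i - s_i|$: a difference of 0 earns the full reward of 1, a difference of 0.5 earns 0.5, and a difference of 1 earns 0.

The final training objective is for the model to both follow the format and score accurately; the two rewards are multiplied together to form the quantity being optimized.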

2.1.2. Introducing Meta-Verification to Review Proof Analyses

The approach described in Section 2.1.1 trains proof verification through RL to align predicted proof scores with expert annotations, but provides no direct supervision on the identified issues themselves. This creates a critical vulnerability: when evaluating flawed proofs (where $s_i < 1$) during training, the verifier can receive full reward by predicting the correct scores while hallucinating non-existent issues, undermining its trustworthiness.
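
As a worked illustration of this loophole, take a hypothetical proof annotated $s_i = 0.5$ and a verifier response $V'_i$ that follows the required format, reports an issue that does not actually exist, and outputs $s'_i = 0.5$. The numbers are illustrative and not from the paper. Under equations (1) and (2) the reward is

$$
R_{\text{format}}(V'_i) \cdot R_{\text{score}}(s'_i, s_i) = 1 \cdot \left(1 - |0.5 - 0.5|\right) = 1,
$$

which is exactly the reward a faithful analysis with the same score would receive, so nothing in this reward distinguishes the two.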

To address this problem, we introduce meta-verification: a secondary evaluation process that assesses whether issues identified by the verifier indeed exist and whether these issues logically justify the predicted proof score according to the evaluation rubrics $\mathcal{I}_v$. The complete meta-verification rubrics $\mathcal{I}_{mv}$ are detailed in Appendix A.3.

² https://artofproblemsolving.com/community/c13_contest_collections

We trained a dedicated meta-verifier using RL to perform this evaluation. By incorporating the meta-verifier’s feedback into verifier training, we can improve the faithfulness of the verifier’s issue identification.

Meta-Verifier Training Process

We obtained an initial verifier $\pi_\varphi$ following Section 2.1.1.
Mathematical experts scored the quality of verifier responses according to $\mathcal{I}_{mv}$, creating dataset $\mathcal{D}_{mv} = \{(X_i, Y_i, V_i, ms_i)\}$, where $V_i$ is the analysis of proof $Y_i$ and $ms_i \in \{0, 0.5, 1\}$ is the expert-annotated quality score.
We trained a meta-verifier $\pi_\eta(\cdot|X, Y, V, \mathcal{I}_{mv})$ to analyze the verifier’s proof analysis $V$.

The meta-verifier produces a summary of issues found in the analysis itself, followed by a quality score measuring how accurate and justified the verifier’s analysis is. The RL objective follows the same structure as the verifier training, with format and score rewards.

Using the trained meta-verifier $\pi_\eta$, we enhanced the verifier training by integrating meta-verification feedback into the reward function:
$R_V = R_{\text{format}} \cdot R_{\text{score}} \cdot R_{\text{meta}} \quad (3)$

where $R_{\text{meta}}$ is the quality score from the meta-verifier.
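
Equation (3) can be sketched as a single function. The function name and parameters are hypothetical, and the meta-verifier's quality score is passed in as a plain number, since how $\pi_\eta$ is queried is outside the scope of this sketch.

```python
from typing import Optional


def verifier_reward(
    format_ok: bool,
    predicted: Optional[float],
    annotated: float,
    meta_quality: float,
) -> float:
    """R_V = R_format * R_score * R_meta, following equation (3)."""
    if not format_ok or predicted is None:
        return 0.0  # R_format = 0
    r_score = 1.0 - abs(predicted - annotated)  # equation (1)
    return r_score * meta_quality


# Illustrative values only: both analyses predict the annotated score 0.5,
# but the meta-verifier rates one as fully justified and the other as unjustified.
faithful = verifier_reward(True, 0.5, 0.5, meta_quality=1.0)
unjustified = verifier_reward(True, 0.5, 0.5, meta_quality=0.0)
print(faithful, unjustified)  # 1.0 0.0
```

Because the three factors are multiplied, a correct score no longer earns full reward on its own; the analysis behind it must also be rated highly by the meta-verifier.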

We trained the enhanced verifier on both the verification dataset $\mathcal{D}_v$ and the meta-verification dataset $\mathcal{D}_{mv}$, using the same reward mechanism on $\mathcal{D}_{mv}$ as used for training the meta-verifier. The resulting model can perform both proof verification and meta-verification tasks.
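
One way to picture training on both datasets is a reward function that dispatches on the task a rollout came from. This is a sketch under assumptions: the `Rollout` type, the task labels, and the reading that the meta-verification reward is the product of format and score rewards are not spelled out in the text, which only says the reward has "the same structure" as in verifier training.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rollout:
    """A parsed model response. All field names are hypothetical."""

    task: str                  # "verify" (from D_v) or "meta_verify" (from D_mv)
    format_ok: bool            # whether the response passed the format check
    predicted: Optional[float] # score extracted from the response, if any
    annotated: float           # s_i for "verify", ms_i for "meta_verify"
    meta_quality: float = 1.0  # R_meta; only used for "verify" rollouts


def mixed_reward(r: Rollout) -> float:
    if r.task not in ("verify", "meta_verify"):
        raise ValueError(f"unknown task: {r.task}")
    if not r.format_ok or r.predicted is None:
        return 0.0
    base = 1.0 - abs(r.predicted - r.annotated)  # score reward
    if r.task == "verify":
        return base * r.meta_quality  # equation (3)
    return base  # meta-verification: format and score rewards only
```

For example, `mixed_reward(Rollout("verify", True, 1.0, 1.0, meta_quality=0.5))` gives 0.5, while `mixed_reward(Rollout("meta_verify", True, 0.5, 1.0))` also gives 0.5 but for a different reason: the predicted quality score is one level away from the expert annotation.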

On a validation split of $\mathcal{D}_v$, the average quality score of the verifier’s proof analyses, as evaluated by the meta-verifier, improved from 0.85 to 0.96, while maintaining the same accuracy in proof score prediction.

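In plain terms:

The core problem: the verifier is a "referee", but it can cheat. Suppose a proof really is flawed (the expert gave it 0.5). To collect reward, the verifier can make up a few "issues" and still land on the right score, which earns it full marks. These are hallucinated issues: the problems it reports do not exist.

The solution: bring in a "referee for the referee" (the meta-verifier) whose only job is to review the referee's report:

You say the proof has issue A. Does A really exist?
You gave 0.5. Do the issues you listed really justify a 0.5?

Training pipeline:

Train the initial referee (the verifier).
Have experts score the referee's reports to build the meta-verification training data.
Train the meta-verifier (the referee's referee).
Add the meta-verifier's score to the referee's reward function: the referee must not only follow the format and score accurately, but the issues it reports must also be real and well-founded.

Result: the quality of the verifier's analyses, as rated by the meta-verifier, rose from 0.85 to 0.96, with no loss in scoring accuracy. In short, the referee became more honest and stopped making up issues.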

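Question 1: Summarize

Summary of the proof verification part (Section 2.1) of DeepSeekMath-V2

I. Core goal

Train a reliable AI referee (the verifier) that can, like a mathematical expert:

read a mathematical proof carefully;
find its logical gaps and errors;
give a fair score (1, 0.5, or 0).

The quality of this referee directly determines how far the proof generator can go later: the more reliable the referee, the better the proofs the generator can learn to write.

II. How is the referee trained? (Section 2.1.1)

Step 1: Prepare the training material

Crawl 17,503 proof-based math problems from AoPS contest collections (olympiads, team selection tests, and so on).
Generate candidate proofs with a variant of DeepSeek-V3.2-Exp-Thinking, prompting it to refine its proofs over multiple rounds so they are as rigorous as possible.
Have mathematical experts score randomly sampled proofs according to the rubric to build the initial dataset.

Step 2: Reinforcement learning

The training objective has two rewards, and both must be satisfied at once: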

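| Reward | Requirement | Plain explanation |
| --- | --- | --- |
| Format reward | The output must follow a fixed format: first summarize the issues found, then give a score | Like an exam essay that must follow the required format or it scores zero |
| Score reward | The closer the predicted score is to the expert annotation, the higher the reward | How accurately the referee grades |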