MLE vs. MAP, and a Detailed Math Derivation of How MAP Leads to L1 and L2 Norm Regularization


This content is licensed under CC 4.0 BY-SA.


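This article discusses the difference between maximum likelihood estimation (MLE) and maximum a posteriori (MAP) estimation, noting that MAP, by incorporating prior knowledge, is more reliable when data is limited. As the amount of data grows, MAP approaches MLE. To avoid numerical underflow, computations are usually carried out in log space. A Gaussian prior is equivalent to L2 regularization; a Laplace prior corresponds to L1 regularization.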


MLE vs. MAP

MLE: learn the parameters from the data alone.
MAP: add a prior (existing experience or belief about the parameters) to the model; more reliable when data is limited. As more and more data arrives, the prior matters less.
As the amount of data increases, MAP $\rightarrow$ MLE.
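The limiting behavior can be checked on a small example. The sketch below assumes a coin-flip (Bernoulli) model with a Beta(a, b) prior, neither of which appears in the text above; the function names and the values of `a` and `b` are illustrative only.

```python
def mle_bernoulli(heads, n):
    """MLE of the heads probability: the observed frequency."""
    return heads / n


def map_bernoulli(heads, n, a=5.0, b=5.0):
    """MAP estimate under an assumed Beta(a, b) prior.

    The posterior is Beta(a + heads, b + n - heads); its mode is the
    MAP estimate (valid when both posterior parameters exceed 1).
    """
    return (heads + a - 1.0) / (n + a + b - 2.0)


# Keep the observed frequency fixed at 0.8 and grow the sample size.
for n in (10, 100, 10_000):
    heads = 8 * n // 10
    mle = mle_bernoulli(heads, n)
    map_est = map_bernoulli(heads, n)
    print(f"n={n:>6}  MLE={mle:.4f}  MAP={map_est:.4f}  gap={abs(mle - map_est):.4f}")
```

With few samples the prior pulls the MAP estimate toward 0.5; as `n` grows, the pseudo-counts from the prior are swamped by the data and the gap to the MLE shrinks.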

Notation: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$

Framework:

MLE: $\hat{\theta}_{MLE} = \mathop{\rm arg\,max}_{\theta} P(D \,|\, \theta)$
MAP: $\hat{\theta}_{MAP} = \mathop{\rm arg\,max}_{\theta} P(\theta \,|\, D)$
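To make the link between the two objectives explicit, the MAP objective can be expanded with Bayes' rule. The factorization over samples in the last line assumes the data points are i.i.d. given $\theta$, which the text above does not state explicitly.

$$
\begin{aligned}
\mathop{\rm arg\,max}_{\theta} P(\theta \,|\, D)
&= \mathop{\rm arg\,max}_{\theta} \frac{P(D \,|\, \theta)\, P(\theta)}{P(D)} \\
&= \mathop{\rm arg\,max}_{\theta} P(D \,|\, \theta)\, P(\theta) \qquad \text{($P(D)$ does not depend on $\theta$)} \\
&= \mathop{\rm arg\,max}_{\theta} P(\theta) \prod_{i=1}^{n} P(y_i \,|\, x_i, \theta)
\end{aligned}
$$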

Note that a product of many numbers less than 1 approaches 0 as the number of factors grows, so computing it directly is impractical because of floating-point underflow. Hence, we work in log space instead, where the product becomes a sum; since $\log$ is monotonically increasing, the maximizer is unchanged.
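A minimal sketch of the underflow problem, assuming 200 hypothetical per-sample likelihoods of 0.01 each (the values are made up for illustration):

```python
import math

# 200 hypothetical per-sample likelihood values, each well below 1.
probs = [0.01] * 200

# Direct product: the true value is 1e-400, far below the smallest
# positive double (about 5e-324), so the result underflows to 0.0.
product = 1.0
for p in probs:
    product *= p

# Log space: the product becomes a sum of logs and stays representable.
log_likelihood = sum(math.log(p) for p in probs)

print(product)         # 0.0
print(log_likelihood)  # about -921.03
```

Once the product has collapsed to 0.0, every candidate $\theta$ looks equally bad, while the log-likelihoods can still be compared.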

Comparing the MLE and MAP objectives, the only difference is the inclusion of the prior $P(\theta)$ in MAP; otherwise they are identical. In other words, the likelihood is now weighted by the prior (in log space, the prior contributes an additive $\log P(\theta)$ term).

If the prior is a normal (Gaussian) distribution, then MAP is the same as MLE with an added $L_2$ regularization term.

We assume $\theta \sim \mathcal{N}(0, \sigma^2)$, so that $P(\theta) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{\theta^2}{2\sigma^2}\right)$
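The equivalence with $L_2$ regularization can be checked numerically. The sketch below is not part of the original derivation: it assumes a linear regression model with Gaussian noise of known standard deviation, and all names and values (`noise_std`, `prior_std`, the synthetic data) are hypothetical. Under those assumptions the ridge solution with $\lambda = \text{noise variance} / \sigma^2$ should be a stationary point of the negative log-posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (assumed model: y = X @ theta + noise).
n, d = 50, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
noise_std = 0.3   # assumed known noise standard deviation
prior_std = 1.0   # sigma of the N(0, sigma^2) prior on each parameter
y = X @ theta_true + noise_std * rng.normal(size=n)


def neg_log_posterior_grad(theta):
    """Gradient of -log P(D | theta) - log P(theta), constants dropped."""
    grad_likelihood = X.T @ (X @ theta - y) / noise_std**2
    grad_prior = theta / prior_std**2
    return grad_likelihood + grad_prior


# Ridge regression closed form with lambda = noise variance / prior variance.
lam = noise_std**2 / prior_std**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# If MAP and ridge coincide, the gradient vanishes at the ridge solution.
max_grad = np.abs(neg_log_posterior_grad(theta_ridge)).max()
print(theta_ridge, max_grad)
```

A smaller `prior_std` (a tighter prior around 0) gives a larger `lam`, which is the usual reading of the regularization strength as the confidence placed in the prior.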