PALF: Paxos-Based Replicated Write-Ahead Logging in the OceanBase Distributed Database

PALF: Replicated Write-Ahead Logging for Distributed Databases

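Note: This is an unproofread draft. A proofread version will follow once the paper has been studied in more detail; readers who would rather not wait can read this version in the meantime.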

ABSTRACT

Distributed databases have been widely researched and developed in recent years due to their scalability, availability, and consistency guarantees. The write-ahead logging (WAL) system is one of the most vital components in a database. It is still a non-trivial problem to design a replicated logging system as the foundation of a distributed database with the power of ACID transactions. This paper proposes PALF, a Paxos-backed Append-only Log File System, to address these challenges. The basic idea behind PALF is to co-design the logging system with the entire database for supporting database-specific functions and to abstract the functions as PALF primitives to power other distributed systems. Many database functions, including transaction processing, database restore, and physical standby databases, have been built based on PALF primitives. Evaluation shows that PALF greatly outperforms well-known implementations of consensus protocols and is fully competent for distributed database workloads. PALF has been deployed as a component of the OceanBase 4.0 database and has been made open-source along with it.

1 INTRODUCTION

The write-ahead logging (WAL) system was originally introduced to recover databases to their previous state after a failure. Beyond this initial purpose, more requirements have been gradually emerging from distributed databases. The logging system should be capable of replicating logs to multiple replicas for durability and failure tolerance. Several important database features rely on the design of the logging system, such as transaction processing [9, 50], redo log archiving [35], database backup/restore [37], and physical standby databases [36].

To keep consistent states of multiple database replicas, consensus protocols have been widely used to replicate logs in distributed databases [9, 42, 48]. Most of these databases were abstracted into replicated state machines (RSM) for integrating with consensus protocols. In the typical RSM model, the client first handles all intended operations and generates logs; these logs are then replicated to all replicas by consensus protocols. After operation logs have been persisted by a majority of replicas, each replica applies them to its state machine.

The RSM model has been working well for operations that modify small datasets (e.g., setting a key-value pair in a key-value store). However, it may be unsuitable for operations that involve a large amount of data, such as transactions in distributed databases. First, databases usually need an additional buffer to cache temporary data from clients for log generation, so it is difficult for a database to handle large transactions whose data volume exceeds its cache. A compromise is to limit the size of transactions and break large transactions into small operating units [8], but at the cost of losing the atomicity of users’ original transactions. Second, reads in a transaction may not see previous writes in the same transaction, because the writes may not yet have been applied to the database [9]. Reading from the cache is a possible approach, but it introduces the overhead of merging data from the storage engine and the cache, which decreases read performance.
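
The second problem can be made concrete with a toy model. The sketch below is not taken from any real database: `RsmTransaction`, its methods, and the `replicate` callback are hypothetical names, and consensus is reduced to a function that returns `True`. It only illustrates that, when writes are buffered until their log is replicated, a read that goes to storage alone misses the transaction's own writes, and a correct read has to merge the buffer with storage.

```python
class RsmTransaction:
    """Toy transaction under the RSM ordering: replicate the log first, apply later."""

    def __init__(self, storage):
        self.storage = storage  # state machine: holds only applied data
        self.buffer = {}        # uncommitted writes waiting to be turned into logs

    def write(self, key, value):
        # Nothing reaches storage before the log has been replicated.
        self.buffer[key] = value

    def read_storage_only(self, key):
        return self.storage.get(key)

    def read_merged(self, key):
        # Every read pays an extra buffer lookup: the merge overhead.
        if key in self.buffer:
            return self.buffer[key]
        return self.storage.get(key)

    def commit(self, replicate):
        log = list(self.buffer.items())
        if not replicate(log):  # stand-in for persisting on a majority
            return False
        for key, value in log:  # apply only after replication succeeded
            self.storage[key] = value
        self.buffer.clear()
        return True


storage = {"balance": 100}
txn = RsmTransaction(storage)
txn.write("balance", 50)
stale = txn.read_storage_only("balance")   # misses the transaction's own write
merged = txn.read_merged("balance")        # sees it, at the cost of a merge
committed = txn.commit(lambda log: True)   # assume replication succeeds
```

The buffer in this sketch is also where the first problem shows up: it has to hold every write of the transaction until commit.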

To address the above problems, our design choice is to integrate consensus protocols into the write-ahead logging model. In the WAL model, a database writes logs using local file system interfaces, and the order of logging and applying operations can be reversed compared to the RSM model. Writes are applied directly to the storage engine of the database (the in-memory state machine), and then redo logs (operations) are generated and flushed. Therefore, the upper limit of transaction size is expanded, and read requests only need to access the storage engine. However, designing a replicated logging system that provides guarantees like those of a local file system to support the WAL model still faces the following challenges:

Leader Election. In practical deployments, the database leader is usually co-located with the logging system leader to reduce latency [9, 46, 48]. The requirements of the database should be considered when electing the leader of the logging system; for example, a replica located in the same region/IDC as the application system should be elected as the leader preferentially. However, whether a replica can be elected as the leader traditionally depends on the consensus protocol itself. A leadership transfer extension has been proposed in Raft [33], but it relies on an external coordinator to actively transfer leadership to the designated replica, which harms the availability of distributed databases.

Uncertain Replication Results. In the WAL model, whether a transaction should be committed or aborted depends on whether its commit record has been persisted. Data may become inconsistent if the logging system returns an incorrect result to the transaction model due to exceptions (e.g., leadership transfer). A local file system does return explicit write results. However, most consensus protocol implementations do not return explicit replication results when exceptions occur [2, 15]. For example, suppose a previous leader has been demoted to a follower because of a temporary network error. If it had not received acknowledgements for some in-flight logs before stepping down, it cannot tell whether those logs have been committed by a majority. Transaction processing may therefore get stuck, because the transaction engine cannot determine whether its commit record has been persisted.

Data Change Synchronization. The log is the database: physical log synchronization is one of the most common approaches to exporting data changes from the database to downstream systems. For example, physical standby databases (e.g., Oracle Data Guard [36]) provide identical copies of the primary database by transporting redo logs to standby databases and applying them there. Unlike copying log files directly, log replication in distributed databases poses challenges in synchronizing logs from one replication group in the primary database to a downstream group in a standby database; moreover, these groups should be independently available. Some replication protocols [2, 15] embed cluster-specific information (e.g., membership) into logs, which breaks the continuity of data changes and makes the downstream replication groups unable to reconfigure the cluster independently.

Performance. For many log replication systems, throughput of a single replication group is limited. As a result, they resort to multiple groups to improve overall throughput by parallel writing [13, 15, 31]. However, numerous replication groups may incur additional overheads. A data partition in a database is usually bound with a replication group [9, 42, 46]; more replication groups imply smaller data partitions. This will result in more distributed transactions and degrade performance of the entire database [41].
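
Returning to the WAL ordering introduced before the four challenges, the following minimal sketch shows "apply first, log afterwards". All names (`WalTransaction`, `memtable`, the tuple-shaped log records) are assumptions made for illustration; the plain list stands in for a replicated log, and concurrency control such as MVCC is ignored entirely.

```python
_MISSING = object()  # marks "key did not exist before this write"


class WalTransaction:
    """Toy transaction under the WAL ordering: apply to storage, then log."""

    def __init__(self, memtable, log):
        self.memtable = memtable  # stand-in for the in-memory storage engine
        self.log = log            # append-only list standing in for the log
        self.undo = []            # (key, previous value) pairs for rollback

    def write(self, key, value):
        self.undo.append((key, self.memtable.get(key, _MISSING)))
        self.memtable[key] = value             # 1. apply to the storage engine
        self.log.append(("redo", key, value))  # 2. then generate the redo log

    def read(self, key):
        # Reads only touch the storage engine; there is no buffer to merge.
        return self.memtable.get(key)

    def commit(self):
        self.log.append(("commit",))

    def rollback(self):
        # Undo the already-applied writes in reverse order.
        for key, old in reversed(self.undo):
            if old is _MISSING:
                self.memtable.pop(key, None)
            else:
                self.memtable[key] = old
        self.undo.clear()
        self.log.append(("abort",))


memtable = {"a": 1}
log = []
txn = WalTransaction(memtable, log)
txn.write("a", 2)
txn.write("b", 3)
seen = txn.read("a")  # the transaction sees its own write directly
txn.rollback()        # e.g. the commit record could not be persisted
```

Because writes are already applied, the outcome of persisting the commit record decides between `commit` and `rollback`, which is why an explicit replication result matters in this model.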

This paper presents PALF, a Paxos-backed Append-only Log File System. PALF has been co-designed with the OceanBase database to support its WAL model. It provides typical append-only logging interfaces, so the database can interact with PALF much as it interacts with local files. PALF further abstracts database-specific features into primitives. Such a clear boundary between the log and the database improves maintainability for a practical database system and makes PALF a building block for constructing higher-level distributed systems. These design choices led us to address the above challenges by balancing the particularity of databases and the generality of logging systems.

First, PALF decouples leader election from the consensus protocol to support database-related election priorities. For instance, a database replica that is closer to upper-layer applications could be elected as the leader by configuring its election priorities. As a result of independent election, a log reconfirmation stage is introduced to PALF for correctness.
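
A rough sketch of why independent election calls for reconfirmation: if the leader is chosen by a configured priority rather than by log length, the winner may hold fewer logs than its peers and has to catch up before serving writes. The code below is an illustration under strong assumptions, not PALF's election or reconfirmation algorithm: `Replica`, `priority`, `elect_leader`, and `reconfirm` are hypothetical, every replica's log is assumed to be a prefix of the longest one, and terms, leases, and failures are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    priority: int  # hypothetical knob: higher means preferred as leader
    log: list = field(default_factory=list)


def elect_leader(replicas):
    """Pick the replica with the highest configured priority, ignoring log length."""
    return max(replicas, key=lambda r: r.priority)


def reconfirm(leader, majority):
    """Copy the entries the new leader is missing from the longest log in a majority."""
    longest = max(majority, key=lambda r: len(r.log))
    leader.log.extend(longest.log[len(leader.log):])
    return leader.log


a = Replica("a", priority=1, log=["x", "y", "z"])  # most logs, least preferred
b = Replica("b", priority=10, log=["x"])           # closest to the application
c = Replica("c", priority=5, log=["x", "y"])

leader = elect_leader([a, b, c])
log_after = reconfirm(leader, majority=[b, c])  # any 2 of 3 replicas form a majority
```

In this example `"z"` exists only on replica `a`, so it was never persisted by a majority and the new leader is not required to learn it from the majority `{b, c}`.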

Second, PALF returns explicit replication results to the log writer (database) unless its leader crashes, which makes PALF act like a local file. The log writer (database) will be notified of whether logs have been committed by PALF, even if the previous leader has lost its leadership. To achieve this, a novel role transition stage, pending follower, has been introduced into the consensus protocol to determine the status of pending logs; the role of the previous leader will not be switched to follower until it receives logs from the new leader. After that, the state of transactions can be advanced. For example, the previous leader will roll back a transaction if its commit record has not been persisted by the new leader.
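
The role transition described above can be sketched as a small state machine. This is a simplified illustration, not PALF's protocol: the class and method names are invented, log entries are compared by value instead of by sequence number and term, and the new leader's log is passed in as a plain list.

```python
from enum import Enum


class Role(Enum):
    LEADER = "leader"
    PENDING_FOLLOWER = "pending_follower"
    FOLLOWER = "follower"


class PreviousLeader:
    def __init__(self, committed, in_flight):
        self.role = Role.LEADER
        self.committed = list(committed)
        self.in_flight = list(in_flight)  # appended, but no majority ack yet
        self.results = {}                 # explicit result per pending log

    def lose_leadership(self):
        # Stay in the intermediate role while some results are still unknown.
        self.role = Role.PENDING_FOLLOWER if self.in_flight else Role.FOLLOWER

    def receive_from_new_leader(self, leader_log):
        # The new leader's log settles the fate of every pending entry.
        for entry in self.in_flight:
            self.results[entry] = "committed" if entry in leader_log else "failed"
        self.committed = list(leader_log)
        self.in_flight.clear()
        self.role = Role.FOLLOWER


prev = PreviousLeader(committed=["t1-commit"], in_flight=["t2-commit", "t3-commit"])
prev.lose_leadership()
role_while_waiting = prev.role
prev.receive_from_new_leader(["t1-commit", "t2-commit"])
```

With these inputs the database would be told that transaction `t2` committed and that `t3` must be rolled back, instead of being left without an answer.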

Moreover, to synchronize data changes between distributed databases, a downstream Paxos group has been abstracted as a mirror of the primary Paxos group. It only accepts logs from the primary group and can be reconfigured independently. This feature has been used to synchronize redo logs from the primary database to standby databases in OceanBase. To the best of the authors’ knowledge, this is the first Paxos implementation that supports synchronizing proposals from one Paxos group to another group.
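
The two properties named here (a mirror accepts logs only from its primary, and it can be reconfigured on its own) can be illustrated with a toy model. `LogGroup`, `mirror_of`, and `sync_from_primary` are hypothetical names, replication inside each group is omitted, and membership is assumed to live outside the log, which is what lets the two groups change members independently.

```python
class LogGroup:
    """Toy replication group; a group with mirror_of set acts as a mirror."""

    def __init__(self, members, mirror_of=None):
        self.members = set(members)
        self.mirror_of = mirror_of  # None for a primary group
        self.log = []

    def append(self, record):
        if self.mirror_of is not None:
            raise PermissionError("a mirror only accepts logs from its primary group")
        self.log.append(record)

    def sync_from_primary(self):
        if self.mirror_of is None:
            raise ValueError("only a mirror group can sync from a primary")
        new_records = self.mirror_of.log[len(self.log):]
        self.log.extend(new_records)
        return len(new_records)

    def reconfigure(self, members):
        # Membership is kept out of the log, so the data stream stays identical.
        self.members = set(members)


primary = LogGroup({"p1", "p2", "p3"})
standby = LogGroup({"s1", "s2", "s3"}, mirror_of=primary)

primary.append("redo-1")
primary.append("redo-2")
synced = standby.sync_from_primary()
standby.reconfigure({"s1", "s2", "s4"})  # does not touch the primary group

try:
    standby.append("local-write")
    rejected = False
except PermissionError:
    rejected = True
```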

Finally, to reduce the overhead incurred by distributed transactions, we limit the number of log replication groups to the number of servers in a cluster. Fewer replication groups require higher throughput for a single group because it handles logs from multiple partitions. We maximize write performance with systematic optimizations such as pipeline replication, adaptive group replication, and lock-free write path.

To summarize, the contributions of this paper are:

• PALF is proposed as the replicated write-ahead logging system of OceanBase. Its high availability, excellent performance, and file-like interfaces are suitable for distributed databases (§3).
• We abstract database-specific demands as PALF primitives, such as explicit replication results and change sequence number, which greatly benefit the OceanBase database (§4).
• A novel method has been proposed to synchronize logs from a Paxos group to others, which powers functions such as physical standby databases (§5).
• We describe designs for building a high-performance consensus protocol in §6 and discuss PALF’s design considerations in §7. Evaluations under both closed-loop clients and database workloads show excellent performance (§8).

2 BACKGROUND

This section briefly describes the architecture of the OceanBase database to provide context for how PALF is designed.

2.1 OceanBase Database

OceanBase [46] is a distributed relational database system built on a shared-nothing architecture. The main design goals of OceanBase include compatibility with classical RDBMS, scalability, and fault tolerance. OceanBase supports ACID transactions, redo log archiving, backup and restore, physical standby databases, and many other functions. For efficient data writing, a storage engine based on log-structured merge tree (LSM-tree) [38] has been built from the ground up and co-designed with the transaction engine. The transaction engine ensures ACID properties by using a combination of pessimistic record-level locks [45] and multi-version concurrency control; it is also highly optimized for the shared-nothing architecture. For example, the commit latency of distributed transactions has been reduced to almost only one round of interaction by an improved two-phase commit procedure [46]. OceanBase relies on a Paxos-based write-ahead logging system to tolerate failures. This brings the benefits of distributed systems, but incurs log replication overhead at the same time.

2.2 Redesigned Architecture

In the previous versions of OceanBase [46] (1.0-3.0), the basic unit of transaction processing, logging, and data storage was the table partition. As an increasing number of applications have adopted OceanBase, we found that the previous architecture is not as well suited to medium and small enterprises as to large-scale clusters of large companies. One of the problems is the overhead of log replication. OceanBase enables users to create tens of thousands of partitions in each server. This many Paxos groups consume significant resources for no real purpose, raising the bar for deployments and operations. Another challenge is the huge transaction problem. One such transaction may span tens of thousands of partitions, which means that there are tens of thousands of participants in the two-phase commit protocol; this destabilizes the system and sacrifices performance.

To address these challenges, the internal architecture of version 4.0 of the OceanBase database was redesigned [47]. A new component, Stream, has been proposed, which consists of several data partitions, a replicated write-ahead logging system, and a transaction engine. The key insight of the Stream is that tables in a database are still partitioned, but the basic unit of transaction and logging is a set of partitions in a Stream, rather than a single partition. A table partition simply represents a piece of data stored in the storage engine. The transaction engine generates redo logs for recording modifications of multiple partitions within a Stream and stores logs in the WAL of the Stream. Multiple replicas of a Stream are created on different servers. Only one of them will be elected as the leader and serve data writing requests. The number of replication groups in a cluster can be reduced to the number of servers to eliminate the overhead incurred by massive replication groups.
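
A back-of-the-envelope sketch of what this regrouping changes for a huge transaction. The numbers and the round-robin placement of partitions onto Streams are assumptions chosen for illustration, not OceanBase's placement policy; the point is only that two-phase commit participants are counted per replication group, so grouping partitions into one Stream per server caps the participant count at the number of servers.

```python
NUM_SERVERS = 3          # assumed cluster size
NUM_PARTITIONS = 30_000  # assumed number of table partitions

# Hypothetical placement: partition p belongs to Stream (p mod NUM_SERVERS).
partition_to_stream = {p: p % NUM_SERVERS for p in range(NUM_PARTITIONS)}

# A huge transaction that modifies the first 10,000 partitions.
touched = range(10_000)

# Previous architecture: one replication group, and one participant, per partition.
old_groups = NUM_PARTITIONS
old_participants = len(set(touched))

# Stream architecture: one replication group per Stream, one participant per Stream.
new_groups = len(set(partition_to_stream.values()))
new_participants = len({partition_to_stream[p] for p in touched})
```

Under these assumptions the same transaction goes from 10,000 participants to 3, and the cluster from 30,000 replication groups to 3.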