
Introduce atomic slot migration#1949

Merged
madolson merged 125 commits into valkey-io:unstable from enjoy-binbin:slot_migration_murphyjacob4 on Aug 12, 2025

@murphyjacob4 murphyjacob4 commented Apr 13, 2025

Contributor

Introduces a new family of commands for migrating slots via replication. The procedure is driven by the source node which pushes an AOF formatted snapshot of the slots to the target, followed by a replication stream of changes on that slot (a la manual failover).

This solution is an adaptation of the solution provided by @enjoy-binbin, combined with the solution I previously posted at #1591, modified to meet the designs we had outlined in #23.

New commands

  • CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE node-id: Begin sending the slot via replication to the target. Multiple targets can be specified by repeating SLOTSRANGE ... NODE ...
  • CLUSTER CANCELMIGRATION ALL: Cancel all slot migrations
  • CLUSTER GETSLOTMIGRATIONS: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model. Later, "two phase" (e.g. slot level replicate/failover commands) can be added with the same core.
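
To make the `CLUSTER MIGRATESLOTS` syntax above concrete, here is a minimal sketch of a client-side helper that builds the command text for one target node. The `slot_range` type and `format_migrateslots` function are hypothetical names for illustration and are not part of Valkey; only the command grammar comes from this PR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper type (not a Valkey struct): one inclusive slot range. */
typedef struct {
    int start;
    int end;
} slot_range;

/* Formats "CLUSTER MIGRATESLOTS SLOTSRANGE s e [s e]... NODE <node-id>".
 * Returns the text length, or -1 on an invalid range or a too-small buffer. */
int format_migrateslots(char *buf, size_t cap, const slot_range *ranges,
                        size_t count, const char *node_id) {
    assert(buf != NULL && ranges != NULL && node_id != NULL);
    if (count == 0) return -1;

    int n = snprintf(buf, cap, "CLUSTER MIGRATESLOTS SLOTSRANGE");
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < count; i++) {
        /* Cluster hash slots are numbered 0..16383. */
        if (ranges[i].start < 0 || ranges[i].end > 16383 ||
            ranges[i].start > ranges[i].end) {
            return -1;
        }
        n = snprintf(buf + used, cap - used, " %d %d", ranges[i].start, ranges[i].end);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }

    n = snprintf(buf + used, cap - used, " NODE %s", node_id);
    if (n < 0 || (size_t)n >= cap - used) return -1;
    return (int)(used + (size_t)n);
}
```

Migrating to several targets in one call would repeat the `SLOTSRANGE ... NODE ...` group; the sketch handles a single group only.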

Slot migration jobs

Introduces the concept of a slot migration job. While active, a job tracks a connection created by the source to the target over which the contents of the slots are sent. This connection is used for control messages as well as replicated slot data. Each job is given a 40 character random name to help uniquely identify it.

All jobs, including those that finished recently, can be observed using the CLUSTER GETSLOTMIGRATIONS command.

Replication

  • Since the snapshot uses AOF, the snapshot can be replayed verbatim to any replicas of the target node.
  • We use the same proxying mechanism used for chaining replication to copy the content sent by the source node directly to the replica nodes.

CLUSTER SYNCSLOTS

To coordinate the state machine transitions across the two nodes, a new command is added, CLUSTER SYNCSLOTS, that performs this control flow.

Each end of the slot migration connection is expected to install a read handler in order to handle CLUSTER SYNCSLOTS commands:

  • ESTABLISH: Begins a slot migration. Provides slot migration information to the target and authorizes the connection to write to unowned slots.
  • SNAPSHOT-EOF: appended to the end of the snapshot to signal that the snapshot is done being written to the target.
  • PAUSE: informs the source node to pause whenever it gets the opportunity
  • PAUSED: added to the end of the client output buffer when the pause is performed. The pause is only performed after the buffer shrinks below a configurable size
  • REQUEST-FAILOVER: requests that the source either grant or deny a failover for the slot migration. The failover is granted only if the target is still paused. Once a failover is granted, the pause is refreshed for a short duration
  • FAILOVER-GRANTED: sent to the target to inform that REQUEST-FAILOVER is granted
  • ACK: heartbeat command used to ensure liveness
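
As an illustration of the subcommand list above, the following sketch maps a `CLUSTER SYNCSLOTS` subcommand name to an enum value that a read handler could switch on. The enum, its constant names, and `parse_syncslots_sub` are assumptions made for this example, as is the case-insensitive match; the actual parsing code in the PR may look different.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h> /* strcasecmp (POSIX) */

/* Hypothetical enum: one value per SYNCSLOTS subcommand listed above. */
typedef enum {
    SYNCSLOTS_ESTABLISH,
    SYNCSLOTS_SNAPSHOT_EOF,
    SYNCSLOTS_PAUSE,
    SYNCSLOTS_PAUSED,
    SYNCSLOTS_REQUEST_FAILOVER,
    SYNCSLOTS_FAILOVER_GRANTED,
    SYNCSLOTS_ACK,
    SYNCSLOTS_UNKNOWN
} syncslots_sub;

/* Looks up a subcommand by name; unknown names map to SYNCSLOTS_UNKNOWN. */
syncslots_sub parse_syncslots_sub(const char *name) {
    static const struct {
        const char *name;
        syncslots_sub sub;
    } table[] = {
        {"ESTABLISH", SYNCSLOTS_ESTABLISH},
        {"SNAPSHOT-EOF", SYNCSLOTS_SNAPSHOT_EOF},
        {"PAUSE", SYNCSLOTS_PAUSE},
        {"PAUSED", SYNCSLOTS_PAUSED},
        {"REQUEST-FAILOVER", SYNCSLOTS_REQUEST_FAILOVER},
        {"FAILOVER-GRANTED", SYNCSLOTS_FAILOVER_GRANTED},
        {"ACK", SYNCSLOTS_ACK},
    };
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcasecmp(name, table[i].name) == 0) return table[i].sub;
    }
    return SYNCSLOTS_UNKNOWN;
}
```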

Interactions with other commands

  • FLUSHDB on the source node (which flushes the migrating slot) will result in the source dropping the connection, which will flush the slot on the target and reset the state machine back to the beginning. The subsequent retry should succeed very quickly (the slot is now empty)
  • FLUSHDB on the target will fail the slot migration. We can iterate with better handling, but for now it is expected that the operator would retry.
  • Generally, FLUSHDB is expected to be executed cluster wide, so preserving partially migrated slots doesn't make much sense
  • SCAN and KEYS are filtered to avoid exposing importing slot data

Error handling

  • For any transient connection drops, the migration will fail and the user must retry.
  • If there is an OOM while reading from the import connection, we will fail the import, which will drop the importing slot data
  • If there is a client output buffer limit reached on the source node, it will drop the connection, which will cause the migration to fail
  • If at any point the export loses ownership or either node is failed over, a callback will be triggered on both ends of the migration to fail the import. The import will not reattempt with a new owner
  • The two ends of the migration are routinely pinging each other with SYNCSLOTS ACK messages. If at any point there is no interaction on the connection for longer than repl-timeout, the connection will be dropped, resulting in migration failure
  • If a failover happens, we will drop keys in all unowned slots. The migration does not persist through failovers and would need to be retried on the new source/target.

State machine

                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘                                                 
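
The target/importing diagram above can also be read as a transition function. The sketch below uses the state names from the diagram, while the `import_event` enum and `import_next` function are hypothetical. It treats SLOT_MIGRATION_JOB_SUCCESS as terminal, which is a simplification, and folds all six error conditions into one `EV_ERROR` event.

```c
#include <assert.h>

typedef enum {
    SLOT_IMPORT_WAIT_ACK,
    SLOT_IMPORT_RECEIVE_SNAPSHOT,
    SLOT_IMPORT_WAITING_FOR_PAUSED,
    SLOT_IMPORT_FAILOVER_REQUESTED,
    SLOT_IMPORT_FAILOVER_GRANTED,
    SLOT_MIGRATION_JOB_SUCCESS,
    SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP,
    SLOT_MIGRATION_JOB_FAILED
} import_state;

/* Hypothetical event names, one per labelled edge in the diagram. */
typedef enum {
    EV_ACK,
    EV_SNAPSHOT_EOF,
    EV_PAUSED,
    EV_FAILOVER_GRANTED,
    EV_TAKEOVER_PERFORMED,
    EV_CLEANUP_DONE,
    EV_ERROR
} import_event;

/* Returns the next state; events that do not apply leave the state unchanged. */
import_state import_next(import_state s, import_event e) {
    if (s == SLOT_MIGRATION_JOB_SUCCESS || s == SLOT_MIGRATION_JOB_FAILED) return s;
    if (s == SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP) {
        /* Unowned slots are cleaned up before the job is marked failed. */
        return e == EV_CLEANUP_DONE ? SLOT_MIGRATION_JOB_FAILED : s;
    }
    if (e == EV_ERROR) return SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP;

    switch (s) {
    case SLOT_IMPORT_WAIT_ACK:
        return e == EV_ACK ? SLOT_IMPORT_RECEIVE_SNAPSHOT : s;
    case SLOT_IMPORT_RECEIVE_SNAPSHOT:
        return e == EV_SNAPSHOT_EOF ? SLOT_IMPORT_WAITING_FOR_PAUSED : s;
    case SLOT_IMPORT_WAITING_FOR_PAUSED:
        return e == EV_PAUSED ? SLOT_IMPORT_FAILOVER_REQUESTED : s;
    case SLOT_IMPORT_FAILOVER_REQUESTED:
        return e == EV_FAILOVER_GRANTED ? SLOT_IMPORT_FAILOVER_GRANTED : s;
    case SLOT_IMPORT_FAILOVER_GRANTED:
        return e == EV_TAKEOVER_PERFORMED ? SLOT_MIGRATION_JOB_SUCCESS : s;
    default:
        return s;
    }
}
```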

Closes #23.

Co-authored-by: Binbin binloveplay1314@qq.com

1. Define new structure slotRange and clusterSlotSyncLink;
2. Add CLUSTER SLOTLINK command to manage all the slot sync links.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1. Add CLUSTER SLOTSYNC/SLOTSYNCFORCE command to trigger slot sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1. Extend the SYNC command to let it specify the slot ranges;
2. Enable filtering of keys in the specified slots when generating the RDB;
3. Implement the handshake process before the RDB transfer for slot sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1. Implement the rdb transfer and loading for slot sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1. Enable filtering of commands in the specified slots when feeding slaves;
2. Implement the message exchange for slot sync;
3. Add clusterSlotSyncCron() to handle time events for slot sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1. Add CLUSTER FAILOVER command to trigger slot failover;
2. Implement the process of slot failover.

Signed-off-by: Binbin <binloveplay1314@qq.com>
1. Improve delDbKeysInSlot() to support a time limit;
2. Implement pending deletion of slots.

Signed-off-by: Binbin <binloveplay1314@qq.com>
…ot RDBs

In slot sync, we can load multiple slot RDBs from different nodes.
Slot RDBs contain functions, which may hit an `already exists`
error when loading, since every node has the function data. This
is a limitation of our slot RDB design.

This commit adds a new RDBFLAGS_SLOT_SYNC flag, a hint that we are
loading a slot RDB. With this flag, loading a function is treated
like FUNCTION LOAD REPLACE, so there are no errors and the function
loaded last wins (which is not a problem at the moment, since all
functions are the same across the cluster).
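
A minimal sketch of the "treat it like FUNCTION LOAD REPLACE" behaviour described above, using a toy in-memory registry. Everything here is hypothetical: the registry types, `load_library`, and the flag bit `RDBFLAGS_SLOT_SYNC_SKETCH` only stand in for the real RDBFLAGS_SLOT_SYNC, whose actual value and call sites are not shown in this PR text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_LIBS 8
#define RDBFLAGS_SLOT_SYNC_SKETCH (1 << 0) /* illustrative bit, not the real value */

typedef struct {
    char name[32];
    char code[64];
} lib;

typedef struct {
    lib libs[MAX_LIBS];
    int count;
} lib_registry;

/* Returns 0 on success, -1 on "already exists" or when the registry is full. */
int load_library(lib_registry *r, const char *name, const char *code, int rdbflags) {
    bool replace = (rdbflags & RDBFLAGS_SLOT_SYNC_SKETCH) != 0;
    assert(r != NULL && name != NULL && code != NULL);

    for (int i = 0; i < r->count; i++) {
        if (strcmp(r->libs[i].name, name) == 0) {
            if (!replace) return -1; /* plain load: duplicate is an error */
            /* Slot sync load: the copy that arrives last wins. */
            snprintf(r->libs[i].code, sizeof(r->libs[i].code), "%s", code);
            return 0;
        }
    }
    if (r->count == MAX_LIBS) return -1;
    snprintf(r->libs[r->count].name, sizeof(r->libs[r->count].name), "%s", name);
    snprintf(r->libs[r->count].code, sizeof(r->libs[r->count].code), "%s", code);
    r->count++;
    return 0;
}
```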

Signed-off-by: Binbin <binloveplay1314@qq.com>
We have three nodes. A and B are doing the slot sync: A is the src
node and B is the dst node. Before we run CLUSTER SLOTFAILOVER,
we add a new node C, which is a replica of B.

At this point, if the connection between A and B breaks, then from
B's point of view freeClient will call onSlotSyncClientClose to
reset the link, and B will then try a new slot SYNC in cron. B will
call delkeysNotOwnedByMySelf to delete all the keys in that slot,
which propagates to C, so C becomes an empty DB.

When B does the new slot SYNC, A generates a new slot RDB and B
reloads it. But this new RDB file is not propagated to node C,
because the master-replica connection between B and C is healthy,
which results in data loss on C.

In this commit, after the master node loads the slot RDB, we
disconnect all slave nodes and let them fully synchronize (we
don't support slot psync).

Signed-off-by: Binbin <binloveplay1314@qq.com>
In the SET command, the expire time is rewritten into a new
robj, which is usually INT encoded:
```
        /* Propagate as SET Key Value PXAT millisecond-timestamp if there is
         * EX/PX/EXAT/PXAT flag. */
        robj *milliseconds_obj = createStringObjectFromLongLong(milliseconds);
        rewriteClientCommandVector(c, 5, shared.set, key, val, shared.pxat, milliseconds_obj);
```

If we are resharding at that point, the server will crash.
That is because we call isCommandInSlotRanges to check the slot,
which calls getKeysFromCommand to get the keys from the command
argv. The milliseconds_obj is INT encoded, but the code in
setGetKeys treats it as a STRING, so the server crashes.

In this fix, isCommandInSlotRanges decodes any INT it finds
into a STRING.

In addition, we added a new debug enable-debug-assert so that
we can cover the isCommandInSlotRanges function in our
TCL tests.
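
To illustrate the decoding step, here is a toy model of an object that is either a raw string or an INT-encoded number, with a helper that always produces the string form before key extraction. The `obj` type and `obj_as_string` are hypothetical stand-ins and do not reflect the real robj layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { ENC_RAW, ENC_INT } encoding;

/* Toy object: a string pointer or an integer, depending on the encoding. */
typedef struct {
    encoding enc;
    union {
        const char *str;
        long long num;
    } v;
} obj;

/* Writes the string form of o into buf and returns its length (as snprintf
 * does). Reading v.str from an INT-encoded object would be the bug. */
int obj_as_string(const obj *o, char *buf, size_t cap) {
    assert(o != NULL && buf != NULL);
    if (o->enc == ENC_INT) return snprintf(buf, cap, "%lld", o->v.num);
    return snprintf(buf, cap, "%s", o->v.str);
}
```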

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Take expire as an example. If the target node deletes an
expired key, or a key becomes logically expired while the
slot RDB is loading, the source node may still propagate an
expire command for it after the slot RDB is loaded. These
expire commands have no effect on the logically expired
keys, resulting in data loss.

During the slot migration process, keys are therefore not
considered expired in the expiration check of commands
propagated by the source node.
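
The rule can be sketched as an expiration check that takes the origin of the command into account. The function name and parameters are hypothetical; the sketch only assumes that a negative timestamp means "no TTL".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the key should be treated as expired for this command.
 * from_migration_source is true for commands replayed from the source node
 * during slot migration; those never see a key as logically expired. */
bool key_is_expired(int64_t expire_at_ms, int64_t now_ms, bool from_migration_source) {
    if (expire_at_ms < 0) return false;      /* no TTL set */
    if (from_migration_source) return false; /* the source decides expiry */
    return now_ms > expire_at_ms;
}
```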

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
The reply will mess up the slot replication.

Signed-off-by: Binbin <binloveplay1314@qq.com>
This causes the source node to not trim the reply,
causing the querybuf of the target node to grow too
large, resulting in the target node being disconnected
due to the querybuf limit (1G).

Signed-off-by: Binbin <binloveplay1314@qq.com>
We need to make sure this block is only executed on
the source node; otherwise, if the target node calls it,
the slot failover will be reset.

Signed-off-by: Binbin <binloveplay1314@qq.com>
…flags

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
This can happen when multiple targets are doing slot sync at
the same time. With disk-based slot RDB replication,
the replicationSetupReplicaForFullResync call here may result
in sending the slot RDB to a different target or to a normal
replica, or in sending a normal RDB to a target.

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
If all cluster nodes have functions, slot migration will fail
since the target will return the function already exists error
when doing the FUNCTION LOAD.

And in addition, the target's replica could panic when it executes
the FUNCTION LOAD propagated from the primary (see
propagation-error-behavior).

Introduced in #1949.

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
…ous reading of auth response (#2494)

In the old SLOT_EXPORT_AUTHENTICATING state added in #1949, the source node
sends the AUTH command and then reads the response synchronously. If the target node is blocked
during this process, the source node is blocked as well. We should use a read handler
instead, so we split SLOT_EXPORT_AUTHENTICATING into SLOT_EXPORT_SEND_AUTH and
SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue.
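
A rough sketch of why the split helps: each step returns immediately, and the read side simply stays in its state until the event loop reports data. The state names come from this PR; the `auth_step` function and its `reply` parameter (NULL meaning "nothing readable yet") are hypothetical, and the actual socket writes and reads are omitted.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    SLOT_EXPORT_SEND_AUTH,
    SLOT_EXPORT_READ_AUTH_RESPONSE,
    SLOT_EXPORT_SEND_ESTABLISH,
    SLOT_MIGRATION_JOB_FAILED
} export_state;

/* One non-blocking step of the AUTH handshake on the source node. */
export_state auth_step(export_state s, const char *reply) {
    switch (s) {
    case SLOT_EXPORT_SEND_AUTH:
        /* AUTH would be written here; do not wait for the reply. */
        return SLOT_EXPORT_READ_AUTH_RESPONSE;
    case SLOT_EXPORT_READ_AUTH_RESPONSE:
        if (reply == NULL) return s; /* no data yet: stay put, never block */
        /* RESP simple strings start with '+', errors with '-'. */
        return reply[0] == '+' ? SLOT_EXPORT_SEND_ESTABLISH : SLOT_MIGRATION_JOB_FAILED;
    default:
        return s;
    }
}
```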

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
When we added atomic slot migration in #1949, we reused a lot of RDB save code.
It was the easiest way to implement ASM the first time, but it comes with some
side effects. Because we use CHILD_TYPE_RDB for the fork and rdb.c/rdb.h
functions to save the snapshot, the logs get messed up (we print logs
saying we are doing RDB work) and so do the info fields (we report
rdb_bgsave_in_progress while we are actually doing slot migration).

In addition, it makes the code difficult to maintain. The rdb_save method uses
a lot of rdb_* variables, but we are actually doing slot migration. If we want
to support one fork with multiple target nodes, we need to rewrite this code
for a better cleanup.

Note that the changes to rdb.c/rdb.h revert previous changes from when
we were reusing this code for slot migration. The slot migration snapshot logic
is similar to the previous diskless replication: we use a pipe to transfer the
snapshot data from the child process to the parent process.
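
The child-to-parent pipe transfer mentioned above follows the usual POSIX pattern, sketched below. This is not the Valkey implementation: `snapshot_via_pipe` is a hypothetical function, the "snapshot" is just a string, and the parent reads synchronously rather than from an event loop.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks a child that writes payload into a pipe; the parent reads it back
 * into out (NUL-terminated). Returns the bytes read, or -1 on error.
 * cap must be larger than the payload length. */
ssize_t snapshot_via_pipe(const char *payload, char *out, size_t cap) {
    int fds[2];
    assert(payload != NULL && out != NULL && cap > 0);
    if (pipe(fds) == -1) return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: stands in for the snapshot process. */
        close(fds[0]);
        size_t len = strlen(payload);
        ssize_t w = write(fds[1], payload, len);
        close(fds[1]);
        _exit(w == (ssize_t)len ? 0 : 1);
    }

    /* Parent: a server would forward these bytes to the target connection. */
    close(fds[1]);
    size_t total = 0;
    ssize_t n;
    while (total < cap - 1 && (n = read(fds[0], out + total, cap - 1 - total)) > 0) {
        total += (size_t)n;
    }
    out[total] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return (ssize_t)total;
}
```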

Interface changes:
- New slot_migration_fork_in_progress info field.
- New cow_size field in CLUSTER GETSLOTMIGRATIONS command.
- Also add slot migration fork to the cluster class trace latency.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Co-authored-by: Jacob Murphy <jkmurphy@google.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).

This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at valkey-io#1591,
modified to meet the designs we had outlined in valkey-io#23.

## New commands

* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
* `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.

## Slot migration jobs

Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.

All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication

* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.

Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:

* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: requests that the source either grant or deny a
failover for the slot migration. The failover is granted only if the target
is still paused. Once a failover is granted, the pause is refreshed for
a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness

## Interactions with other commands

* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should succeed very quickly (the slot is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data

## Error handling

* For any transient connection drops, the migration will fail and the
user must retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.

## State machine

```

                Target/Importing Node State Machine
   ─────────────────────────────────────────────────────────────

             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │
        ┌───────────────▼──────────────┐ │
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤
        └───────────────┬──────────────┘ │
                  PAUSED│                │
        ┌───────────────▼──────────────┐ │ Error Conditions:
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM
        └───────────────┬──────────────┘ │  2. Slot Ownership Change
        FAILOVER-GRANTED│                │  3. Demotion to replica
         ┌──────────────▼─────────────┐  │  4. FLUSHDB
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout)
      Takeover Performed│                │
         ┌──────────────▼───────────┐    │
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤
         └──────────────────────────┘    │
                                         │
   ┌─────────────────────────────────────▼─┐
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│
   └────────────────────┬──────────────────┘
Unowned Slots Cleaned Up│
          ┌─────────────▼───────────┐
          │SLOT_MIGRATION_JOB_FAILED│
          └─────────────────────────┘

                      Source/Exporting Node State Machine
         ─────────────────────────────────────────────────────────────

               ┌──────────────────────┐
               │SLOT_EXPORT_CONNECTING├─────────┐
               └───────────┬──────────┘         │
                  Connected│                    │
             ┌─────────────▼────────────┐       │
             │SLOT_EXPORT_AUTHENTICATING┼───────┤
             └─────────────┬────────────┘       │
              Authenticated│                    │
             ┌─────────────▼────────────┐       │
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤
             └─────────────┬────────────┘       │
  ESTABLISH command written│                    │
     ┌─────────────────────▼─────────────┐      │
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤
     └─────────────────────┬─────────────┘      │
   Full response read (+OK)│                    │
          ┌────────────────▼──────────────┐     │ Error Conditions:
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION
          └────────────────┬──────────────┘     │  2. Slot ownership change
     No other child process│                    │  3. Demotion to replica
              ┌────────────▼───────────┐        │  4. FLUSHDB
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost
              └────────────┬───────────┘        │  6. AUTH failed
              Snapshot done│                    │  7. ERR from ESTABLISH command
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)
               └───────────┬─────────┘          │  10. No ack from target (timeout)
                      PAUSE│                    │  11. Client output buffer overrun
            ┌──────────────▼─────────────┐      │
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤
            └──────────────┬─────────────┘      │
             Buffer drained│                    │
            ┌──────────────▼────────────┐       │
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤
            └──────────────┬────────────┘       │
   Failover request granted│                    │
           ┌───────────────▼────────────┐       │
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤
           └───────────────┬────────────┘       │
      New topology received│                    │
            ┌──────────────▼───────────┐        │
            │SLOT_MIGRATION_JOB_SUCCESS│        │
            └──────────────────────────┘        │
                                                │
            ┌─────────────────────────┐         │
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤
            └─────────────────────────┘         │
                                                │
           ┌────────────────────────────┐       │
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘
           └────────────────────────────┘
```

Co-authored-by: Binbin <binloveplay1314@qq.com>

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
Co-authored-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
Co-authored-by: Madelyn Olson <madelyneolson@gmail.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
We now pass rdbSnapshotOptions options into this function, and options.conns
is now malloc'ed on the caller side, so we need to zfree it when returning early
due to an error. Previously, conns was malloc'ed after the error handling, so we
didn't have this problem.

Introduced in valkey-io#1949.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
This may result in a meaningless slot migration job, so we should
return an error to the user in advance to avoid operator error.
Also, `by myself` is not correct English grammar and `myself`
is internal code terminology, so it is changed to `by this node`.

Was introduced in valkey-io#1949.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
If all cluster nodes have functions, slot migration will fail
since the target will return the function already exists error
when doing the FUNCTION LOAD.

And in addition, the target's replica could panic when it executes
the FUNCTION LOAD propagated from the primary (see
propagation-error-behavior).

Introduced in valkey-io#1949.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
…ous reading of auth response (valkey-io#2494)

With the old SLOT_EXPORT_AUTHENTICATING state added in valkey-io#1949, the source node
sends the AUTH command and then synchronously reads the response. If the target node is
blocked during this process, the source node is blocked as well. The response should be
handled by a read handler instead, so SLOT_EXPORT_AUTHENTICATING is split into
SLOT_EXPORT_SEND_AUTH and SLOT_EXPORT_READ_AUTH_RESPONSE to avoid this issue.
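
The split can be pictured as a two-step state machine in which neither step waits on the peer. The sketch below is an illustration under assumptions, not the code from the commit: the `sketch_` names, the `SKETCH_AUTHENTICATED` and `SKETCH_FAILED` states, and the idea that the event loop passes the reply in as a string are all hypothetical. The only protocol fact relied on is that a successful AUTH reply is the simple string `+OK`.

```c
#include <assert.h>
#include <string.h>

typedef enum {
    SKETCH_SEND_AUTH,          /* AUTH not yet written to the connection */
    SKETCH_READ_AUTH_RESPONSE, /* AUTH written, reply not yet received */
    SKETCH_AUTHENTICATED,
    SKETCH_FAILED
} sketch_export_state;

typedef struct {
    sketch_export_state state;
} sketch_export_job;

/* Step 1: queue the AUTH command and return immediately. */
static void sketch_send_auth(sketch_export_job *job) {
    assert(job != NULL);
    if (job->state != SKETCH_SEND_AUTH) return;
    /* ... the AUTH command would be written to the connection here ... */
    job->state = SKETCH_READ_AUTH_RESPONSE;
}

/* Step 2: meant to be called by a read handler, i.e. only once reply
 * bytes have actually arrived, so the source never blocks on the target. */
static void sketch_on_auth_reply(sketch_export_job *job, const char *reply) {
    assert(job != NULL && reply != NULL);
    if (job->state != SKETCH_READ_AUTH_RESPONSE) return;
    job->state = (strncmp(reply, "+OK", 3) == 0) ? SKETCH_AUTHENTICATED
                                                 : SKETCH_FAILED;
}
```

Between the two calls the source node is free to serve other clients; with a single combined state, the read would have to happen inline right after the write.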

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
murphyjacob4 pushed a commit that referenced this pull request Jan 10, 2026
…n and data corruption (#3004)

When loading AOF in cluster mode, keys inside a MULTI/EXEC block could
be inserted into wrong hash slots, causing key duplication and data
corruption.

The root cause was the slot caching optimization in getKeySlot(). This
optimization reuses a cached slot value to avoid recalculating the hash
for every key operation. However, when replaying AOF, a transaction may
contain commands affecting keys in different slots. The cached slot from
a previous command (e.g., SET k1) would incorrectly be used for
subsequent commands in the transaction (e.g., SET k0), causing k0 to be
stored in k1's slot.

The existing code already skipped this optimization for replicated
clients (commands from primary) using isReplicatedClient(). This change
extends that to also skip for AOF clients by using mustObeyClient()
instead, which covers both replicated clients and the AOF client.
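
A small model of the stale-cache problem may make the failure mode concrete. This is a sketch, not the server's getKeySlot(): the `sketch_client` struct, its `must_obey` field, and the function names are hypothetical, and hash tags (`{...}`) are ignored. The slot formula itself, CRC16 (XMODEM variant) of the key modulo 16384, is the one from the cluster specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0. */
static uint16_t sketch_crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Slot of a key, ignoring hash tags for brevity. */
static int sketch_key_slot(const char *key) {
    assert(key != NULL);
    return sketch_crc16(key, strlen(key)) & 16383;
}

typedef struct {
    int cached_slot; /* -1 when nothing is cached */
    int must_obey;   /* hypothetical flag: replicated client or AOF client */
} sketch_client;

/* Reuse the cached slot only for clients whose commands are known to
 * target a single slot; always recompute for must-obey clients. */
static int sketch_get_key_slot(const sketch_client *c, const char *key) {
    assert(c != NULL);
    if (c->cached_slot >= 0 && !c->must_obey) return c->cached_slot;
    return sketch_key_slot(key);
}
```

With `cached_slot` left over from `SET k1` and `must_obey` set to 0, a lookup for `k0` returns k1's slot, which is the corruption described above; setting `must_obey` to 1 forces the hash to be recomputed.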

Fixes #2995, introduced in #1949.

Signed-off-by: aditya.teltia <teltia.aditya22@gmail.com>
zuiderkwast pushed a commit to zuiderkwast/valkey that referenced this pull request Jan 29, 2026
…n and data corruption (valkey-io#3004)

Signed-off-by: aditya.teltia <teltia.aditya22@gmail.com>
zuiderkwast pushed a commit to zuiderkwast/valkey that referenced this pull request Jan 30, 2026
…n and data corruption (valkey-io#3004)

Signed-off-by: aditya.teltia <teltia.aditya22@gmail.com>
zuiderkwast pushed a commit that referenced this pull request Feb 3, 2026
…n and data corruption (#3004)

Signed-off-by: aditya.teltia <teltia.aditya22@gmail.com>
harrylin98 pushed a commit to harrylin98/valkey_forked that referenced this pull request Feb 19, 2026
…n and data corruption (valkey-io#3004)

Signed-off-by: aditya.teltia <teltia.aditya22@gmail.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Mar 5, 2026
…n and data corruption (valkey-io#3004)

Signed-off-by: aditya.teltia <teltia.aditya22@gmail.com>
Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Apr 4, 2026
valkey.conf documents that slot-migration-max-failover-repl-bytes can be
set to -1 to disable the limit.
```
Setting this to -1 will disable this limit
```

But slot-migration-max-failover-repl-bytes is defined as MEMORY_CONFIG
and memtoull() rejects negative inputs, making it impossible to set the
value to -1 via config file or CONFIG SET.
```
>>> 'slot-migration-max-failover-repl-bytes "-1"'
argument must be a memory value
```

Introduce a SIGNED_MEMORY_CONFIG flag for memory configs that also accept
plain negative numbers. When memtoull() fails and this flag is set, fall
back to string2ll() for parsing. Use ll2string() for CONFIG GET and
rewriteConfigNumericalOption() for CONFIG REWRITE when the value is
negative.
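
The parse-then-fall-back order can be sketched as follows. This is an illustration under assumptions rather than the patch itself: `sketch_parse_memory` is a simplified, hypothetical stand-in for memtoull() that knows only a few lowercase units, and `strtoll` stands in for string2ll(). The `allow_negative` parameter plays the role of the SIGNED_MEMORY_CONFIG flag.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Simplified unsigned memory parser: digits plus an optional unit.
 * Like the real parser it rejects negative input. Returns 1 on success.
 * Multiplication overflow is not checked in this sketch. */
static int sketch_parse_memory(const char *s, unsigned long long *out) {
    assert(s != NULL && out != NULL);
    if (*s < '0' || *s > '9') return 0;
    char *end;
    errno = 0;
    unsigned long long value = strtoull(s, &end, 10);
    if (errno != 0) return 0;
    unsigned long long mul;
    if (*end == '\0' || strcmp(end, "b") == 0) mul = 1ULL;
    else if (strcmp(end, "kb") == 0) mul = 1024ULL;
    else if (strcmp(end, "mb") == 0) mul = 1024ULL * 1024;
    else if (strcmp(end, "gb") == 0) mul = 1024ULL * 1024 * 1024;
    else return 0;
    *out = value * mul;
    return 1;
}

/* Try the memory parser first; if it fails and the config is flagged as
 * signed, accept a plain negative integer with no unit. */
static int sketch_parse_signed_memory(const char *s, int allow_negative,
                                      long long *out) {
    assert(s != NULL && out != NULL);
    unsigned long long mem;
    if (sketch_parse_memory(s, &mem)) {
        if (mem > (unsigned long long)LLONG_MAX) return 0;
        *out = (long long)mem;
        return 1;
    }
    if (!allow_negative) return 0;
    char *end;
    errno = 0;
    long long value = strtoll(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || value >= 0) return 0;
    *out = value;
    return 1;
}
```

Trying the memory parser first keeps existing inputs such as `64mb` on their current path, and the fallback only widens what is accepted for configs that opt in.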

Add a serverAssert in initConfigValues() to enforce that PERCENT_CONFIG
and SIGNED_MEMORY_CONFIG are never combined on the same config, since
both use negative values with different semantics.
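
The mutual-exclusion check amounts to a bit test on the config's flags. In the sketch below the bit values and function names are hypothetical, and a plain `assert` stands in for serverAssert; only the two flag names come from the commit message.

```c
#include <assert.h>

/* Hypothetical bit assignments for the two flags. */
#define SKETCH_PERCENT_CONFIG (1u << 0)
#define SKETCH_SIGNED_MEMORY_CONFIG (1u << 1)

/* Returns 1 unless both flags are set on the same config. */
static int sketch_flags_compatible(unsigned int flags) {
    return !((flags & SKETCH_PERCENT_CONFIG) &&
             (flags & SKETCH_SIGNED_MEMORY_CONFIG));
}

/* Startup-time check: abort if a config combines the two flags, since a
 * stored negative value could then be read under either meaning. */
static void sketch_check_config_flags(unsigned int flags) {
    assert(sketch_flags_compatible(flags));
}
```

Running the check once at initialization turns a misdeclared config into an immediate failure instead of a silently misread value.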

This means we have had this issue since it was introduced in valkey-io#1949.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Apr 8, 2026
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
enjoy-binbin added the enhancement and major-decision-approved labels Apr 16, 2026
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…y-io#3443)

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
madolson pushed a commit that referenced this pull request Apr 27, 2026
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request May 7, 2026
…y-io#3443)

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
valkeyrie-ops Bot pushed a commit that referenced this pull request May 18, 2026
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Labels

cluster, enhancement, major-decision-approved, needs-doc-pr, release-notes, run-extra-tests

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[NEW] Atomic slot migration HLD

9 participants