Introduce atomic slot migration by murphyjacob4 · Pull Request #1949 · valkey-io/valkey

murphyjacob4 · 2025-04-13T12:54:26Z

Introduces a new family of commands for migrating slots via replication. The procedure is driven by the source node, which pushes an AOF-formatted snapshot of the slots to the target, followed by a replication stream of changes to those slots (a la manual failover).

This solution is an adaptation of the solution provided by @enjoy-binbin, combined with the solution I previously posted at #1591, modified to meet the designs we had outlined in #23.

New commands

CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE node-id: Begin sending the given slots via replication to the target node. Multiple targets can be specified by repeating SLOTSRANGE ... NODE ...
CLUSTER CANCELMIGRATION ALL: Cancel all slot migrations
CLUSTER GETSLOTMIGRATIONS: See a recent log of migrations
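
As an illustration of the CLUSTER MIGRATESLOTS syntax above, here is a minimal sketch of a client-side helper that assembles the argument list. The helper name `build_migrateslots_args` and its input shape are hypothetical and not part of this PR; the only things taken from the description are the token order and the fact that a SLOTSRANGE ... NODE ... group can be repeated per target.

```python
from typing import Iterable, List, Sequence, Tuple

SlotRange = Tuple[int, int]
MAX_SLOT = 16383  # cluster slots are numbered 0..16383


def build_migrateslots_args(
    targets: Iterable[Tuple[str, Sequence[SlotRange]]],
) -> List[str]:
    """Build the argument list for CLUSTER MIGRATESLOTS (hypothetical helper).

    `targets` is an iterable of (node_id, [(start, end), ...]) pairs. Each pair
    becomes one `SLOTSRANGE start end [start end]... NODE node-id` group.
    """
    args = ["CLUSTER", "MIGRATESLOTS"]
    for node_id, ranges in targets:
        if not ranges:
            raise ValueError("each target needs at least one slot range")
        args.append("SLOTSRANGE")
        for start, end in ranges:
            if not (0 <= start <= end <= MAX_SLOT):
                raise ValueError(f"invalid slot range: {start}-{end}")
            args += [str(start), str(end)]
        args += ["NODE", node_id]
    return args


# Example: move two ranges to one (made-up) target node ID.
example = build_migrateslots_args([("a" * 40, [(0, 99), (200, 299)])])
```

The resulting list can be handed to any client that accepts raw command arguments.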

This PR only implements "one shot" semantics with an asynchronous model. Later, "two phase" (e.g. slot level replicate/failover commands) can be added with the same core.

Slot migration jobs

Introduces the concept of a slot migration job. While active, a job tracks a connection created by the source to the target over which the contents of the slots are sent. This connection is used for control messages as well as replicated slot data. Each job is given a 40-character random name to help uniquely identify it.

All jobs, including those that finished recently, can be observed using the CLUSTER GETSLOTMIGRATIONS command.
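
To make the naming scheme concrete, a minimal sketch of generating a 40-character random job name follows. The description only says the name is 40 random characters; the hex alphabet and the function name `new_job_name` are assumptions made for illustration.

```python
import secrets


def new_job_name() -> str:
    """Return a 40-character random name (assumed here to be lowercase hex)."""
    # 20 random bytes render as 40 hex characters.
    return secrets.token_hex(20)


name = new_job_name()
```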

Replication

Since the snapshot uses AOF, the snapshot can be replayed verbatim to any replicas of the target node.
We use the same proxying mechanism used for chaining replication to copy the content sent by the source node directly to the replica nodes.

`CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new command is added, CLUSTER SYNCSLOTS, that performs this control flow.

Each end of the slot migration connection is expected to install a read handler in order to handle CLUSTER SYNCSLOTS commands:

ESTABLISH: Begins a slot migration. Provides slot migration information to the target and authorizes the connection to write to unowned slots.
SNAPSHOT-EOF: appended to the end of the snapshot to signal that the snapshot is done being written to the target.
PAUSE: informs the source node to pause whenever it gets the opportunity
PAUSED: added to the end of the client output buffer when the pause is performed. The pause is only performed after the buffer shrinks below a configurable size
REQUEST-FAILOVER: request the source to either grant or deny a failover for the slot migration. The failover is only granted if the source is still paused. Once a failover is granted, the pause is refreshed for a short duration
FAILOVER-GRANTED: sent to the target to inform that REQUEST-FAILOVER is granted
ACK: heartbeat command used to ensure liveness
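
To show what one of these control messages might look like on the wire, here is a minimal sketch that encodes a command as a RESP array of bulk strings. It assumes the SYNCSLOTS subcommands travel over the migration connection as ordinary RESP-encoded commands; the function name `encode_resp_command` is hypothetical, and any arguments the real subcommands take are not shown.

```python
def encode_resp_command(*parts: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    out = [b"*%d\r\n" % len(parts)]
    for part in parts:
        data = part.encode("utf-8")
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


# Heartbeat message, assuming ACK takes no further arguments.
ack = encode_resp_command("CLUSTER", "SYNCSLOTS", "ACK")
```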

Interactions with other commands

FLUSHDB on the source node (which flushes the migrating slot) will result in the source dropping the connection, which will flush the slot on the target and reset the state machine back to the beginning. The subsequent retry should succeed very quickly, since the slot is now empty.
FLUSHDB on the target will fail the slot migration. We can iterate with better handling, but for now it is expected that the operator would retry.
Generally, FLUSHDB is expected to be executed cluster-wide, so preserving partially migrated slots doesn't make much sense.
SCAN and KEYS are filtered to avoid exposing importing slot data
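
The SCAN/KEYS filtering can be pictured as a per-key slot check. The sketch below is not the PR's implementation: it computes the standard cluster hash slot (CRC16/XMODEM of the key, or of its hash tag, modulo 16384) and drops keys whose slot is in a set of importing slots. The names `key_slot` and `visible_keys` are hypothetical.

```python
from typing import Iterable, List, Set


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used for cluster slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Map a key to its hash slot, honoring a non-empty {hash tag}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # tag must contain at least one byte
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


def visible_keys(keys: Iterable[bytes], importing_slots: Set[int]) -> List[bytes]:
    """Hide keys that live in slots still being imported (illustrative only)."""
    return [k for k in keys if key_slot(k) not in importing_slots]
```

For example, with `importing_slots = {key_slot(b"foo")}`, a call to `visible_keys([b"foo", b"bar"], importing_slots)` keeps only `b"bar"`.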

Error handling

Any transient connection drop will fail the migration and require the user to retry.
If there is an OOM while reading from the import connection, we will fail the import, which will drop the importing slot data
If the client output buffer limit is reached on the source node, the source will drop the connection, which will cause the migration to fail
If at any point the export loses ownership or either node is failed over, a callback will be triggered on both ends of the migration to fail the import. The import will not reattempt with a new owner
The two ends of the migration are routinely pinging each other with SYNCSLOTS ACK messages. If at any point there is no interaction on the connection for longer than repl-timeout, the connection will be dropped, resulting in migration failure
If a failover happens, we will drop keys in all unowned slots. The migration does not persist through failovers and would need to be retried on the new source/target.
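
The liveness rule above (no interaction for longer than repl-timeout drops the connection) reduces to a simple timestamp comparison. The sketch below is illustrative only; the function name and the use of seconds as the unit are assumptions.

```python
def link_timed_out(last_interaction: float, now: float, repl_timeout: float) -> bool:
    """True when the connection has been idle for longer than repl_timeout.

    Any traffic on the link, including SYNCSLOTS ACK heartbeats, is assumed
    to refresh `last_interaction`.
    """
    return (now - last_interaction) > repl_timeout


# With a 60 second timeout, 61 seconds of silence fails the migration,
# while exactly 60 seconds does not.
expired = link_timed_out(last_interaction=100.0, now=161.0, repl_timeout=60.0)
still_ok = link_timed_out(last_interaction=100.0, now=160.0, repl_timeout=60.0)
```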

State machine

                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      
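
The target-side diagram above can be read as a small transition table. The following sketch encodes that reading; it is not the PR's code. The event names "TAKEOVER", "CLEANED_UP", and "ERROR" are made-up labels for the diagram's "Takeover Performed", "Unowned Slots Cleaned Up", and error-condition edges, and treating SLOT_MIGRATION_JOB_SUCCESS as terminal is a simplifying assumption.

```python
from enum import Enum, auto


class ImportState(Enum):
    WAIT_ACK = auto()
    RECEIVE_SNAPSHOT = auto()
    WAITING_FOR_PAUSED = auto()
    FAILOVER_REQUESTED = auto()
    FAILOVER_GRANTED = auto()
    FINISHED_WAITING_TO_CLEANUP = auto()
    JOB_SUCCESS = auto()
    JOB_FAILED = auto()


S = ImportState
TRANSITIONS = {
    (S.WAIT_ACK, "ACK"): S.RECEIVE_SNAPSHOT,
    (S.RECEIVE_SNAPSHOT, "SNAPSHOT-EOF"): S.WAITING_FOR_PAUSED,
    (S.WAITING_FOR_PAUSED, "PAUSED"): S.FAILOVER_REQUESTED,
    (S.FAILOVER_REQUESTED, "FAILOVER-GRANTED"): S.FAILOVER_GRANTED,
    (S.FAILOVER_GRANTED, "TAKEOVER"): S.JOB_SUCCESS,
    (S.FINISHED_WAITING_TO_CLEANUP, "CLEANED_UP"): S.JOB_FAILED,
}
TERMINAL = {S.JOB_SUCCESS, S.JOB_FAILED}


def step(state: ImportState, event: str) -> ImportState:
    """Advance the import state machine by one event."""
    if state in TERMINAL:
        return state  # assumption: terminal states absorb further events
    if event == "ERROR":
        # Any listed error condition (OOM, FLUSHDB, connection lost, ...)
        # routes through cleanup before the job is marked failed.
        return S.FINISHED_WAITING_TO_CLEANUP
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"unexpected event {event!r} in state {state.name}")


def run(events):
    """Fold a sequence of events over the state machine from WAIT_ACK."""
    state = S.WAIT_ACK
    for event in events:
        state = step(state, event)
    return state
```

Running the happy path `["ACK", "SNAPSHOT-EOF", "PAUSED", "FAILOVER-GRANTED", "TAKEOVER"]` ends in `JOB_SUCCESS`, while an `"ERROR"` at any earlier point followed by `"CLEANED_UP"` ends in `JOB_FAILED`.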

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘
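
The source-side diagram is a linear sequence of steps with two exits, so it can be sketched as an ordered list rather than a full table. This is an interpretation of the diagram, not the PR's code: "CANCELMIGRATION" is assumed to lead to SLOT_MIGRATION_JOB_CANCELLED, every other error condition (collapsed here into a made-up "ERROR" event) to SLOT_MIGRATION_JOB_FAILED.

```python
# (state, event that advances to the next state), in diagram order.
EXPORT_STEPS = [
    ("SLOT_EXPORT_CONNECTING", "Connected"),
    ("SLOT_EXPORT_AUTHENTICATING", "Authenticated"),
    ("SLOT_EXPORT_SEND_ESTABLISH", "ESTABLISH command written"),
    ("SLOT_EXPORT_READ_ESTABLISH_RESPONSE", "Full response read (+OK)"),
    ("SLOT_EXPORT_WAITING_TO_SNAPSHOT", "No other child process"),
    ("SLOT_EXPORT_SNAPSHOTTING", "Snapshot done"),
    ("SLOT_EXPORT_STREAMING", "PAUSE"),
    ("SLOT_EXPORT_WAITING_TO_PAUSE", "Buffer drained"),
    ("SLOT_EXPORT_FAILOVER_PAUSED", "Failover request granted"),
    ("SLOT_EXPORT_FAILOVER_GRANTED", "New topology received"),
]
SUCCESS = "SLOT_MIGRATION_JOB_SUCCESS"
FAILED = "SLOT_MIGRATION_JOB_FAILED"
CANCELLED = "SLOT_MIGRATION_JOB_CANCELLED"


def run_export(events):
    """Return the state name reached after applying `events` in order."""
    index = 0
    for event in events:
        if index == len(EXPORT_STEPS):
            break  # already succeeded; ignore trailing events
        if event == "CANCELMIGRATION":
            return CANCELLED
        if event == "ERROR":
            return FAILED
        state, expected = EXPORT_STEPS[index]
        if event != expected:
            raise ValueError(f"unexpected event {event!r} in state {state}")
        index += 1
    return SUCCESS if index == len(EXPORT_STEPS) else EXPORT_STEPS[index][0]
```

For instance, `run_export(["Connected", "Authenticated", "CANCELMIGRATION"])` returns the cancelled state, and feeding every event in `EXPORT_STEPS` in order returns the success state.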

Closes #23.

Co-authored-by: Binbin binloveplay1314@qq.com

1. Define new structure slotRange and clusterSlotSyncLink; 2. Add CLUSTER SLOTLINK command to manage all the slot sync links. Signed-off-by: Binbin <binloveplay1314@qq.com>

1. Add CLUSTER SLOTSYNC/SLOTSYNCFORCE command to trigger slot sync. Signed-off-by: Binbin <binloveplay1314@qq.com>

1. Extend the SYNC command, let it specify the slot ranges; 2. Enable filtering of the keys in the specified slots when generating the rdb; 3. Implement the handshake process before rdb transfer for slot sync. Signed-off-by: Binbin <binloveplay1314@qq.com>

1. Implement the rdb transfer and loading for slot sync. Signed-off-by: Binbin <binloveplay1314@qq.com>

1. Enable to filter the cmds in the specified slots when feed slaves; 2. Implement the messages exchange for slot sync; 3. Add clusterSlotSyncCron() to handle time events for slot sync. Signed-off-by: Binbin <binloveplay1314@qq.com>

1. Add CLUSTER FAILOVER command to trigger slot failover; 2. Implement the process of slot failover. Signed-off-by: Binbin <binloveplay1314@qq.com>

1. Improve the delDbKeysInSlot() to support time limit; 2. Implement the slot pending delete. Signed-off-by: Binbin <binloveplay1314@qq.com>