ovn_maintenance_worker deleting a healthy router external gw LRP

Bug #2148271 reported by Ching Kuo
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | New | Undecided | Unassigned | |

Bug Description

## Summary

In a Kolla-Ansible OVN deployment, `neutron_ovn_maintenance_worker` can mark a healthy router as inconsistent, rewrite its external gateway, and leave the gateway `Logical_Router_Port` missing in OVN NB.

After that:

- Neutron still shows the router external gateway
- the Neutron gateway port still exists and is `ACTIVE`
- the OVN gateway `Logical_Router_Port` is missing
- new floating IP create/repair fails with `AttributeError: 'NoneType' object has no attribute 'options'`

## High Level Description

I am trying to use floating IPs behind an OVN Neutron router in a multi-node Kolla-Ansible deployment. A healthy router can be broken by the periodic `neutron_ovn_maintenance_worker` reconciliation flow, after which new floating IP programming fails.

## Pre-conditions

- reporter privileges: admin credentials were used to reproduce the issue
- deployment: multi-node Kolla-Ansible
- OpenStack networking backend: ML2/OVN
- external/provider network on `physnet1`
- OVN gateway scheduling enabled

Test objects used in reproduction:

- `test_router_1`: `febfe173-c698-4a8b-80e1-f15202ec4123`
- `test_gateway_port_1`: `68f0bf00-866a-464f-a846-dc10f9d13922`
- `test_floating_ip_1`: `2fe3e8ab-4794-4e74-8d29-df659d0145bd`
- `test_floating_ip_2`: `b4c2fa30-6478-48e5-8b9d-526b31e8b973`

## Step-by-step Reproduction

1. Create an isolated router with an external gateway on `public`.
2. Add a tenant subnet behind that router.
3. Create a tenant port and associate `test_floating_ip_1`.
4. Verify `test_floating_ip_1` becomes `ACTIVE` and OVN has:
   - gateway LRP `lrp-test_gateway_port_1`
   - a `dnat_and_snat` NAT row for `test_floating_ip_1`
5. Wait for the periodic `neutron_ovn_maintenance_worker` consistency check.
6. Create a second tenant port and associate `test_floating_ip_2`.
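
The gateway LRP check in step 4 (and again after step 5) can be scripted. The following is a minimal sketch, assuming `ovn-nbctl` is on `PATH` and can reach the NB database (in a Kolla-Ansible deployment it may need to run inside the OVN NB container). The function names are hypothetical helpers and are not part of Neutron or OVN.

```python
import subprocess


def lrp_find_command(gateway_port_id: str) -> list[str]:
    """Build an ovn-nbctl query for the LRP backing a Neutron gateway port."""
    # The LRP name follows the "lrp-<neutron port id>" pattern seen in this report.
    return [
        "ovn-nbctl", "--bare", "--columns=name",
        "find", "Logical_Router_Port", f"name=lrp-{gateway_port_id}",
    ]


def lrp_exists(nbctl_output: str, gateway_port_id: str) -> bool:
    """Return True if the query output contains the expected LRP name."""
    return f"lrp-{gateway_port_id}" in nbctl_output.split()


def check_gateway_lrp(gateway_port_id: str) -> bool:
    """Run the query against the live NB database (assumes ovn-nbctl works here)."""
    result = subprocess.run(
        lrp_find_command(gateway_port_id),
        capture_output=True, text=True, check=True,
    )
    return lrp_exists(result.stdout, gateway_port_id)
```

Calling `check_gateway_lrp("68f0bf00-866a-464f-a846-dc10f9d13922")` before and after the maintenance run would show whether the row disappeared in between.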

## Expected Output

- the maintenance worker must not break a healthy router
- if maintenance rewrites the external gateway, `lrp-test_gateway_port_1` must still exist after the repair
- `test_floating_ip_2` should be programmed in OVN and become `ACTIVE`
- floating IP create/repair must not crash when the gateway LRP lookup returns `None`
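
The last expectation can be illustrated with a defensive lookup. This is only a sketch of the idea, not Neutron's actual code path: the report does not include the traceback location, so `FakeLRP`, `GatewayPortMissing`, `lookup_lrp`, and `gateway_options` are all hypothetical stand-ins, and the NB database is modeled as a plain dict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeLRP:
    """Stand-in for a Logical_Router_Port row (hypothetical, not an OVN class)."""
    name: str
    options: dict = field(default_factory=dict)


class GatewayPortMissing(Exception):
    """Raised when the gateway LRP is absent from the (modeled) NB database."""


def lookup_lrp(nb_rows: dict, name: str) -> Optional[FakeLRP]:
    # Mirrors a lookup that returns None instead of raising when the row is absent.
    return nb_rows.get(name)


def gateway_options(nb_rows: dict, gateway_port_id: str) -> dict:
    """Return the gateway LRP options, failing with a clear error if it is missing."""
    name = f"lrp-{gateway_port_id}"
    lrp = lookup_lrp(nb_rows, name)
    if lrp is None:
        # Without this guard, lrp.options raises the AttributeError from the report.
        raise GatewayPortMissing(f"{name} not found in OVN NB")
    return lrp.options
```

A guard like this would not fix the missing LRP, but it would turn the crash into an error that names the missing row.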

## Actual Output

- maintenance first marks `test_router_1` inconsistent and rewrites its external gateway
- after that rewrite, OVN no longer has `lrp-test_gateway_port_1`
- `test_floating_ip_2` remains `DOWN`
- no OVN NAT row is created for `test_floating_ip_2`
- Neutron raises `AttributeError: 'NoneType' object has no attribute 'options'`
- Neutron still shows the router external gateway and the gateway port still exists and is `ACTIVE`

## Version

- OpenStack release: `2025.2`
- deployment mechanism: `Kolla-Ansible`
- Linux distro on a controller: `debian 13`
- kernel on a controller: `6.12.74+deb13+1-amd64`
- OVN version: `ovn-nbctl 25.09.0`
- Open vSwitch version: `ovs-vsctl (Open vSwitch) 3.6.0`

## Environment

- multi-node control/network deployment
- service-to-service interaction involved:
  - Neutron API / OVN maintenance worker
  - OVN northbound database
  - floating IP NAT programming on a router external gateway

## Perceived Severity

High.

The maintenance worker can corrupt a healthy router external gateway and leave floating IPs unusable until operator repair. This affects core north-south connectivity.

## Unknowns / Troubleshooting Notes

- The exact reason `test_router_1` was initially classified as inconsistent is unknown.
- The maintenance worker saw at least one bookkeeping problem on the gateway port during the repair:
  - `No revision row found for 68f0bf00-866a-464f-a846-dc10f9d13922 (type: router_ports) when bumping the revision number. Creating one.`
- It is not clear why that `router_ports` revision row was missing.
- Most routers in the environment appear to continue working normally. The problem seems to affect only routers that the maintenance worker decides to repair, not every router in the deployment. A user reported it while deploying a Kubernetes cluster with cluster-api.

## Relevant Log

### Maintenance worker starts router repair

`2026-04-13 16:28:30`

- `Maintenance task: Synchronizing Neutron and OVN databases started`
- `Number of inconsistencies found at create/update: networks=1, subnets=1, routers=1, router_ports=61, floatingips=2`
- `Fixing resource test_router_1 (type: routers)`
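
For tracking how often the maintenance task flags resources, the counts in the log line above can be extracted with a small parser. This is a sketch that assumes the message keeps the `name=count` layout quoted here; `parse_inconsistencies` is a hypothetical helper.

```python
import re

LOG_LINE = (
    "Number of inconsistencies found at create/update: "
    "networks=1, subnets=1, routers=1, router_ports=61, floatingips=2"
)


def parse_inconsistencies(line: str) -> dict[str, int]:
    """Extract per-resource inconsistency counts from the maintenance log line."""
    # Only look at the text after the "create/update:" prefix.
    _, _, tail = line.partition("create/update:")
    return {name: int(count) for name, count in re.findall(r"(\w+)=(\d+)", tail)}


counts = parse_inconsistencies(LOG_LINE)
print(counts)
# {'networks': 1, 'subnets': 1, 'routers': 1, 'router_ports': 61, 'floatingips': 2}
```

In this run `router_ports=61` stands out against a single flagged router.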

### Maintenance worker issues gateway rewrite

`2026-04-13 16:28:31`

- `DeleteLRouterExtGwCommand(_result=None, lrouter=neutron-test_router_1, if_exists=True, maintain_bfd=True)`
- `AddLRouterPortCommand(_result=None, name=lrp-test_gateway_port_1, ... may_exist=True, ...)`

### OVN monitor sees deletes

In the same time window as the gateway rewrite:

- delete `Logical_Router_Static_Route`
- delete NAT row
- delete `Gateway_Chassis`
- delete `Logical_Router_Port` row for the gateway LRP

### FIP path then crashes

`2026-04-13 16:29:07`

- `AttributeError: 'NoneType' object has no attribute 'options'`

Tags: ops ovn