geoffwilliams@home:~$

Debugging asymmetric routing

TL;DR

If you find yourself debugging asymmetric routing, reconsider what you are setting out to do and possibly redesign your network too.

Long version

Recently I had some strange stuff happening on the network:

  • Linux servers with multiple IP addresses - each belonging to virtual NICs on different VLANs
  • Hierarchy of VLANs - some can talk to everything, others are more limited
  • OPNsense router
  • Services like Nexus would work when accessed via one IP address but time out after a few seconds when accessed via the other assigned IP address
  • ssh fails even more spectacularly with a frozen screen yet works perfectly when accessed with the server’s other IP address. A fun way to test this is running sl multiple times - and watching for a crashed train.
  • Connections freeze/error after 30 seconds.

Troubleshooting

Initially I thought this was something to do with my new OPNsense router. There were many Default deny / state violation rule messages:

[Screenshot: OPNsense blocked]

It is possible to “fix” the problem in OPNsense:

  • Firewall -> Settings -> Advanced -> Firewall Optimization: Set to conservative and connections now fail after 15 minutes instead of 30 seconds
  • Firewall -> Settings -> Advanced -> Disable firewall: Disabling firewall fixes this problem, but breaks everything else
  • Create a floating firewall rule to disable state tracking (Advanced screen)

Cause?

In a nutshell, the cause of all these problems is asymmetric routing - traffic is being routed differently on ingress vs egress, like this:

  1. Traffic arrives for a service on one IP address and is connection tracked by OPNsense
  2. Reply is sent out via a different interface (the one holding the server's other IP address), so it bypasses the path the request took
  3. OPNsense doesn’t see any of the replies, so terminates the connection (as it should!) after 30 seconds (optimization: normal) or 15 minutes (optimization: conservative)
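
Why would the reply leave on a different path? One likely explanation, assuming the server also has an interface directly on the client's VLAN, is that the kernel prefers that connected route over sending the reply back through the router. The sketch below mimics that decision in plain shell. The interface names (vlan10, vlan50) and the /24 prefix lengths are assumptions for illustration, not values taken from the real setup.

```sh
#!/bin/sh
# Sketch only: mimics "connected route beats default gateway".
# Interface names and /24 prefixes are hypothetical.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if address $1 falls inside CIDR block $2 (e.g. 10.10.0.0/24).
in_subnet() (
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
)

# reply_path CLIENT IFACE=CIDR...
# Report which locally connected interface (if any) would carry the reply.
reply_path() (
  client=$1
  shift
  result="via default gateway"
  for entry in "$@"; do
    if in_subnet "$client" "${entry#*=}"; then
      result="direct via ${entry%%=*}"
      break
    fi
  done
  echo "$result"
)

# The client from the capture, against the two assumed server interfaces.
reply_path 10.10.0.224 vlan10=10.10.0.0/24 vlan50=10.50.0.0/24
# prints: direct via vlan10
```

The request arrives through the router on one VLAN, but the reply goes straight out on the other, so the router only ever sees half of the conversation. On a real host, `ip route get 10.10.0.224 from 10.50.0.100` should answer the same question using the kernel's actual routing tables.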

For a more in-depth explanation, see https://docs.netgate.com/pfsense/en/latest/troubleshooting/asymmetric-routing.html

How do we prove it?

tcpdump is the de facto way to prove suspected asymmetric routing:

# e.g. ssh from 10.10.0.224 to 10.50.0.100, captured on 10.50.0.100
# "host A and host B" matches both directions of the conversation
tcpdump -nn -e 'port 22 and host 10.10.0.224 and host 10.50.0.100'
09:23:57.101346 00:e0:4c:3d:67:d1 > d2:0b:44:2c:b7:1e, ethertype 802.1Q (0x8100), length 78: vlan 50, p 0, ethertype IPv4 (0x0800), 10.10.0.224.59328 > 10.50.0.100.22: Flags [S], seq 1759358987, win 64240, options [mss 1460,sackOK,TS val 2819072772 ecr 0,nop,wscale 7], length 0
09:23:57.101435 86:71:80:2f:39:9f > 90:09:df:e8:3a:a0, ethertype 802.1Q (0x8100), length 78: vlan 10, p 0, ethertype IPv4 (0x0800), 10.50.0.100.22 > 10.10.0.224.59328: Flags [S.], seq 3573815087, ack 1759358988, win 65160, options [mss 1460,sackOK,TS val 2918014551 ecr 2819072772,nop,wscale 7], length 0
09:23:57.102938 00:e0:4c:3d:67:d1 > d2:0b:44:2c:b7:1e, ethertype 802.1Q (0x8100), length 70: vlan 50, p 0, ethertype IPv4 (0x0800), 10.10.0.224.59328 > 10.50.0.100.22: Flags [.], ack 1, win 502, options [nop,nop,TS val 2819072774 ecr 2918014551], length 0
09:23:57.103570 00:e0:4c:3d:67:d1 > d2:0b:44:2c:b7:1e, ethertype 802.1Q (0x8100), length 110: vlan 50, p 0, ethertype IPv4 (0x0800), 10.10.0.224.59328 > 10.50.0.100.22: Flags [P.], seq 1:41, ack 1, win 502, options [nop,nop,TS val 2819072774 ecr 2918014551], length 40: SSH: SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u3

Eventually the firewall kills the connection:

09:24:33.053312 86:71:80:2f:39:9f > 90:09:df:e8:3a:a0, ethertype 802.1Q (0x8100), length 958: vlan 10, p 0, ethertype IPv4 (0x0800), 10.50.0.100.22 > 10.10.0.224.59328: Flags [P.], seq 205321:206209, ack 4865, win 488, options [nop,nop,TS val 2918050503 ecr 2819108056], length 888
09:24:33.885345 86:71:80:2f:39:9f > 90:09:df:e8:3a:a0, ethertype 802.1Q (0x8100), length 958: vlan 10, p 0, ethertype IPv4 (0x0800), 10.50.0.100.22 > 10.10.0.224.59328: Flags [P.], seq 205321:206209, ack 4865, win 488, options [nop,nop,TS val 2918051335 ecr 2819108056], length 888
09:24:35.549281 86:71:80:2f:39:9f > 90:09:df:e8:3a:a0, ethertype 802.1Q (0x8100), length 958: vlan 10, p 0, ethertype IPv4 (0x0800), 10.50.0.100.22 > 10.10.0.224.59328: Flags [P.], seq 205321:206209, ack 4865, win 488, options [nop,nop,TS val 2918052999 ecr 2819108056], length 888
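
These three packets carry the same sequence range (205321:206209), so the server is retransmitting one segment that is never acknowledged. A rough way to see the TCP retransmission backoff is to print the gap between consecutive timestamps. The helper name `retransmit_gaps` is made up for this sketch, and it assumes all timestamps fall on the same day.

```sh
#!/bin/sh
# Sketch: print the time gap (in seconds) between consecutive tcpdump lines.
# Reads lines whose first field is an HH:MM:SS.micros timestamp.
retransmit_gaps() {
  awk '
    {
      split($1, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (NR > 1) printf "%.3f\n", now - prev
      prev = now
    }
  '
}

# Timestamps taken from the three retransmissions in the capture.
printf '%s\n' \
  '09:24:33.053312 retransmit' \
  '09:24:33.885345 retransmit' \
  '09:24:35.549281 retransmit' | retransmit_gaps
# prints:
# 0.832
# 1.664
```

The gap doubles each time, which is what a sender does when its data is not being acknowledged.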

The tcpdump capture proves asymmetric routing:

  • The traffic from 10.10.0.224 to 10.50.0.100 is tagged with VLAN 50.
  • The return traffic from 10.50.0.100 to 10.10.0.224 is tagged with VLAN 10.
  • This suggests that outbound and inbound packets are taking different paths (or interfaces) through the network.
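
With a longer capture it is tedious to compare VLAN tags by eye. The following is a minimal sketch that summarises `tcpdump -nn -e` output as packet counts per direction and VLAN tag. The function name `vlan_by_direction` is hypothetical, and the parsing assumes 802.1Q-tagged IPv4 lines in the same format as the capture shown here.

```sh
#!/bin/sh
# Sketch: count packets per "src > dst vlan N" from tcpdump -nn -e output.
# Usage (capture file name is hypothetical):
#   tcpdump -nn -e -r capture.pcap 'port 22' | vlan_by_direction
vlan_by_direction() {
  awk '
    {
      vlan = ""; src = ""; dst = ""
      for (i = 1; i <= NF; i++) {
        # The 802.1Q tag is printed as "vlan 50,".
        if ($i == "vlan" && vlan == "") {
          vlan = $(i + 1)
          sub(/,$/, "", vlan)
        }
        # The first ">" after the tag separates "ip.port > ip.port:".
        # (The earlier ">" between the MAC addresses is skipped.)
        if ($i == ">" && vlan != "") {
          src = $(i - 1)
          dst = $(i + 1)
          sub(/\.[0-9]+$/, "", src)    # strip source port
          sub(/\.[0-9]+:$/, "", dst)   # strip destination port and colon
          break
        }
      }
      if (vlan != "" && src != "") seen[src " > " dst " vlan " vlan]++
    }
    END { for (k in seen) print k ": " seen[k] }
  ' | sort
}
```

If each direction of the same conversation shows up under a different VLAN tag, the routing is asymmetric. For the first four packets of the capture this would report three packets for `10.10.0.224 > 10.50.0.100 vlan 50` and one for `10.50.0.100 > 10.10.0.224 vlan 10`.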

So how do we fix this?

The problem with the OPNsense "fixes" listed earlier is that they are really just sloppy workarounds that enable bad behavior.

In particular, if traffic isn't routed in the expected direction, it's both confusing and useless to have multiple connected VLANs.

The reason some of the servers were reachable on multiple IP addresses was to isolate network services from each other for security and QoS reasons, e.g.:

  • Nexus server needs to be reachable from anywhere so servers can download files
  • Rook servers have a dedicated VLAN for OSD traffic
  • …etc.

The crux of the issue is the amount of inter-VLAN traffic in the network. Recall that the primary purpose of VLANs is to restrict traffic, not to enable routing. All of this inter-VLAN traffic has to pass through my router on a stick, which creates its own headaches, especially when dealing with high-throughput workloads like Ceph.

There are some things we can do to mitigate the router bottleneck since I have Omada smart switches available. "Inter VLAN routing" has a really good explanation of how to optimize things.

After considering all the options, the best thing to do in this particular case was to design the network again from scratch and put some serious thought into what VLANs are required and how/where to use these across the network.

This took longer but really paid off. After re-VLANing and re-IPing, I ended up with more VLANs - each doing one thing well - and a better plan for associating them with services. None of the above workarounds or additional switch settings were needed, ssh works everywhere without crashing, and I learned a lot about networking.
