KVM Network Bond/Bridge performance issue

ankesh · September 8, 2024, 1:35am

Hi everybody

As the title suggests, we have a bridge performance issue on one of our KVM host servers (Dell PowerEdge 750xs). To explain, here is a network diagram:
eth0 ---+
        +--- bond0 ---+
eth1 ---+             +--- bond1 ---> br0
eth2 -----------------+
(It's a mission-critical environment, so we have to maintain 100% uptime. bond0 runs LACP, and eth2 in bond1 acts as a backup when bond0 goes down.)

Earlier, we were having packet drop issues on both eth0 and eth1, and I narrowed it down to the ring buffer, which in my opinion is quite small:

[root@server2 ~]# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             2047
RX Mini:        n/a
RX Jumbo:       n/a
TX:             511
Current hardware settings:
RX:             2047
RX Mini:        n/a
RX Jumbo:       n/a
TX:             511
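
For anyone comparing their own NICs, here is a minimal sketch that summarizes `ethtool -g` output as current/maximum per ring. The helper name `ring_headroom` is made up for illustration, and it assumes the output layout shown above.

```shell
# Hypothetical helper: reads `ethtool -g <dev>` output on stdin and prints
# "RX current/max" and "TX current/max".
ring_headroom() {
    awk '
        /^Pre-set maximums:/          { section = "max" }
        /^Current hardware settings:/ { section = "cur" }
        # Only the plain RX:/TX: rows; "RX Mini:" and "RX Jumbo:" have $1 == "RX"
        $1 == "RX:" || $1 == "TX:"    { val[section, $1] = $2 }
        END {
            printf "RX %d/%d\n", val["cur", "RX:"], val["max", "RX:"]
            printf "TX %d/%d\n", val["cur", "TX:"], val["max", "TX:"]
        }'
}

# Usage (on the host):
#   ethtool -g eth0 | ring_headroom
```

When current equals maximum on both rings, as in the output above, there is nothing left to raise with `ethtool -G`.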

The driver is Broadcom's tg3, so I had to enable RPS (rps_cpus) to distribute the traffic across CPUs, and that solved the packet drop issue.
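
For reference, a rough sketch of how RPS can be set per receive queue through sysfs. The function names `rps_mask` and `apply_rps` are made up for illustration, and spreading over CPUs 0..N-1 is an assumption, not necessarily the mask used on this host.

```shell
# Hypothetical helper: build an rps_cpus bitmask covering CPUs 0..(n-1).
# The kernel expects comma-separated 32-bit hex words, most significant first.
# The ( ... ) function body runs in a subshell so the variables stay local.
rps_mask() (
    n=$1
    mask=""
    while [ "$n" -gt 0 ]; do
        if [ "$n" -ge 32 ]; then bits=32; else bits=$n; fi
        word=$(printf '%08x' $(( (1 << bits) - 1 )))
        mask="$word${mask:+,$mask}"   # prepend: later words cover higher CPUs
        n=$(( n - bits ))
    done
    echo "$mask"
)

# Hypothetical helper: write the mask to every RX queue of a device.
# The third argument (sysfs root) defaults to /sys; writing there needs root.
apply_rps() (
    dev=$1; mask=$2; sysfs=${3:-/sys}
    for q in "$sysfs/class/net/$dev/queues"/rx-*; do
        if [ -e "$q/rps_cpus" ]; then
            echo "$mask" > "$q/rps_cpus"
        fi
    done
)

# Usage (as root), e.g. spread eth0 over the first 16 CPUs:
#   apply_rps eth0 "$(rps_mask 16)"
```

The setting does not survive a reboot, so it would have to be reapplied from a boot script or similar.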

Disabling hardware offloading also helped bridge throughput, but the random high latency on the VMs did not go away.
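
A sketch of what disabling offloads might look like. The feature list (gro, gso, tso) is an assumption, since the exact offloads turned off are not listed here, and `disable_offloads` is a made-up helper name. By default it only prints the commands.

```shell
# Hypothetical helper: turn off common offloads on each given interface.
# ASSUMPTION: gro/gso/tso are the offloads in question.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually run ethtool.
disable_offloads() (
    for dev in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "ethtool -K $dev gro off gso off tso off"
        else
            ethtool -K "$dev" gro off gso off tso off
        fi
    done
)

# Usage:
#   disable_offloads eth0 eth1 eth2             # prints the commands
#   DRY_RUN=0 disable_offloads eth0 eth1 eth2   # applies them (as root)
```

`ethtool -k eth0` (lowercase k) shows the resulting feature state.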

Right now, the VMs are seeing random high latency (especially to LAN IP addresses in the same subnet, where the latency goes even higher). The host itself shows a consistent latency of 0.1 ms, but the VMs get 1-2 high-latency ping responses every minute. None of the packets are actually dropped, though.
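
To catch the spikes without watching the terminal, something like the following filter can be run inside a VM. It is only a sketch: `ping_spikes` is a made-up name, the 1 ms default threshold is arbitrary, and it assumes the usual Linux `time=X ms` ping output.

```shell
# Hypothetical helper: reads ping output on stdin and prints only the replies
# whose round-trip time exceeds the threshold (in ms, default 1).
ping_spikes() {
    awk -v t="${1:-1}" '
        match($0, /time=[0-9.]+/) {
            # skip the 5 characters of "time=" and convert the rest to a number
            ms = substr($0, RSTART + 5, RLENGTH - 5) + 0
            if (ms > t) print
        }'
}

# Usage (replace <target-ip> with a LAN address in the same subnet);
# -D prefixes each reply with a timestamp on iputils ping:
#   ping -D <target-ip> | ping_spikes 1
```

The timestamps of the spikes can then be lined up against host-side events.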

The server has 64c/128t (128 vCPUs), and the load average does not go above 20; it is usually in the range of 8-9.

Any idea how we could optimize the bridge further so that there are no high latency responses at all?