Crazy network problem (solved!)

Steve Meyers steve-plug at
Sat Apr 21 08:40:21 MDT 2012

I have two separate networks, which I'll call "home" and "rack".  At my 
rack, I have a couple of physical machines that each run a whole bunch 
of VMs using libvirt/KVM.  One physical machine is running 
CentOS 5 (C5), the other is running CentOS 6 (C6).  Most of my VMs are 
running C5, while two are running C6.  I have one VM (running C5) which 
runs OpenVPN to connect between the rack and my home.

I discovered yesterday that while I can ping the C6 VMs and physical 
host, I cannot get UDP or TCP sessions with them to work.  I used 
tcpdump to capture packets at all hops along the way.  For UDP (DNS 
queries), I would send out a packet, and the reply would make it all the 
way back to my machine, and then be discarded before the application saw 
it.  For TCP, I would send out a 
SYN, successfully receive a SYN+ACK, and send an ACK.  The remote side 
would then send a packet (the SSH "welcome" string, in this case), which 
my ssh app would never see.

As far as I could tell initially, everything seemed to be fine with the 
received packets.  The IP header checksum was correct, and the DNS 
transaction ID matched.  The TCP sequence and acknowledgment numbers 
were correct.  There was a TCP checksum, but because Wireshark wasn't 
flagging it, I didn't notice at first that Wireshark wasn't validating 
it at all.  Eventually I noticed that and turned TCP checksum validation 
on, and sure enough the checksum was wrong.  I assume the same was true 
for UDP, although I never actually checked.
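
For anyone who wants to check a captured packet by hand rather than rely 
on Wireshark's setting, here is a minimal Python sketch of the standard 
Internet checksum (RFC 1071) over the IPv4 pseudo-header plus the TCP or 
UDP segment.  The function names are made up for illustration, and it 
assumes you have already pulled the raw addresses and segment bytes out 
of the capture yourself.

```python
import struct

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum used by IP, TCP and UDP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def pseudo_header(src_ip: bytes, dst_ip: bytes, proto: int, length: int) -> bytes:
    """IPv4 pseudo-header: source, destination, zero byte, protocol, L4 length."""
    return src_ip + dst_ip + struct.pack("!BBH", 0, proto, length)

def l4_checksum(src_ip: bytes, dst_ip: bytes, proto: int, segment: bytes) -> int:
    """Checksum for a segment whose checksum field is currently zero."""
    data = pseudo_header(src_ip, dst_ip, proto, len(segment)) + segment
    return (~ones_complement_sum(data)) & 0xFFFF

def l4_checksum_ok(src_ip: bytes, dst_ip: bytes, proto: int, segment: bytes) -> bool:
    """True if the checksum already stored in the segment is valid.

    Summing the pseudo-header and the whole segment, checksum field
    included, must give 0xFFFF.  (A UDP-over-IPv4 checksum field of 0
    means "no checksum" and is not handled specially here.)
    """
    data = pseudo_header(src_ip, dst_ip, proto, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF
```

Use proto 6 for TCP and 17 for UDP.  A packet that left a VM with its 
checksum never filled in would be expected to fail l4_checksum_ok().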

So why was the checksum wrong?  It turns out that the virtio driver in 
C6 has an optimization in it (checksum offloading).  It waits until a 
packet actually hits the wire to calculate the checksum.  Internally to 
the physical host, it never bothers -- since the packet is never going 
on the wire, the driver doesn't calculate the checksum or check it.  The 
problem was that OpenVPN was sending those bad packets to my home 
network without fixing the checksum, since they never hit the physical 
machine's NIC before being encapsulated.
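
To see whether an interface has transmit checksum offload turned on, 
`ethtool -k <iface>` lists its offload features.  Below is a rough 
Python sketch that parses that output; the function names are 
hypothetical, and it assumes ethtool is installed and prints lines of 
the form "tx-checksumming: on".  A commonly cited workaround is to 
disable the feature inside the guest with `ethtool -K <iface> tx off`, 
but I did not try that here, so treat it as an assumption rather than a 
tested fix.

```python
import subprocess

def parse_offload_features(ethtool_output: str) -> dict:
    """Turn `ethtool -k` output into {feature name: True if "on"}."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.partition(":")
        if not sep or not value.strip():
            continue                         # skip the "Features for X:" header
        # Values can look like "on", "off" or "off [fixed]".
        features[name.strip()] = value.split()[0] == "on"
    return features

def tx_checksum_offloaded(iface: str) -> bool:
    """Ask ethtool whether transmit checksum offload is on for iface."""
    result = subprocess.run(
        ["ethtool", "-k", iface],
        capture_output=True, text=True, check=True,
    )
    return parse_offload_features(result.stdout).get("tx-checksumming", False)
```

Calling tx_checksum_offloaded("eth0") inside a VM (the interface name is 
only an example) would report whether that guest leaves checksums for 
something further down the line to fill in.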

Temporary solution?  Move the OpenVPN machine back to the C5 physical 
host, so the packets are forced to go through the NIC.  If this isn't 
fixed soon in C6, I'll eventually need to move OpenVPN to a physical 
machine that doesn't host VMs, as I'll be upgrading my machines to C6.

Fun times! :)

Steve Meyers

