Crazy network problem (solved!)
steve-plug at spwiz.com
Sat Apr 21 08:40:21 MDT 2012
I have two separate networks, which I'll call "home" and "rack". At my
rack, I have a couple of physical machines that each run a whole bunch
of VMs using libvirt/KVM. One physical machine is running CentOS 5
(C5), the other is running CentOS 6 (C6). Most of my VMs are running
C5, while two are running C6. I have one VM (running C5) which runs
OpenVPN to connect the rack to my home.
I discovered yesterday that while I can ping the C6 VMs and the C6
physical host from home, UDP and TCP traffic to them fails. I used
tcpdump to capture packets at every hop along the way. For UDP (DNS
queries), I would send out a packet, and the reply would make it all
the way back to my machine, only to be discarded before the application
saw it. For TCP, I would send out a SYN, successfully receive a
SYN+ACK, and send an ACK. The remote side would then send a data packet
(the SSH "welcome" string, in this case), which my ssh client would
never see.
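
A minimal Python sketch of a probe for the TCP symptom described above:
the handshake completes, but the server's first data packet never
reaches the application. The function name probe_tcp_banner and its
defaults are made up for illustration; this is not the tool used in the
original debugging, which relied on tcpdump and Wireshark.

```python
import socket


def probe_tcp_banner(host, port, timeout=5.0):
    """Return (connected, banner) for a service that speaks first (e.g. SSH).

    (False, None) -> the TCP handshake itself failed
    (True, None)  -> handshake succeeded, but no data reached this process
    (True, bytes) -> handshake succeeded and the banner was delivered
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return (False, None)
    with sock:
        try:
            data = sock.recv(256)
        except OSError:
            # Includes socket.timeout: the kernel never handed us any data.
            return (True, None)
        return (True, data)
```

A result of (True, None) against an SSH port is consistent with the
behaviour above, although a timeout alone does not prove a checksum
problem; a packet capture is still needed to see why the data was
dropped.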
As far as I could tell initially, everything seemed to be fine with the
received packets. The IP header checksum was correct, and the DNS
transaction ID matched. The TCP sequence and acknowledgment numbers
were correct. The TCP checksum field was filled in, and because
Wireshark wasn't flagging it, I didn't notice at first that Wireshark
wasn't validating it at all. Eventually I noticed that and turned TCP
checksum validation on, and sure enough the checksum was wrong. I
assume the same was true for UDP, although I never actually checked.
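
For anyone who wants to check the UDP case by hand, here is a sketch of
the standard Internet checksum (RFC 1071) applied to a UDP segment over
IPv4, including the pseudo-header. The function names are illustrative,
and the code assumes you have already extracted the source and
destination addresses and the raw UDP header plus payload from a
capture.

```python
import socket
import struct


def ones_complement_sum(data):
    """16-bit ones' complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def udp_checksum(src_ip, dst_ip, udp_segment):
    """Checksum for a UDP header+payload carried over IPv4."""
    # Pseudo-header: source, destination, zero byte, protocol 17, UDP length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, len(udp_segment)))
    # The checksum field (bytes 6-7) counts as zero while computing.
    zeroed = udp_segment[:6] + b"\x00\x00" + udp_segment[8:]
    csum = ~ones_complement_sum(pseudo + zeroed) & 0xFFFF
    # A computed value of 0 is transmitted as 0xFFFF in UDP.
    return csum or 0xFFFF


def udp_checksum_ok(src_ip, dst_ip, udp_segment):
    """True if the stored checksum matches, or if none was sent (0)."""
    stored = struct.unpack("!H", udp_segment[6:8])[0]
    if stored == 0:
        return True  # IPv4 allows the sender to skip the UDP checksum
    return stored == udp_checksum(src_ip, dst_ip, udp_segment)
```

The TCP checksum uses the same sum with protocol number 6 and the
checksum field at bytes 16-17 of the TCP header.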
So why was the checksum wrong? It turns out that the virtio driver in
C6 has some optimizations in it. It defers calculating the checksum
until a packet actually hits the wire. For traffic that stays inside
the physical host, it never bothers -- since the packet never goes on
the wire, the checksum is neither calculated nor checked. The problem
was that the OpenVPN VM was on the same C6 physical host, so packets
from the C6 machines reached it without ever passing through the
physical machine's NIC, and OpenVPN forwarded them to my home network
with the checksum still unfixed.
Temporary solution? Move the OpenVPN machine back to the C5 physical
host, so the packets are forced to go through the NIC. If this isn't
fixed soon in C6, I'll eventually need to move OpenVPN to a physical
machine that doesn't host VMs, as I'll be upgrading my machines to C6.
Fun times! :)