After months of debugging, testing, testing, and testing, we’ve found the cause of the TCP packets going missing from our EC2 instances: apparently, it is a Xen issue.
TCP packets flow normally and then, suddenly, they stop. Data from the remote host to the cloud instance still flows, with TCP ACKs coming back from the cloud, but there is no downstream data. Nothing.
tshark captures show both ends’ sockets working as expected, with retransmissions, except that no data from the cloud reaches the remote host. netstat shows thousands of bytes waiting to go out, but the number never decreases.
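A minimal sketch of how the stuck outgoing data could be spotted, assuming that the "bytes outgoing" above correspond to the Send-Q column of netstat -tn. The function name stuck_send_queues is made up for illustration; it only filters netstat output passed on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -tn` output on stdin and print
# "local-address remote-address send-q" for every TCP connection whose
# Send-Q (third column) is greater than zero.
stuck_send_queues() {
    awk '$1 ~ /^tcp/ && $3 + 0 > 0 { print $4, $5, $3 }'
}

# Example usage on the instance (run it a few times; a Send-Q value that
# stays put on the same connection matches the symptom described above):
#   netstat -tn | stuck_send_queues
```

A non-zero Send-Q on its own is normal for a busy connection, so only a value that stays constant across repeated runs is suspicious.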
At first I assumed it was a bad-apple router in the path, but, oddly, the issue did not appear when the traffic went through a rinetd tunnel on a different port.
What I should have realized earlier was that ifconfig clearly shows TX packets dropped, which should have made me check the kernel log file: /var/log/kern.log. This gave the root cause:
xen_netfront: xennet: skb rides the rocket: 19 slots
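The two checks described above can be sketched as small shell helpers. This is only an illustration: the function names are made up, the interface name eth0 and the log path are taken from this post, and the sysfs counter is used here as a stand-in for the "TX dropped" figure that ifconfig prints.

```shell
#!/bin/sh
# Hypothetical helper: count netfront "rides the rocket" messages in a
# kernel log file (defaults to the path mentioned in this post).
count_rocket_drops() {
    log_file="${1:-/var/log/kern.log}"
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'skb rides the rocket' "$log_file"
}

# Hypothetical helper: print the TX dropped counter of an interface,
# read from sysfs instead of parsing ifconfig output.
tx_dropped() {
    iface="${1:-eth0}"
    cat "/sys/class/net/${iface}/statistics/tx_dropped"
}

# Example usage on the affected instance:
#   tx_dropped eth0
#   count_rocket_drops /var/log/kern.log
```

If the drop counter keeps growing while the log fills up with these messages, the instance is most likely hitting the same problem.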
The root cause
It seems to be a bug in the Xen netfront kernel module in kernels > 3.7 (according to this). The upstream discussion on Xen is here. Apparently some traffic produces skbs that need more slots than the module accepts, and the module then drops those packets. According to the Xen mailing list, it seems that the pages holding the data need to be linearized.
Until the Xen upstream issue is fixed and backported to the kernels of the EC2 Linux distributions, a workaround may be used. Disabling TCP segmentation offloading (TSO) and generic segmentation offloading (GSO) does the trick:
sudo ethtool -K eth0 tso off gso off
This may have a performance impact, but so far we haven’t noticed any issues.
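To confirm that the workaround is in effect, the current offload settings can be read back with ethtool -k (lowercase k). The sketch below assumes the interface is eth0, as in the command above; the function name segmentation_offload_status is made up and simply filters ethtool output passed on stdin.

```shell
#!/bin/sh
# Hypothetical helper: reduce `ethtool -k` output (read from stdin) to the
# two top-level lines that report the TSO and GSO state.
segmentation_offload_status() {
    grep -E '^(tcp|generic)-segmentation-offload:'
}

# Example usage on the instance:
#   ethtool -k eth0 | segmentation_offload_status
# Both lines should report "off" once the workaround has been applied.
```

Settings changed with ethtool -K do not survive a reboot, so the command has to be reapplied at boot for as long as the workaround is needed; how to do that depends on the distribution.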
After thinking some black magic was involved, it turns out, once again, that there is always a logical explanation. The strangest part was that the rinetd tunnel didn’t show the same behavior. In retrospect, it must be because its traffic “fingerprint” was different and did not cause the excess skb slots.
If you’ve found this post useful or are struggling with something similar, please leave a comment.
EDIT: Wed, March 4, 2015
It seems that the kernel was updated with a fix, as can be seen here. In the near future I will remove the workaround and see whether the problem has in fact been fixed.
EDIT: Fri, September 4, 2015
I can confirm that the issue is not present anymore. Tested on Ubuntu 15.04.