Dropped TCP packets on EC2 and Xen

After months of debugging and testing, testing, and more testing, we’ve found the cause of TCP packets going missing from our EC2 instances: it is apparently a Xen issue.

Symptoms

TCP packets flow normally and then, suddenly, they stop. Data from the remote host to the cloud instance still flows, and TCP ACKs still come back from the cloud, but no data travels from the cloud to the remote host. Nothing.

tshark captures show the sockets on both ends working as expected, retransmissions included, except that no data from the cloud reaches the remote host. netstat shows thousands of bytes waiting to be sent, and the number never decreases.
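The stuck send queue is easy to spot in netstat output. The following is a minimal sketch, not the exact commands used during the original debugging: stuck_sends is a hypothetical helper name, and it assumes the usual Linux "netstat -tn" column layout, where Send-Q is the third column.

```shell
#!/bin/sh
# Filter `netstat -tn` output (read from stdin) down to TCP sockets whose
# Send-Q (third column) is non-zero, i.e. sockets with unsent or unacked data.
# Prints: Send-Q, local address, remote address.
stuck_sends() {
    awk '$1 ~ /^tcp/ && $3 > 0 { print $3, $4, $5 }'
}

# Usage (run it a few times; a value that never shrinks matches the symptom):
#   netstat -tn | stuck_sends
```

A Send-Q that stays at the same non-zero value across several runs is consistent with the behavior described above.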

At first I assumed it was a bad-apple router in the path, but, as if by magic, the issue did not appear over a rinetd tunnel to a different port.

What I should have realized earlier was that ifconfig clearly shows TX packets dropped, which should have made me check the kernel log, /var/log/kern.log. That log pointed to the root cause.
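Both checks can be scripted. This is a rough sketch under a few assumptions: the interface is eth0, the counter is read from the standard sysfs statistics file rather than parsed out of ifconfig, and SYSFS_NET is a hypothetical override added only so the function can be pointed at another directory. The exact kernel message is not reproduced in this post, so the log helper simply filters on the string "xen".

```shell
#!/bin/sh
# Print the TX-dropped counter for an interface (default: eth0).
# SYSFS_NET is a hypothetical override; it defaults to the real sysfs path.
tx_dropped() {
    iface="${1:-eth0}"
    cat "${SYSFS_NET:-/sys/class/net}/$iface/statistics/tx_dropped"
}

# Print kernel log lines that mention Xen, if the log file is readable.
# The pattern is deliberately broad because the exact message is an unknown here.
xen_log_lines() {
    log="${1:-/var/log/kern.log}"
    if [ -r "$log" ]; then
        grep -i 'xen' "$log"
    fi
}

# Usage:
#   tx_dropped eth0
#   xen_log_lines /var/log/kern.log
```

A counter that keeps growing while connections stall is the hint to go and read the kernel log.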

The root cause

It seems to be a bug in the Xen kernel module in kernels newer than 3.7, and it has been discussed upstream on the Xen mailing list. Apparently some traffic results in packets that need more skb slots, and the kernel module then drops those packets. According to the Xen mailing list, the fix seems to require linearizing the pages that hold the data.

A workaround

Until the upstream Xen issue is fixed and the fix is backported to the kernels in the EC2 Linux distributions, a workaround can be used: disabling TCP segmentation offloading (TSO) and generic segmentation offloading (GSO) does the trick.

This may have a performance impact, but so far we haven’t noticed any issues.
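For reference, both offloads can be switched off with ethtool. The sketch below assumes the interface is eth0 and that it runs as root. The function names are made up for this post, and ETHTOOL is a hypothetical override that lets you preview the command (for example with ETHTOOL=echo) before running the real binary.

```shell
#!/bin/sh
# Disable TCP segmentation offload (TSO) and generic segmentation offload (GSO)
# on one interface. Needs root when it runs the real ethtool.
disable_offloads() {
    iface="${1:-eth0}"
    "${ETHTOOL:-ethtool}" -K "$iface" tso off gso off
}

# Show the current state of the two offloads so the change can be verified.
show_offloads() {
    iface="${1:-eth0}"
    "${ETHTOOL:-ethtool}" -k "$iface" |
        grep -E 'tcp-segmentation-offload|generic-segmentation-offload'
}

# Usage:
#   disable_offloads eth0
#   show_offloads eth0
```

Settings changed this way do not survive a reboot, so the command has to be reapplied from a boot or interface-up script.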

Summary

After thinking some black magic was involved, it seems, once again, that there is always a logical explanation. The strangest part was that the rinetd tunnel didn’t show the same behavior. In retrospect, that must be because its traffic “fingerprint” was different and did not cause the excess skb slots.

If you’ve found this post useful or are struggling with something similar, please leave a comment.

EDIT: Wed, March 4, 2015

It seems that the kernel has been updated with a fix. In the near future I will remove the workaround and see whether the problem has in fact been fixed.

EDIT: Fri, September 4, 2015

I can confirm that the issue is not present anymore. Tested on Ubuntu 15.04.

2 Replies to “Dropped TCP packets on EC2 and Xen”

  1. Thank you. The workaround stopped the logged error messages and the dropped packets. As a note, I placed the ethtool changes at the end of the script:

    /etc/sysconfig/network-scripts/ifup-post

    but if someone has a better location for F20, I’m all ears.

    -Don
