Sometimes network connections to Elasticsearch seem to be in a stuck state. This manifests in nodes being unable to perform queries that require data located on other nodes, and in certain _cat commands failing to work consistently. Specifically, _status calls fail to respond, _cat/shards and _cat/indices calls fail to respond, and _cat/nodes calls respond only intermittently. Queries that require only local responses from the node succeed.
This seems to happen when a system is handling many operations in rapid succession. It does not necessarily require high load, but queries or update operations arriving rapidly after one another seem to trigger it.
This problem appears to be an interaction between a particular kernel version, Xen, and Elasticsearch. The current (at the time of this writing) default Ubuntu distribution used for new hosts on AWS does exhibit the issue, and AWS is based on Xen.
How does it manifest?
You will see TCP connections that appear to be stalled, for example when calling /_cat/shards or /_cat/indices.
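A quick way to observe this from the shell is to probe the endpoints with a hard timeout (a sketch; the endpoint localhost:9200, the 5-second timeout, and the `probe` helper name are assumptions, adjust for your cluster):

```shell
# probe: fetch a URL with a hard timeout and report success or a stall.
probe() {
    if curl -sf --max-time 5 "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "stalled or failed: $1"
    fi
}

# On the suspect node you would run something like:
#   probe http://localhost:9200/_cat/shards    # hangs on affected hosts
#   probe http://localhost:9200/_cat/indices   # hangs on affected hosts
#   probe http://localhost:9200/               # node-local, still answers
```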
Another indication is the occurrence of the "rides the rocket" log message in your kernel log. Either call dmesg or check /var/log/kern.log to see messages like this:
[14349.792093] xen_netfront: xennet: skb rides the rocket: 19 slots
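To check for this automatically, a small helper can grep a kernel log for the message (`check_rocket` is a hypothetical name; the log path varies by distribution):

```shell
# check_rocket: search a kernel log file for the xen_netfront dropped-packet
# message and report whether the host appears affected.
check_rocket() {
    if grep -q "rides the rocket" "$1" 2>/dev/null; then
        echo "affected"
    else
        echo "clean"
    fi
}

# On an Ubuntu host you would typically run:
#   check_rocket /var/log/kern.log
# or capture the kernel ring buffer first:
#   dmesg > /tmp/dmesg.out && check_rocket /tmp/dmesg.out
```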
The cause is a kernel bug in combination with Xen, the hypervisor used in AWS. It was introduced in kernels newer than 3.7.
A very large SKB can span too many pages (more than 16) to be put in the driver ring buffer, which results in the packet being dropped, in the hope that the client will retransmit and the issue will not occur again.
What is an SKB: The socket buffer, or "SKB", is the most fundamental data structure in the Linux networking code. Every packet sent or received is handled using this data structure. SKBs are composed of a linear data buffer, and optionally a set of 1 or more page buffers.
This problem seems to manifest more easily on some AWS boxes due to their higher default MTU of 9000, but the MTU is actually not the deciding factor; having scatter/gather operation mode configured for the network card is what makes the difference.
What is a scatter/gather operation?
The data queue in front of a TCP socket is not divided into the datagrams that will go out onto the network interface. If, however, the network interface that is destined to transmit the packet can perform scatter/gather I/O, the packet need not be assembled into a single chunk, and much of that copying can be avoided. Scatter/gather I/O also enables “zero-copy” transmission of network data directly from user-space buffers.
With the gather feature, you can give the network card a datagram which is broken into pieces at different addresses in memory, which can be references to the original socket buffers. The card will read it from those locations and send it as a single unit.
Without gather (when the hardware requires simple, linear buffers), a datagram has to be prepared as a contiguously allocated byte string, and all the data that belongs to it has to be memcpy'd into place from the buffers that are queued for transmission on the socket.
This feature needs to be supported by your network interface card.
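You can see whether scatter/gather is currently enabled with `ethtool -k`. The parsing below is split into a small helper (`sg_enabled` is an illustrative name) so it can be fed captured output:

```shell
# sg_enabled: read `ethtool -k <iface>` output on stdin and report (via exit
# status) whether scatter-gather is switched on for the interface.
sg_enabled() {
    grep -q "^scatter-gather: on"
}

# On a real host (eth0 is an assumption, substitute your interface):
#   ethtool -k eth0 | sg_enabled && echo "sg on" || echo "sg off"
```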
Upgrade your kernel
The fix is part of Ubuntu's linux (3.13.0-46.75) trusty; urgency=low release. The mainline kernel contains the fix as of 3.18.1.
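To check whether the running kernel is new enough, you can compare version strings with `sort -V` (`kernel_is_fixed` is a hypothetical helper; the threshold strings come from the versions above):

```shell
# kernel_is_fixed <running> <fixed>: succeed if the running kernel release
# string is at least the version that contains the fix.
kernel_is_fixed() {
    # sort -V orders version strings; if the fixed version sorts first
    # (or equal), the running kernel is new enough.
    lowest=$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)
    [ "$lowest" = "$2" ]
}

# On an Ubuntu trusty host you would run something like:
#   kernel_is_fixed "$(uname -r)" "3.13.0-46" && echo "patched" || echo "vulnerable"
```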
Disable scatter/gather and TSO
Run ethtool to disable scatter/gather as well as TCP and generic segmentation offloading:
sudo ethtool -K eth0 sg off
sudo ethtool -K eth0 tso off gso off
Our EC2 cluster seemed to work after running only the first ethtool command.
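Note that settings changed via ethtool do not survive a reboot or interface restart. One way to persist them on Ubuntu trusty (a sketch, assuming /etc/network/interfaces manages eth0 via ifupdown) is a post-up hook:

```
# /etc/network/interfaces (fragment)
auto eth0
iface eth0 inet dhcp
    # Re-apply the workaround whenever the interface comes up.
    post-up ethtool -K eth0 sg off
    post-up ethtool -K eth0 tso off gso off
```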
Both of these may result in reduced performance (for one, the MTU is set to a lower value, but you also lose all the advantages of scatter/gather and the hardware offload optimizations of your NIC), so we highly recommend upgrading your kernel instead. Also, if you change these settings, you should restart your Elasticsearch process as well.
See http://packages.ubuntu.com/trusty/kernel/linux-image-3.13.0-48-generic and the changelog mentioning the fix.