We had been running a new, stable 3-node cluster on ES 5.1.1 for about 3 weeks, then upgraded to 5.2.2 after coming across a documented bug in 5.1.1. Twice in the 2 days since the upgrade we've hit the following stack trace, which kills the process on each node in the cluster one after the other.
[2017-03-08T15:06:25,228][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search-02] fatal error in thread [elasticsearch[search-02][bulk][T#3]], exiting
java.lang.StackOverflowError: null
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
...
The frame appears over 1,000 times in the full logged stack trace. Judging by the timing in the logs, the error first occurs on one node and kills that node's process. The logs on the other two nodes then show connection errors while trying to reach the dead node, and eventually the same stack appears on the remaining node(s), killing them as well and putting the cluster into a red state. My guess is that a single offending request kills the nodes one after another as it is retried.
Unfortunately I haven't been able to isolate this to one particular client request. Every request in our client logs that failed when the cluster went down has rerun successfully after restarting the nodes. Each time it has happened, the cluster has recovered and returned to a green state once the dead processes were restarted.
It certainly looks like infinite recursion, or some pathological case that generates more than a thousand frames. The thread name suggests it is occurring during a bulk indexing operation. We index in bulk batches of 100 documents at a time.
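For reference, our indexing path is roughly equivalent to the sketch below, using the 5.x low-level REST client. The host, index name, document type, and document shape here are placeholders rather than our real ones; this is just to show the kind of _bulk request we send.

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class BulkIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Build an NDJSON body: one action line plus one source line per document,
            // batched 100 documents at a time (as described above).
            StringBuilder body = new StringBuilder();
            for (int i = 0; i < 100; i++) {
                // "my-index" and "doc" are placeholder index/type names.
                body.append("{\"index\":{\"_index\":\"my-index\",\"_type\":\"doc\",\"_id\":\"")
                    .append(i).append("\"}}\n");
                body.append("{\"field\":\"value ").append(i).append("\"}\n");
            }
            NStringEntity entity = new NStringEntity(body.toString(), ContentType.APPLICATION_JSON);
            // Send the batch as a single _bulk request.
            Response response = client.performRequest(
                    "POST", "/_bulk", Collections.<String, String>emptyMap(), entity);
            System.out.println(response.getStatusLine());
        }
    }
}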
We are using:
Ubuntu 14.04
Elasticsearch 5.2.2 (from https://artifacts.elastic.co/packages/5.x/apt)
OpenJDK 8 (u111-b14-3~14.04.1 from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa)
I am going to see if I can isolate this to a particular request, but does anyone have an idea of what might be causing this, or any other suggestions on how to debug it?
Thanks.