5.2.2 OpenJDK 8 UnmodifiableCollection StackOverflowError

We'd been running a new, stable 3-node cluster on ES 5.1.1 for about three weeks, then upgraded to 5.2.2 after hitting a documented bug in 5.1.1. Twice in the two days since the upgrade we've encountered the stack trace below; when it occurs, it successively kills the process on each node in the cluster, one after the other.

[2017-03-08T15:06:25,228][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search-02] fatal error in thread [elasticsearch[search-02][bulk][T#3]], exiting
java.lang.StackOverflowError: null
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
...

The frame appears over 1,000 times in the full logged stack trace. Judging by the timestamps, the error first occurs on one node and kills that node's process. The other two nodes then log connection errors while trying to reach the dead node, and eventually the same stack trace appears in their logs as well, killing them too and putting the cluster into a red state. My guess is that an offending request kills the nodes one by one as it is retried.

Unfortunately I haven't been able to isolate this to one particular client request. Every request in our client logs that failed when the cluster went down has rerun successfully after restarting the nodes. Each time it has happened, the cluster has recovered to a green state once the dead processes were restarted.

It sure looks like infinite recursion, or some sort of pathological case that produces more than a thousand frames. The thread name seems to imply that it is occurring during a bulk indexing operation. We index in bulk batches of 100 documents at a time.
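For what it's worth, the shape of the trace (the same contains frame repeated) is what you get when an unmodifiable wrapper ends up wrapping another unmodifiable wrapper many layers deep, since each layer's contains() just delegates to the collection it wraps. The sketch below is only a guess at the mechanism, not the actual Elasticsearch code path; the class name and the loop count are made up for illustration.

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;

// Hypothetical illustration only: shows how repeatedly re-wrapping a
// collection with Collections.unmodifiableCollection() produces the
// repeated UnmodifiableCollection.contains frames seen in the log.
public class UnmodifiableWrapperOverflow {
    public static void main(String[] args) {
        Collection<String> docs = new ArrayList<>();
        docs.add("doc-1");

        // Pretend something (e.g. a retry loop) re-wraps the same
        // collection on every pass instead of wrapping it once.
        for (int i = 0; i < 100_000; i++) {
            docs = Collections.unmodifiableCollection(docs);
        }

        // Each wrapper's contains() calls the wrapped collection's
        // contains(), so this one call needs ~100,000 stack frames and
        // throws java.lang.StackOverflowError at
        // java.util.Collections$UnmodifiableCollection.contains.
        System.out.println(docs.contains("doc-1"));
    }
}

If something like that is happening, the nesting would build up gradually rather than from a single call, which would fit with no individual request reproducing it.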

We are using:
Ubuntu 14.04
Elasticsearch 5.2.2 (from https://artifacts.elastic.co/packages/5.x/apt)
OpenJDK 8 (u111-b14-3~14.04.1 from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa)

I am going to see if I can isolate this to a particular request, but does anyone have an idea of what might be causing the above, or any other suggestions on how to debug it?

Thanks.

Do you have Monitoring installed to see if it highlights anything? Having that set up as a secondary cluster would be useful given the disruptive nature of the error!

It turns out this was not caused by any specific request, but by what looks like a memory-leak-ish bug, which I have submitted here.
