We had been running a new, stable 3-node cluster on ES 5.1.1 for about 3 weeks, then upgraded to 5.2.2 after coming across a documented bug in 5.1.1. Twice in the 2 days since the upgrade we've hit the following stack trace, which kills the process on each node in the cluster one after the other.
[2017-03-08T15:06:25,228][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search-02] fatal error in thread [elasticsearch[search-02][bulk][T#3]], exiting
java.lang.StackOverflowError: null
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
at java.util.Collections$UnmodifiableCollection.contains(Collections.java:1032) ~[?:1.8.0_111]
...
The frame appears over 1,000 times in the full logged stack trace. Judging by the timing in the logs, the error first occurs on one node and kills that node's process. The logs on the other two nodes then show connection errors while trying to reach the dead node, and eventually the same stack appears on the remaining node(s), killing them as well and putting the cluster into a red state. My guess is that a single offending request kills the nodes one after another as it is retried.
Unfortunately I haven't been able to isolate this to one particular client request. Every request in our client logs that failed when the cluster went down has rerun successfully after restarting the nodes. Each time it has happened, the cluster has recovered and returned to a green state once the dead processes were restarted.
It certainly looks like infinite recursion, or some pathological case that generates more than a thousand frames. The thread name suggests it is occurring during a bulk indexing operation. We index in bulk batches of 100 documents at a time.
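For reference, our indexing path is roughly equivalent to the sketch below, using the 5.x low-level REST client. The host, index name, document type, and document shape here are placeholders rather than our real ones; this is just to show the kind of _bulk request we send.

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.util.Collections;

public class BulkIndexSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Build an NDJSON body: one action line plus one source line per document,
            // batched 100 documents at a time (as described above).
            StringBuilder body = new StringBuilder();
            for (int i = 0; i < 100; i++) {
                // "my-index" and "doc" are placeholder index/type names.
                body.append("{\"index\":{\"_index\":\"my-index\",\"_type\":\"doc\",\"_id\":\"")
                    .append(i).append("\"}}\n");
                body.append("{\"field\":\"value ").append(i).append("\"}\n");
            }
            NStringEntity entity = new NStringEntity(body.toString(), ContentType.APPLICATION_JSON);
            // Send the batch as a single _bulk request.
            Response response = client.performRequest(
                    "POST", "/_bulk", Collections.<String, String>emptyMap(), entity);
            System.out.println(response.getStatusLine());
        }
    }
}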
We are using:
Ubuntu 14.04
Elasticsearch 5.2.2 (from https://artifacts.elastic.co/packages/5.x/apt)
OpenJDK 8 (u111-b14-3~14.04.1 from https://launchpad.net/~openjdk-r/+archive/ubuntu/ppa)
I am going to see if I can isolate this to a particular request, but does anyone have an idea of what might be causing this, or any other suggestions on how to debug it?
Thanks.