ElasticSearch server lock up

simonw_2 · February 11, 2014, 3:29pm

Those commits look good to me! I'd be super curious what you see in the
logs especially coming from this:

logger.warn("Searcher was released twice", new
ElasticSearchIllegalStateException("Double release"));

The bulk index does it contain delete by query or so?

simon

On Tuesday, February 11, 2014 1:53:02 PM UTC+1, Jay Modi wrote:

Simon, prior to this post I decided to try a custom build from the 0.90.11
tag with the following commits cherry picked onto my branch:
ad1097f1ba109d6cb235ba541251ba63abb27c16
b4ec18814b3eeb35d948c01abec3e04745f57458
93e9d2146e77f6c0523875b93c768ab7f81cfe04
6056d2cb5553e6277b023e6860739847fbd95bb7

I tried to minimize changes from the latest released version to the subset
that deal with changes for managing references. I'll see how this works.

Also, we were able to get some log information (client side) to find out
more details. The problems started occurring around 4:16am EST and
continued until 4:45am EST. We noticed issues started with a bulk index
request (not sure if that helps at all).

On Tuesday, February 11, 2014 4:21:12 AM UTC-5, simonw wrote:

Jay, first of all good that this prevents the server from going into an
infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn
clean -DskipTests packge) and deploy the artifact that is in
target/release/ - we added some stuff to prevent this can you maybe give
it a go if that fixes your problem?

simon

On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:

I took the latest patch from the Lucene issue and patched the 4.6.0 tag
and started using that jar in an instance (running 0.90.8). It looks like
that instance hit the condition that previously would have caused the
infinite loop. Our logs filled up with exception messages about failing to
acquire a searcher. I have created a gist of a portion of the log file with
all of the failed to acquire messages
es acquire · GitHub

This appears to have gone on for about 15 to 30 minutes and then it
looks as though a flush occurred and the errors went away, see the
following gist: es flush · GitHub

Doing some log analysis I have received these counts of log messages:

Count Message
272,190 [Collector 3] [elements][2] failed to acquire searcher, source
search
2 [Collector 3] [elements][2] failed to acquire searcher,
source search_factory
1 [Collector 3] [elements][2] failed to prepare/warm

1 [Collector 3] [elements][2] Failed to close current
SearcherManager

It is interesting that it failed to prepare/warm because we do not
explicitly call any index warmer.

Is there something to be done here so the logs do not fill up with these
messages? We save about 550mb of logs and we churned through that much in
about 15 minutes when this happened, so I do not have log message from the
beginning of this event.
On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the infos here! That is really helpful! Just to keep
you updated, I committed a fix for the infinite loop to lucene 5 min ago (
[LUCENE-5436] RefrenceManager#accquire can result in infinite loop if manager resource is abused outside of the manager - ASF JIRA) and we are working
on porting the fixes to ES as we speak. We are adding additional checks to
ES core to prevent this as much as possible. We will keep you updated.
Nevertheless I guess upgrading to the latest version 0.90.11 would be good
for you there are lots of sneaky things that are fixed!

simon

On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

On our case were were running on CentOS 6.4

We aren't deleting by query.

We are running parent child queries (although not a lot of them and
maybe not at the time this happened). We do regularly run nested queries.

We didn't see any exceptions in the logs.

Not sure what you mean by embedding Elasticsearch, but we were
running a 3 machine Elasticsearch cluster in Amazon and using no-data nodes
to connect to it using the Java API.

We have not reproduced the problem

It happened while running 90.7, but we haven't upgraded past that
yet.

We are not calling flush or updating engine settings.

On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:

Hey folks,

we are trying to get to the bottom of this and I'd want to have some
more infos here. Can you provide us with more insight in what you are doing
with ES. What I'd be interested in is:

are you using delete by query

are you running parent child queries

are you seeing any exceptions in the logs

are you embedding elasticsearch

can you reproduce the problem and if so what are you doing to do
so.

can you tell me the latest version you have see this happening? I
could see some problems in 0.90.7 that are fixed in 0.90.8 though.

are you forcefully flushing ie. calling flush via the API or
update engine settings of any kind?

simon

On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:

Additionally, I took a heap dump from the ES 0.90.7 instance and I
believe I see why the code was stuck in acquire.

I looked at all of the SearcherManager instance and the index
readers since it looks like that is what the code is trying to increment a
count in when it acquires one. There were 24 readers in total associated
with SearcherManager instances. Every reader except one had a refCount of
1, one had a refCount of 0. Now if there is a race condition while
the code is in acquire where the refCount goes down to 0 (reader closes),
while the acquire method is being called the loop will run forever as
tryIncRef in a IndexReader will always return false if the refCount is not
greater than 0.

Hope that makes sense. FWIW my understanding of the code could be
wrong and something else could be going on; if clarification or more
information from the heap dump/jstack is needed, please let me know.

On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:

No we are not running any plugins. Our application uses the Java
API to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with
JDK6. Restarting "fixes" it but we only run a single node so when this
happens everything with our application stops working.

On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:

On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi jay...@gmail.comwrote:

Coincidentally we hit a similar issue on Windows yesterday with
0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search
threads were in ReferenceManager.acquire(). Looking at the Lucene code for
acquire there is a "busy wait" while it tries to increment a reference. The
system has stayed in this state and subsequent jstack dumps show the same
behavior with all search threads in acquire. Prior to the lockup a index
had completed and the lockup with a few searches and count queries.

This looks bad. Are you using any particular plugin that may
acquire/release searchers?

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bf9bb845-c30a-47f6-928c-f1a814c0e06e%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jaymode · February 12, 2014, 1:52am

I'll definitely post back with what we find in the logs with the updated
version.

The bulk indexes do not contain a delete by query; we're just converting
some Java objects to a XContentBuilder and sending them to be indexed like
so:
BulkRequestBuilder bulkRequest = client.prepareBulk();
for (1 ... 1000) {
XContentBuilder builder = convertJavaObjectToJson()
IndexRequestBuilder request = client.prepareIndex(index,
type).setSource(builder)
bulkRequest.add(request)
}
BulkResponse response = bulkRequest.setRefresh(false).execute().actionGet();

On Tuesday, February 11, 2014 10:29:05 AM UTC-5, simonw wrote:

Those commits look good to me! I'd be super curious what you see in the
logs especially coming from this:

logger.warn("Searcher was released twice", new
ElasticSearchIllegalStateException("Double release"));

The bulk index does it contain delete by query or so?

simon

On Tuesday, February 11, 2014 1:53:02 PM UTC+1, Jay Modi wrote:

Simon, prior to this post I decided to try a custom build from the
0.90.11 tag with the following commits cherry picked onto my branch:
ad1097f1ba109d6cb235ba541251ba63abb27c16
b4ec18814b3eeb35d948c01abec3e04745f57458
93e9d2146e77f6c0523875b93c768ab7f81cfe04
6056d2cb5553e6277b023e6860739847fbd95bb7

I tried to minimize changes from the latest released version to the
subset that deal with changes for managing references. I'll see how this
works.

Also, we were able to get some log information (client side) to find out
more details. The problems started occurring around 4:16am EST and
continued until 4:45am EST. We noticed issues started with a bulk index
request (not sure if that helps at all).

On Tuesday, February 11, 2014 4:21:12 AM UTC-5, simonw wrote:

Jay, first of all good that this prevents the server from going into an
infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn
clean -DskipTests packge) and deploy the artifact that is in
target/release/ - we added some stuff to prevent this can you maybe give
it a go if that fixes your problem?

simon

On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:

I took the latest patch from the Lucene issue and patched the 4.6.0 tag
and started using that jar in an instance (running 0.90.8). It looks like
that instance hit the condition that previously would have caused the
infinite loop. Our logs filled up with exception messages about failing to
acquire a searcher. I have created a gist of a portion of the log file with
all of the failed to acquire messages
es acquire · GitHub

This appears to have gone on for about 15 to 30 minutes and then it
looks as though a flush occurred and the errors went away, see the
following gist: es flush · GitHub

Doing some log analysis I have received these counts of log messages:

Count Message
272,190 [Collector 3] [elements][2] failed to acquire searcher, source
search
2 [Collector 3] [elements][2] failed to acquire searcher,
source search_factory
1 [Collector 3] [elements][2] failed to prepare/warm

1 [Collector 3] [elements][2] Failed to close current
SearcherManager

It is interesting that it failed to prepare/warm because we do not
explicitly call any index warmer.

Is there something to be done here so the logs do not fill up with
these messages? We save about 550mb of logs and we churned through that
much in about 15 minutes when this happened, so I do not have log message
from the beginning of this event.
On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the infos here! That is really helpful! Just to
keep you updated, I committed a fix for the infinite loop to lucene 5 min
ago ([LUCENE-5436] RefrenceManager#accquire can result in infinite loop if manager resource is abused outside of the manager - ASF JIRA) and we are
working on porting the fixes to ES as we speak. We are adding additional
checks to ES core to prevent this as much as possible. We will keep you
updated. Nevertheless I guess upgrading to the latest version 0.90.11 would
be good for you there are lots of sneaky things that are fixed!

simon

On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

On our case were were running on CentOS 6.4

We aren't deleting by query.

We are running parent child queries (although not a lot of them and
maybe not at the time this happened). We do regularly run nested queries.

We didn't see any exceptions in the logs.

Not sure what you mean by embedding Elasticsearch, but we were
running a 3 machine Elasticsearch cluster in Amazon and using no-data nodes
to connect to it using the Java API.

We have not reproduced the problem

It happened while running 90.7, but we haven't upgraded past that
yet.

We are not calling flush or updating engine settings.

On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:

Hey folks,

we are trying to get to the bottom of this and I'd want to have some
more infos here. Can you provide us with more insight in what you are doing
with ES. What I'd be interested in is:

are you using delete by query

are you running parent child queries

are you seeing any exceptions in the logs

are you embedding elasticsearch

can you reproduce the problem and if so what are you doing to do
so.

can you tell me the latest version you have see this happening? I
could see some problems in 0.90.7 that are fixed in 0.90.8 though.

are you forcefully flushing ie. calling flush via the API or
update engine settings of any kind?

simon

On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:

Additionally, I took a heap dump from the ES 0.90.7 instance and I
believe I see why the code was stuck in acquire.

I looked at all of the SearcherManager instance and the index
readers since it looks like that is what the code is trying to increment a
count in when it acquires one. There were 24 readers in total associated
with SearcherManager instances. Every reader except one had a refCount of
1, one had a refCount of 0. Now if there is a race condition while
the code is in acquire where the refCount goes down to 0 (reader closes),
while the acquire method is being called the loop will run forever as
tryIncRef in a IndexReader will always return false if the refCount is not
greater than 0.

Hope that makes sense. FWIW my understanding of the code could be
wrong and something else could be going on; if clarification or more
information from the heap dump/jstack is needed, please let me know.

On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:

No we are not running any plugins. Our application uses the Java
API to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with
JDK6. Restarting "fixes" it but we only run a single node so when this
happens everything with our application stops working.

On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:

On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi jay...@gmail.comwrote:

Coincidentally we hit a similar issue on Windows yesterday with
0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search
threads were in ReferenceManager.acquire(). Looking at the Lucene code for
acquire there is a "busy wait" while it tries to increment a reference. The
system has stayed in this state and subsequent jstack dumps show the same
behavior with all search threads in acquire. Prior to the lockup a index
had completed and the lockup with a few searches and count queries.

This looks bad. Are you using any particular plugin that may
acquire/release searchers?

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ac2ac1bb-249d-4369-b295-20561d9063f0%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

simonw_2 · February 27, 2014, 11:17am

We have several fixes relate to this in 0.90.12 and 1.0.1 - can you guys
try and upgrade? would be awesome if you could report back!

simon

On Thursday, January 23, 2014 1:47:31 AM UTC+1, JBartak wrote:

We had an interesting problem today while running a 3 data node
Elasticsearch 90.7 cluster inside Amazon (using m1.xlarge). One of the
data nodes went to 100% CPU and stayed there. Attempts to use the Java API
to talk to the cluster hung. This cluster has generally been stable and
there wasn't anything unusual going on that I was aware of when the problem
occurred. There was nothing written to the Elasticsearch logs on any of
the instances.

Elasticsearch Head's main screen said the cluster health was green, but no
nodes got displayed on the screen. The Cluster State tab did display all
the nodes.

As far as traffic goes, I'm not really sure what exactly was going on.
There were a bunch of indexing requests going on just prior to the lockup,
but I'm not sure what else was happening.

I've attached a JStack dump from the instance. (I had to run "jstack -F"
to get the dump.)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/bf658590-3215-48ea-be15-8eaca1b8fcf6%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Topic		Replies	Views
Cluster locks up Elasticsearch	9	1675	July 6, 2017
Lack of memory? Elasticsearch	11	804	July 6, 2017
Stuck thread problem? Elasticsearch	17	1751	July 6, 2017
Sudden 100% CPU usage by ElasticSearch Elasticsearch	8	934	July 6, 2017
ElasticSearch crashes OS? Elasticsearch	13	475	July 6, 2017

ElasticSearch server lock up

Related topics