Elasticsearch server lockup


(John Bartak) #1

We had an interesting problem today while running a 3-data-node
Elasticsearch 0.90.7 cluster inside Amazon (on m1.xlarge instances). One of
the data nodes went to 100% CPU and stayed there, and attempts to talk to
the cluster via the Java API hung. This cluster has generally been stable,
and nothing unusual that I was aware of was going on when the problem
occurred. Nothing was written to the Elasticsearch logs on any of the
instances.

Elasticsearch Head's main screen said the cluster health was green, but no
nodes were displayed on the screen. The Cluster State tab did display all
the nodes.

As far as traffic goes, I'm not really sure what was happening. There was a
bunch of indexing requests just prior to the lockup, but I'm not sure what
else was going on.

I've attached a jstack dump from the instance. (I had to run "jstack -F"
to get the dump.)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b46b4569-652f-4ab6-9a16-d0b1f52fa92d%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Adrien Grand) #2

The 100% CPU usage and the fact that the application was completely
unresponsive make me think of a garbage collection issue. It could be that
you had memory pressure and the GC entered a never-ending full collection.
Do you monitor garbage collection? Garbage collection statistics from right
before the application became unresponsive would help confirm that this is
the issue.



(John Bartak) #3

We don't currently keep garbage collection statistics, but I'll add them
just in case this ever comes up again.

For what it's worth, the memory settings for the Java process
were: -Xms9006m -Xmx9006m -Xss256k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
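In case it helps anyone else, HotSpot can log the GC statistics Adrien asked about if flags along these lines are added next to the ones above (these are standard JDK 7 HotSpot options; the log path is just an example):

```text
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/elasticsearch/gc.log
```

A runaway CMS full collection would then show up as back-to-back "Full GC" entries right before the node stops responding.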



(Jay Modi) #4

Coincidentally, we hit a similar issue on Windows yesterday with 0.90.8 and
JDK 7u45. I took a jstack dump and noticed that all of my search threads
were in ReferenceManager.acquire(). Looking at the Lucene code for acquire,
there is a "busy wait" loop while it tries to increment a reference count.
The system has stayed in this state, and subsequent jstack dumps show the
same behavior, with all search threads in acquire. Prior to the lockup an
index operation had completed, and the lockup happened while a few searches
and count queries were running.
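The loop in question has roughly this shape (a paraphrased sketch of the Lucene ReferenceManager.acquire() of that era, not the exact source; G is the managed reference type, e.g. an IndexSearcher for a SearcherManager):

```java
// Paraphrased sketch, not the actual Lucene code.
public final G acquire() {
    G ref;
    do {
        if ((ref = current) == null) {
            throw new AlreadyClosedException("this ReferenceManager is closed");
        }
        // tryIncRef(ref) fails whenever ref's count has already reached 0.
        // If 'current' is never swapped for a live reference, this loop
        // never exits -- a busy wait that pins the CPU.
    } while (!tryIncRef(ref));
    return ref;
}
```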

I have also seen similar issues reported in previous versions of
elasticsearch:
https://groups.google.com/forum/#!msg/elasticsearch/A_4ZgfE5xF0/NH8qMASIhvUJ
https://groups.google.com/forum/#!searchin/elasticsearch/ElasticSeach$20full$20CPU$20usage/elasticsearch/XCkWAPBQTvs/VWtpdbHEsEEJ



(John Bartak) #5

We're running on CentOS 6.3 and JDK 7u25. My jstack dump also shows lots of
threads in ReferenceManager.acquire(). Next time this happens, I'll
definitely get the GC statistics and the hot threads.
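For reference, the hot threads can be captured over the REST API (the endpoint exists in the 0.90.x line; the host, port, and parameter value here are illustrative):

```shell
# Snapshot the busiest threads on every node; a node spinning in
# ReferenceManager.acquire() should show that frame at the top.
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
```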



(Jay Modi) #6

I'm not sure that GC is to blame here, but the statistics couldn't hurt,
and the hot threads will be the most telling, IMO.

When we have seen GC problems, the JVM typically hits 100% CPU usage on a
single core, not 100% total usage of all 4 cores like we are seeing on our
system and, I believe, on yours.



(Adrien Grand) #7


This looks bad. Are you using any particular plugin that may
acquire/release searchers?



(Jay Modi) #8

No, we are not running any plugins. Our application uses the Java API to
query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with JDK 6.
Restarting "fixes" it, but we only run a single node, so when this happens
everything in our application stops working.



(Jay Modi) #9

Additionally, I took a heap dump from the ES 0.90.7 instance, and I believe
I can see why the code was stuck in acquire.

I looked at all of the SearcherManager instances and their index readers,
since that looks like what the code is trying to increment a count on when
it acquires one. There were 24 readers in total associated with
SearcherManager instances. Every reader except one had a refCount of 1; the
remaining one had a refCount of 0. Now, if there is a race where the
refCount drops to 0 (the reader closes) while the acquire method is being
called, the loop will run forever, because tryIncRef in an IndexReader
always returns false if the refCount is not greater than 0.

Hope that makes sense. FWIW, my understanding of the code could be wrong
and something else could be going on; if clarification or more information
from the heap dump/jstack is needed, please let me know.
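The refCount analysis above can be condensed into a small self-contained model (my own sketch, not the Lucene source): once a reader's count is 0, tryIncRef can never succeed, so an unbounded acquire-style retry loop spins forever.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the suspected race -- NOT the actual Lucene code.
// It mimics the contract of IndexReader.tryIncRef(), which only increments
// a reference count that is still positive, plus the acquire()-style retry
// loop around it.
public class RefCountDemo {

    // A reader whose refCount already dropped to 0, as seen in the heap dump.
    static final AtomicInteger refCount = new AtomicInteger(0);

    // CAS-increment the count, but only while it is still > 0.
    static boolean tryIncRef() {
        int count;
        while ((count = refCount.get()) > 0) {
            if (refCount.compareAndSet(count, count + 1)) {
                return true;
            }
        }
        return false; // count is 0: the reference is closed for good
    }

    public static void main(String[] args) {
        // The real acquire() loop has no bound; it is bounded here so the
        // demo terminates instead of pinning a core at 100%.
        int attempts = 0;
        boolean acquired = false;
        while (!acquired && attempts < 1_000_000) {
            acquired = tryIncRef();
            attempts++;
        }
        System.out.println("acquired=" + acquired + " after " + attempts + " attempts");
    }
}
```

Running it prints `acquired=false after 1000000 attempts`; the real loop has no bound, which matches a search thread pinned at 100% CPU.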



(simonw-3) #10

Hey folks,

we are trying to get to the bottom of this, and I'd like to gather some
more info. Can you give us more insight into what you are doing with ES?
What I'd be interested in is:

  • are you using delete by query?
  • are you running parent/child queries?
  • are you seeing any exceptions in the logs?
  • are you embedding elasticsearch?
  • can you reproduce the problem, and if so, what do you do to trigger it?
  • what is the latest version you have seen this happen on? I can see some
    problems in 0.90.7 that are fixed in 0.90.8.
  • are you forcefully flushing, i.e. calling flush via the API, or updating
    engine settings of any kind?

simon



(John Bartak) #11

In our case, we were running on CentOS 6.4:

  • We aren't deleting by query.
  • We are running parent/child queries (although not a lot of them, and
    maybe not at the time this happened). We do regularly run nested queries.
  • We didn't see any exceptions in the logs.
  • Not sure what you mean by embedding Elasticsearch, but we were running a
    3-machine Elasticsearch cluster in Amazon and using no-data nodes to
    connect to it via the Java API.
  • We have not reproduced the problem.
  • It happened while running 0.90.7, but we haven't upgraded past that yet.
  • We are not calling flush or updating engine settings.



(simonw-3) #12

Thanks so much for the info here, that is really helpful! Just to keep you
updated: I committed a fix for the infinite loop to Lucene five minutes ago
(https://issues.apache.org/jira/browse/LUCENE-5436), and we are porting the
fix to ES as we speak. We are also adding additional checks to ES core to
prevent this as much as possible. We will keep you updated. Nevertheless,
I'd suggest upgrading to the latest version, 0.90.11; lots of sneaky things
have been fixed!

simon



(Jay Modi) #13

Hi Simon,

  • Yes, we use delete by query.

  • We do not use the hasChild and hasParent queries, but we do use filtered
    queries on parent ids like this:

      QueryBuilder query = QueryBuilders.matchQuery( "someField", value );

      FilterBuilder filter = FilterBuilders.boolFilter().must(
              FilterBuilders.termFilter( "_parent", id ) );

      QueryBuilder filteredQuery = QueryBuilders.filteredQuery( query, filter );

  • No exceptions in the logs.
  • Yes, we embed elasticsearch in our application (running as a single node).
  • No, we cannot reproduce on demand; it has occurred during normal runtime
    of our application.
  • We have seen the issue in 0.20.2 (JDK 6) and 0.90.8 (JDK 7u45).
  • No, we are not forcefully flushing.



(simonw-3) #14

thanks for the info though! Here is the pull request to throw an exception
instead of going into an infinite loop:
https://github.com/elasticsearch/elasticsearch/pull/5043

I assume you are not indexing into this embedded instance?

simon

On Thursday, February 6, 2014 9:23:19 PM UTC+1, Jay Modi wrote:

Hi Simon,

  • Yes we use the delete by query

  • We do not use the hasChild and hasParent queries but use filtered
    queries on parent ids like this:

      QueryBuilder query = QueryBuilders.matchQuery( "someField", value );

      FilterBuilder filter = FilterBuilders.boolFilter().must( FilterBuilders.termFilter( "_parent", id ) );

      QueryBuilder filteredQuery = QueryBuilders.filteredQuery( query, filter );

  • No Exceptions in the logs
  • Yes we embed elasticsearch in our application (running as a single node)
  • No we cannot reproduce on demand; it has occurred during normal runtime
    of our application
  • We have seen the issue in 0.20.2 (JDK6) and 0.90.8 (JDK7u45)
  • No not forcefully flushing



(simonw-2) #15

Hey folks,

am I right that you both use scan / scroll? Can you tell me a bit about how
you use it and if there could be long intervals between fetching results?

simon


(John Bartak) #16

We definitely use scan/scroll a lot. It primarily gets used as part of our indexing process. I would anticipate that we generally only get back a single page of results. I would expect the calls to scrollResponse.getHits to happen in fairly quick succession.

We typically use something that looks like the following:

    TimeValue scrollKeepAlive = new TimeValue(60000);

    SearchRequestBuilder searchRequest = client.client().prepareSearch(indices) // always either 1 or 2 indices
            .setSearchType(SearchType.SCAN)
            .setTypes(type)
            .setScroll(scrollKeepAlive)
            .setFilter(filter)
            .setSize(500);
    addFieldsToSearchRequest(searchRequest, fieldList);

    org.elasticsearch.action.search.SearchResponse scrollResp = searchRequest.execute().actionGet();

    if (scrollResp.status() != RestStatus.OK) {
        String msg = "Error running query. Status=" + scrollResp.status();
        throw new SearcherException(msg);
    }

    while (true) {
        scrollResp = client.client().prepareSearchScroll(scrollResp.getScrollId()).setScroll(scrollKeepAlive).execute().actionGet();

        if (scrollResp.status() != RestStatus.OK) {
            String msg = "Error running query. Status=" + scrollResp.status();
            throw new SearcherException(msg);
        }

        if (scrollResp.getHits().hits().length == 0)
            break;

        for (SearchHit hit : scrollResp.getHits()) {
            // DO SOMETHING WITH THE HIT (that doesn't take very long)
        }
    }



(Jay Modi) #17

Yes, we do index into the embedded instance; our workload typically
consists of incoming data and an initial indexing pass, and after that most
operations are updates and retrievals. Is there a specific reason you ask?

As for scan/scroll, I don't see anywhere that we use it.



(Jay Modi) #18

I took the latest patch from the Lucene issue, applied it to the 4.6.0 tag,
and started using that jar in an instance (running 0.90.8). It looks like
that instance hit the condition that previously would have caused the
infinite loop. Our logs filled up with exception messages about failing to
acquire a searcher. I have created a gist of a portion of the log file with
all of the failed to acquire messages:
https://gist.github.com/jaymode/a73a9695e31d7ec53f3f

This appears to have gone on for about 15 to 30 minutes, and then it looks
as though a flush occurred and the errors went away; see the following
gist: https://gist.github.com/jaymode/421919a3471f8358d5d4

Doing some log analysis, I get these counts of log messages:

  Count    Message
  272,190  [Collector 3] [elements][2] failed to acquire searcher, source search
  2        [Collector 3] [elements][2] failed to acquire searcher, source search_factory
  1        [Collector 3] [elements][2] failed to prepare/warm
  1        [Collector 3] [elements][2] Failed to close current SearcherManager
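For reference, tallies like the ones above can be produced with standard shell tools. A sketch against synthetic log lines (the file name and the exact lines here are made up for the demo, not taken from the real log):

```shell
# Create a few synthetic log lines standing in for the real elasticsearch log.
cat > es.log <<'EOF'
[Collector 3] [elements][2] failed to acquire searcher, source search
[Collector 3] [elements][2] failed to acquire searcher, source search
[Collector 3] [elements][2] failed to acquire searcher, source search_factory
EOF

# Count each distinct "failed to acquire" variant, most frequent first.
grep -o 'failed to acquire searcher, source [a-z_]*' es.log | sort | uniq -c | sort -rn
```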

It is interesting that it failed to prepare/warm, because we do not
explicitly call any index warmer.

Is there something that can be done so the logs do not fill up with these
messages? We keep about 550 MB of logs, and we churned through that much in
about 15 minutes when this happened, so I do not have log messages from the
beginning of the event.
On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the info here! That is really helpful! Just to keep
you updated, I committed a fix for the infinite loop to Lucene 5 minutes ago
(https://issues.apache.org/jira/browse/LUCENE-5436) and we are working on
porting the fixes to ES as we speak. We are adding additional checks to ES
core to prevent this as much as possible. We will keep you updated.
Nevertheless, I guess upgrading to the latest version, 0.90.11, would be
good for you; there are lots of sneaky things that are fixed!

simon

On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

In our case, we were running on CentOS 6.4.

  • We aren't deleting by query.
  • We are running parent child queries (although not a lot of them and
    maybe not at the time this happened). We do regularly run nested queries.
  • We didn't see any exceptions in the logs.
  • Not sure what you mean by embedding ElasticSearch, but we were running
    a 3 machine ElasticSearch cluster in Amazon and using no-data nodes to
    connect to it using the Java API.
  • We have not reproduced the problem
  • It happened while running 90.7, but we haven't upgraded past that yet.
  • We are not calling flush or updating engine settings.



(simonw-2) #19

Jay, first of all, good that this prevents the server from going into an
infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn
clean -DskipTests package) and deploy the artifact that is in
target/release/? We added some stuff to prevent this; can you give it a go
and see if that fixes your problem?

simon

On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:

I took the latest patch from the Lucene issue and patched the 4.6.0 tag
and started using that jar in an instance (running 0.90.8). It looks like
that instance hit the condition that previously would have caused the
infinite loop. Our logs filled up with exception messages about failing to
acquire a searcher. I have created a gist of a portion of the log file with
all of the failed to acquire messages
https://gist.github.com/jaymode/a73a9695e31d7ec53f3f

This appears to have gone on for about 15 to 30 minutes and then it looks
as though a flush occurred and the errors went away, see the following
gist: https://gist.github.com/jaymode/421919a3471f8358d5d4

Doing some log analysis I have received these counts of log messages:

Count Message
272,190 [Collector 3] [elements][2] failed to acquire searcher, source
search
2 [Collector 3] [elements][2] failed to acquire searcher,
source search_factory
1 [Collector 3] [elements][2] failed to prepare/warm

1 [Collector 3] [elements][2] Failed to close current
SearcherManager

It is interesting that it failed to prepare/warm because we do not
explicitly call any index warmer.

Is there something to be done here so the logs do not fill up with these
messages? We save about 550mb of logs and we churned through that much in
about 15 minutes when this happened, so I do not have log message from the
beginning of this event.
On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the infos here! That is really helpful! Just to keep
you updated, I committed a fix for the infinite loop to lucene 5 min ago (
https://issues.apache.org/jira/browse/LUCENE-5436) and we are working on
porting the fixes to ES as we speak. We are adding additional checks to ES
core to prevent this as much as possible. We will keep you updated.
Nevertheless I guess upgrading to the latest version 0.90.11 would be good
for you there are lots of sneaky things that are fixed!

simon

On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

On our case were were running on CentOS 6.4

  • We aren't deleting by query.
  • We are running parent child queries (although not a lot of them and
    maybe not at the time this happened). We do regularly run nested queries.
  • We didn't see any exceptions in the logs.
  • Not sure what you mean by embedding ElasticSearch, but we were running
    a 3 machine ElasticSearch cluster in Amazon and using no-data nodes to
    connect to it using the Java API.
  • We have not reproduced the problem
  • It happened while running 90.7, but we haven't upgraded past that yet.
  • We are not calling flush or updating engine settings.

On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:

Hey folks,

we are trying to get to the bottom of this and I'd want to have some
more infos here. Can you provide us with more insight in what you are doing
with ES. What I'd be interested in is:

  • are you using delete by query
  • are you running parent child queries
  • are you seeing any exceptions in the logs
  • are you embedding elasticsearch
  • can you reproduce the problem and if so what are you doing to do so.
  • can you tell me the latest version you have see this happening? I
    could see some problems in 0.90.7 that are fixed in 0.90.8 though.
  • are you forcefully flushing ie. calling flush via the API or update
    engine settings of any kind?

simon

On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:

Additionally, I took a heap dump from the ES 0.90.7 instance and I
believe I see why the code was stuck in acquire.

I looked at all of the SearcherManager instance and the index readers
since it looks like that is what the code is trying to increment a count in
when it acquires one. There were 24 readers in total associated with
SearcherManager instances. Every reader except one had a refCount of 1, one
had a refCount of 0. Now if there is a race condition while the code
is in acquire where the refCount goes down to 0 (reader closes), while the
acquire method is being called the loop will run forever as tryIncRef in a
IndexReader will always return false if the refCount is not greater than 0.

Hope that makes sense. FWIW my understanding of the code could be
wrong and something else could be going on; if clarification or more
information from the heap dump/jstack is needed, please let me know.

On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:

No we are not running any plugins. Our application uses the Java API
to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with
JDK6. Restarting "fixes" it but we only run a single node so when this
happens everything with our application stops working.

On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:

On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi jay...@gmail.com wrote:

Coincidentally, we hit a similar issue on Windows yesterday with
0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search
threads were in ReferenceManager.acquire(). Looking at the Lucene code for
acquire, there is a "busy wait" while it tries to increment a reference. The
system has stayed in this state, and subsequent jstack dumps show the same
behavior with all search threads in acquire. Prior to the lockup an indexing
operation had completed, and the lockup occurred during a few searches and
count queries.

This looks bad. Are you using any particular plugin that may
acquire/release searchers?

--
Adrien Grand



(Jay Modi) #20

Simon, prior to this post I decided to try a custom build from the 0.90.11
tag with the following commits cherry picked onto my branch:
ad1097f1ba109d6cb235ba541251ba63abb27c16
b4ec18814b3eeb35d948c01abec3e04745f57458
93e9d2146e77f6c0523875b93c768ab7f81cfe04
6056d2cb5553e6277b023e6860739847fbd95bb7

I tried to minimize changes from the latest released version to the subset
that deals with managing references. I'll see how this works.

Also, we were able to get some log information (client side) to find out
more details. The problems started occurring around 4:16am EST and
continued until 4:45am EST. We noticed the issues started with a bulk index
request (not sure if that helps at all).

On Tuesday, February 11, 2014 4:21:12 AM UTC-5, simonw wrote:

Jay, first of all, good that this prevents the server from going into an
infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn
clean -DskipTests package) and deploy the artifact that is in
target/release/? We added some changes to prevent this; can you give it a
go and see if that fixes your problem?

simon

On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:

I took the latest patch from the Lucene issue and patched the 4.6.0 tag
and started using that jar in an instance (running 0.90.8). It looks like
that instance hit the condition that previously would have caused the
infinite loop. Our logs filled up with exception messages about failing to
acquire a searcher. I have created a gist of a portion of the log file with
all of the failed to acquire messages
https://gist.github.com/jaymode/a73a9695e31d7ec53f3f

This appears to have gone on for about 15 to 30 minutes; then it looks
as though a flush occurred and the errors went away. See the following
gist: https://gist.github.com/jaymode/421919a3471f8358d5d4

Doing some log analysis, I found these counts of log messages:

Count    Message
272,190  [Collector 3] [elements][2] failed to acquire searcher, source search
2        [Collector 3] [elements][2] failed to acquire searcher, source search_factory
1        [Collector 3] [elements][2] failed to prepare/warm
1        [Collector 3] [elements][2] Failed to close current SearcherManager
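
Counts like these can be tallied with a standard grep/uniq pipeline. A sketch (the file name `sample.log` and the exact message phrasing are illustrative, based on the gists):

```shell
# Build a tiny sample log with the kind of lines described above.
printf '%s\n' \
  '[Collector 3] [elements][2] failed to acquire searcher, source search' \
  '[Collector 3] [elements][2] failed to acquire searcher, source search' \
  '[Collector 3] [elements][2] failed to acquire searcher, source search_factory' \
  > sample.log

# Extract each distinct "failed to acquire" variant and count occurrences,
# largest count first.
grep -o 'failed to acquire searcher, source [a-z_]*' sample.log \
  | sort | uniq -c | sort -rn
```

Run against the real log file, this prints one line per message variant with its count, which is how a table like the one above can be produced.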

It is interesting that it failed to prepare/warm because we do not
explicitly call any index warmer.

Is there something to be done here so the logs do not fill up with these
messages? We keep about 550 MB of logs and churned through that much in
about 15 minutes when this happened, so I do not have log messages from the
beginning of this event.

On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the info here! That is really helpful! Just to keep
you updated, I committed a fix for the infinite loop to Lucene 5 minutes ago
(https://issues.apache.org/jira/browse/LUCENE-5436) and we are working
on porting the fixes to ES as we speak. We are adding additional checks to
ES core to prevent this as much as possible. We will keep you updated.
Nevertheless, I guess upgrading to the latest version, 0.90.11, would be
good for you; there are lots of sneaky things that are fixed!
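
The shape of such a fix can be sketched as follows (simplified, with assumed names; this is not the actual LUCENE-5436 patch): instead of retrying tryIncRef() unconditionally, acquire gives up when the reference that failed to increment is still the current one.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a fail-fast acquire loop over a swappable ref-counted resource.
class SafeAcquire {
    static final class Ref {
        final AtomicInteger refCount;
        Ref(int initial) { refCount = new AtomicInteger(initial); }
        boolean tryIncRef() {
            int c;
            do {
                c = refCount.get();
                if (c <= 0) return false; // already closed
            } while (!refCount.compareAndSet(c, c + 1));
            return true;
        }
    }

    private volatile Ref current;
    SafeAcquire(Ref initial) { current = initial; }

    Ref acquire() {
        while (true) {
            Ref ref = current;
            if (ref.tryIncRef()) {
                return ref;
            }
            // The reference we saw is already closed. If nothing newer has
            // been swapped in, retrying would spin forever (the lockup seen
            // in this thread), so fail fast instead.
            if (ref == current) {
                throw new IllegalStateException("this reference manager is closed");
            }
        }
    }

    public static void main(String[] args) {
        SafeAcquire manager = new SafeAcquire(new Ref(0)); // already-closed reader
        try {
            manager.acquire();
        } catch (IllegalStateException e) {
            System.out.println("acquire failed instead of spinning");
        }
    }
}
```

The key difference from the buggy busy wait is the `ref == current` check: retrying is only useful if the manager has swapped in a new reference in the meantime.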

simon

On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

In our case we were running on CentOS 6.4.

  • We aren't deleting by query.
  • We are running parent child queries (although not a lot of them and
    maybe not at the time this happened). We do regularly run nested queries.
  • We didn't see any exceptions in the logs.
  • Not sure what you mean by embedding ElasticSearch, but we were
    running a 3-machine ElasticSearch cluster in Amazon and using no-data
    nodes to connect to it via the Java API.
  • We have not reproduced the problem.
  • It happened while running 0.90.7, but we haven't upgraded past that yet.
  • We are not calling flush or updating engine settings.

On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:

Hey folks,

we are trying to get to the bottom of this and I'd like to have some
more info here. Can you provide us with more insight into what you are doing
with ES? What I'd be interested in is:

  • are you using delete by query
  • are you running parent child queries
  • are you seeing any exceptions in the logs
  • are you embedding elasticsearch
  • can you reproduce the problem, and if so, what are you doing to do so?
  • can you tell me the latest version you have seen this happening on? I
    could see some problems in 0.90.7 that are fixed in 0.90.8, though.
  • are you forcefully flushing, i.e. calling flush via the API, or updating
    engine settings of any kind?

simon

On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:

Additionally, I took a heap dump from the ES 0.90.7 instance and I
believe I see why the code was stuck in acquire.

I looked at all of the SearcherManager instances and the index readers,
since it looks like that is what the code is trying to increment a count in
when it acquires one. There were 24 readers in total associated with
SearcherManager instances. Every reader except one had a refCount of 1; one
had a refCount of 0. Now if there is a race condition where the refCount
goes down to 0 (the reader closes) while the acquire method is being called,
the loop will run forever, as tryIncRef in an IndexReader will always return
false if the refCount is not greater than 0.

Hope that makes sense. FWIW my understanding of the code could be
wrong and something else could be going on; if clarification or more
information from the heap dump/jstack is needed, please let me know.

On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:

No we are not running any plugins. Our application uses the Java API
to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with
JDK 6. Restarting "fixes" it, but we only run a single node, so when this
happens everything in our application stops working.

On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:

On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi jay...@gmail.com wrote:

Coincidentally, we hit a similar issue on Windows yesterday with
0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search
threads were in ReferenceManager.acquire(). Looking at the Lucene code for
acquire, there is a "busy wait" while it tries to increment a reference. The
system has stayed in this state, and subsequent jstack dumps show the same
behavior with all search threads in acquire. Prior to the lockup an indexing
operation had completed, and the lockup occurred during a few searches and
count queries.

This looks bad. Are you using any particular plugin that may
acquire/release searchers?

--
Adrien Grand

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a83dd334-3fab-4ba0-9668-59210b98137b%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.