Those commits look good to me! I'd be super curious what you see in the
logs especially coming from this:
logger.warn("Searcher was released twice", new
Does the bulk index contain delete by query or anything like that?
On Tuesday, February 11, 2014 1:53:02 PM UTC+1, Jay Modi wrote:
Simon, prior to this post I decided to try a custom build from the 0.90.11
tag with the following commits cherry picked onto my branch:
I tried to minimize changes from the latest released version to the subset
that deal with changes for managing references. I'll see how this works.
Also, we were able to get some log information (client side) to find out
more details. The problems started occurring around 4:16am EST and
continued until 4:45am EST. We noticed issues started with a bulk index
request (not sure if that helps at all).
On Tuesday, February 11, 2014 4:21:12 AM UTC-5, simonw wrote:
Jay, first of all good that this prevents the server from going into an
infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn
clean -DskipTests package) and deploy the artifact that is in
target/release/? We added some safeguards to prevent this; can you give it
a go and see if that fixes your problem?
On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:
I took the latest patch from the Lucene issue and patched the 4.6.0 tag
and started using that jar in an instance (running 0.90.8). It looks like
that instance hit the condition that previously would have caused the
infinite loop. Our logs filled up with exception messages about failing to
acquire a searcher. I have created a gist of a portion of the log file with
all of the "failed to acquire" messages.
This appears to have gone on for about 15 to 30 minutes and then it
looks as though a flush occurred and the errors went away, see the
following gist: https://gist.github.com/jaymode/421919a3471f8358d5d4
Doing some log analysis, I got the following counts of log messages:
272,190 [Collector 3] [elements] failed to acquire searcher, source
2 [Collector 3] [elements] failed to acquire searcher,
1 [Collector 3] [elements] failed to prepare/warm
1 [Collector 3] [elements] Failed to close current
It is interesting that it failed to prepare/warm because we do not
explicitly call any index warmer.
Is there something to be done here so the logs do not fill up with these
messages? We keep about 550 MB of logs and we churned through that much in
about 15 minutes when this happened, so I do not have log messages from the
beginning of this event.
On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:
Thanks so much for the information here! That is really helpful! Just to keep
you updated, I committed a fix for the infinite loop to Lucene 5 minutes ago (
https://issues.apache.org/jira/browse/LUCENE-5436) and we are working
on porting the fixes to ES as we speak. We are adding additional checks to
ES core to prevent this as much as possible. We will keep you updated.
Nevertheless, I guess upgrading to the latest version, 0.90.11, would be good
for you; there are lots of sneaky things that have been fixed!
On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:
In our case we were running on CentOS 6.4
- We aren't deleting by query.
- We are running parent child queries (although not a lot of them and
maybe not at the time this happened). We do regularly run nested queries.
- We didn't see any exceptions in the logs.
- Not sure what you mean by embedding ElasticSearch, but we were
running a 3 machine ElasticSearch cluster in Amazon and using no-data nodes
to connect to it using the Java API.
- We have not reproduced the problem
- It happened while running 0.90.7, but we haven't upgraded past that
- We are not calling flush or updating engine settings.
On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:
we are trying to get to the bottom of this and I'd like to have some
more information here. Can you provide us with more insight into what you are
doing with ES? What I'd be interested in is:
- are you using delete by query
- are you running parent child queries
- are you seeing any exceptions in the logs
- are you embedding elasticsearch
- can you reproduce the problem, and if so, what are you doing to trigger it
- can you tell me the latest version you have seen this happening on? I
could see some problems in 0.90.7 that are fixed in 0.90.8 though.
- are you forcefully flushing, i.e. calling flush via the API, or
updating engine settings of any kind?
On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:
Additionally, I took a heap dump from the ES 0.90.7 instance and I
believe I see why the code was stuck in acquire.
I looked at all of the SearcherManager instances and the index
readers, since the reader is what the code tries to increment a reference
count on when it acquires one. There were 24 readers in total associated
with SearcherManager instances. Every reader except one had a refCount of
1; one had a refCount of 0. Now, if there is a race where the refCount goes
down to 0 (the reader closes) while the acquire method is being called, the
loop will run forever, because tryIncRef in an IndexReader always returns
false if the refCount is not greater than 0.
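To make that concrete, here is a small self-contained model (a simplified
sketch of the behavior, not the actual Lucene source) of how an acquire loop
that spins on tryIncRef can run forever once the refCount reaches 0:

import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of the behavior described above: acquire() spins on
// tryIncRef(), and tryIncRef() only succeeds while the count is still > 0.
public class AcquireLoopSketch {

    static final AtomicInteger refCount = new AtomicInteger(1);

    // Mirrors the idea of IndexReader.tryIncRef(): increment only if > 0.
    static boolean tryIncRef() {
        int count;
        while ((count = refCount.get()) > 0) {
            if (refCount.compareAndSet(count, count + 1)) {
                return true;
            }
        }
        return false; // reader already closed (refCount == 0)
    }

    // Models the pre-fix busy wait: if the current reference was closed
    // (refCount dropped to 0) but never swapped out, tryIncRef() keeps
    // returning false and this loop never exits.
    static void acquire() {
        while (!tryIncRef()) {
            // busy wait; with refCount stuck at 0 this spins forever
        }
    }

    public static void main(String[] args) {
        refCount.set(0); // simulate the race: the reader closes first
        acquire();       // never returns, matching the jstack dumps
    }
}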
Hope that makes sense. FWIW my understanding of the code could be
wrong and something else could be going on; if clarification or more
information from the heap dump/jstack is needed, please let me know.
On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:
No we are not running any plugins. Our application uses the Java
API to query elasticsearch.
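A minimal sketch of that kind of Java API usage against a 0.90.x cluster
follows; the cluster name and query are placeholders for illustration (only
the index name "elements" is taken from the log messages above):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class ClientNodeQueryExample {
    public static void main(String[] args) {
        // Join the cluster as a no-data ("client") node; cluster name is assumed.
        Node node = nodeBuilder()
                .clusterName("my-cluster")
                .client(true)
                .node();
        Client client = node.client();

        // Run a simple search against the "elements" index seen in the logs above.
        SearchResponse response = client.prepareSearch("elements")
                .setQuery(QueryBuilders.matchAllQuery())
                .execute()
                .actionGet();
        System.out.println("hits: " + response.getHits().getTotalHits());

        node.close();
    }
}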
We also had this happen on another instance running ES 0.20.2 with
JDK6. Restarting "fixes" it, but we only run a single node, so when this
happens everything in our application stops working.
On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:
On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi jay...@gmail.com wrote:
Coincidentally we hit a similar issue on Windows yesterday with
0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search
threads were in ReferenceManager.acquire(). Looking at the Lucene code for
acquire there is a "busy wait" while it tries to increment a reference. The
system has stayed in this state and subsequent jstack dumps show the same
behavior with all search threads in acquire. Prior to the lockup an index
had completed, and the lockup coincided with a few searches and count queries.
This looks bad. Are you using any particular plugin that may