Search thread pools not released

I'm having an issue on one of my clusters running version 1.1.1 with 8
master/data nodes, unicast discovery, connecting via the Java
TransportClient. A few REST queries are also executed by monitoring
services.

Currently there is almost no traffic on this cluster. The few queries that
are running are either small test queries or large facet queries (which are
infrequent; the longest runs for 16 seconds). What I am noticing is that
the active search thread count on some nodes never decreases, and when it
reaches the limit, the entire cluster stops accepting requests. The current
max is the default (3 x 8).

http://search06:9200/_cat/thread_pool

search05 1.1.1.5 0 0 0 0 0 0 19 0 0
search07 1.1.1.7 0 0 0 0 0 0 0 0 0
search08 1.1.1.8 0 0 0 0 0 0 0 0 0
search09 1.1.1.9 0 0 0 0 0 0 0 0 0
search11 1.1.1.11 0 0 0 0 0 0 0 0 0
search06 1.1.1.6 0 0 0 0 0 0 2 0 0
search10 1.1.1.10 0 0 0 0 0 0 0 0 0
search12 1.1.1.12 0 0 0 0 0 0 0 0 0
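
For what it's worth, `_cat/thread_pool?v` will print column headers, which
makes the output above easier to read. Here is a small sketch of parsing it
programmatically (Python; it assumes the default 1.x column order of host,
ip, then active/queue/rejected for the bulk, index, and search pools, so
search.active is the seventh numeric column):

```python
# Flag nodes whose search pool shows active threads in default
# `_cat/thread_pool` output (host ip bulk.* index.* search.*).
def stuck_search_nodes(cat_output):
    stuck = {}
    for line in cat_output.strip().splitlines():
        fields = line.split()
        host = fields[0]
        numbers = [int(n) for n in fields[2:]]
        # bulk (3 columns) + index (3 columns) precede search.active
        if numbers[6] > 0:
            stuck[host] = numbers[6]
    return stuck

sample = """\
search05 1.1.1.5 0 0 0 0 0 0 19 0 0
search07 1.1.1.7 0 0 0 0 0 0 0 0 0
search06 1.1.1.6 0 0 0 0 0 0 2 0 0
"""
print(stuck_search_nodes(sample))  # {'search05': 19, 'search06': 2}
```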

In this case, both search05 and search06 have an active thread count that
does not change. If I run a query against search05, the search will respond
quickly and the total number of active search threads does not increase.

So I have two related issues:

  1. The active search thread count never decreases.
  2. The cluster stops accepting requests when one node becomes unstable.

I have seen the issue intermittently in the past, but it has started again,
and cluster restarts do not fix the problem. At the log level, there have
been issues with the cluster state not propagating: not every node will
acknowledge the cluster state ([discovery.zen.publish] received cluster
state version NNN), and the master will log a timeout (awaiting all nodes
to process published state NNN timed out, timeout 30s). The nodes
themselves are fine and can ping each other with no issues. I am currently
not seeing any log errors alongside the thread pool issue, so perhaps the
cluster state problem is a red herring.

Cheers,

Ivan

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCx91LEXP0NxbgC4-mVR27DX%2BuOxyor5cqiM6ie2JExBw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Can anything be seen in a thread dump that looks like stray queries?
Maybe some facet queries hung while resources ran low and never returned?

Jörg

On Sun, Jul 6, 2014 at 9:59 PM, Ivan Brusic ivan@brusic.com wrote:


Forgot to mention the thread dumps. I have taken them before, but not this
time. Most of the blocked search threads are stuck in log4j.

https://gist.github.com/brusic/fc12536d8e5706ec9c32

I do have a socket appender to logstash (Elasticsearch logs in
Elasticsearch!). Let me debug this connection.

--
Ivan

On Sun, Jul 6, 2014 at 1:55 PM, joergprante@gmail.com
<joergprante@gmail.com> wrote:


Yes, the socket appender blocks. Maybe the log4j async appender can do
better ...

http://ricardozuasti.com/2009/asynchronous-logging-with-log4j/
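
A sketch of what that could look like, as a log4j 1.x XML fragment
(hypothetical: the AsyncAppender cannot be configured through the
properties format, and the host, port, and buffer values here are
placeholders, not your actual settings):

```xml
<!-- Wrap the blocking SocketAppender in an AsyncAppender so a stalled
     logstash connection cannot block search threads. Placeholder values. -->
<appender name="logstash" class="org.apache.log4j.net.SocketAppender">
  <param name="RemoteHost" value="logstash-host"/>
  <param name="Port" value="4560"/>
  <param name="ReconnectionDelay" value="10000"/>
</appender>

<appender name="async" class="org.apache.log4j.AsyncAppender">
  <param name="BufferSize" value="512"/>
  <!-- Blocking=false drops events when the buffer fills instead of
       blocking the logging thread -->
  <param name="Blocking" value="false"/>
  <appender-ref ref="logstash"/>
</appender>
```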

Jörg

On Sun, Jul 6, 2014 at 11:22 PM, Ivan Brusic ivan@brusic.com wrote:


Still analyzing all the logs and dumps that I have accumulated so far, but
it looks like the blocking socket appender might be the issue. After a
node exhausts all of its search threads, the TransportClient will still
issue requests to it, even though the other nodes have no issues. After a
while, the client application will also be blocked waiting for
Elasticsearch to return.

I removed the logging for now and will re-implement it with a service that
reads directly from the duplicate file-based log. Although I have a timeout
specific to my query, my recollection of the search code is that it only
applies to Lucene's TimeLimitingCollector (it's been a while since I looked
at that code). The next step should be to add an explicit timeout
to actionGet(). Is the default basically an unbounded wait?

It might be a challenge for the cluster engine to avoid delegating queries
to overloaded servers.

Cheers,

Ivan

On Sun, Jul 6, 2014 at 2:36 PM, joergprante@gmail.com
<joergprante@gmail.com> wrote:


Yes, actionGet() can be traced down to AbstractQueuedSynchronizer's
acquireSharedInterruptibly(-1) call

http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/AbstractQueuedSynchronizer.html#acquireSharedInterruptibly(int)

in org.elasticsearch.common.util.concurrent.BaseFuture, which waits
forever until interrupted. But there are twin methods, like actionGet(long
millis), that time out.
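
The blocking-forever versus bounded-wait distinction can be sketched with a
plain future (Python here for brevity; the analogy to actionGet() vs.
actionGet(millis) is mine, not Elasticsearch code):

```python
import concurrent.futures

# A future that never completes, standing in for a request sent to a
# node whose search pool is exhausted.
never_done = concurrent.futures.Future()

try:
    # Analogue of actionGet(millis): give up after 0.1 s. The no-arg
    # result() would park this thread indefinitely, like actionGet().
    never_done.result(timeout=0.1)
    outcome = "completed"
except concurrent.futures.TimeoutError:
    outcome = "timed out"

print(outcome)  # timed out
```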

Jörg

On Mon, Jul 7, 2014 at 7:53 PM, Ivan Brusic ivan@brusic.com wrote:


Yeah, I already traced it back myself. I have been using Elasticsearch for
years and have only ever set query timeouts. I need to re-architect a way
to incorporate client-based timeouts.

Had two Elasticsearch meltdowns this weekend, after a long period of
stability. Both different and unique!

--
Ivan

On Mon, Jul 7, 2014 at 1:50 PM, joergprante@gmail.com
<joergprante@gmail.com> wrote:

Yes, actionGet() can be traced down to AbstractQueueSynchronizer's
acquireSharedInterruptibly(-1) call

http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/AbstractQueuedSynchronizer.html#acquireSharedInterruptibly(int)

in org.elasticsearch.common.util.concurrent.BaseFuture which "waits"
forever until interrupted. But there are twin methods, like actionGet(long
millis), that time out.

Jörg

On Mon, Jul 7, 2014 at 7:53 PM, Ivan Brusic ivan@brusic.com wrote:

Still analyzing all the logs and dumps that I have accumulated so far,
but it looks like the blocking socket appender might be the issue. After
that node exhausts all of its search threads, the TransportClient will
still issue requests to it, although other nodes do not have issues. After
a while, the client application will also be blocked waiting for
Elasticsearch to return.

I removed logging for now, will re-implement it with a service that reads
directly from the duplicate file-based log. Although I have a timeout
specific for my query, my recollection of the search code is that it only
applies to the Lucene LimitedCollector (its been a while since I looked at
that code). The next step should be to add an explicit timeout
to actionGet(). Is the default basically no wait?

It might be a challenge for the cluster engine to not delegate queries to
overloaded servers.

Cheers,

Ivan

On Sun, Jul 6, 2014 at 2:36 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

Yes, socket appender blocks. Maybe the async appender of log4j can do
better ...

http://ricardozuasti.com/2009/asynchronous-logging-with-log4j/

Jörg

On Sun, Jul 6, 2014 at 11:22 PM, Ivan Brusic ivan@brusic.com wrote:

Forgot to mention the thread dumps. I have taken them before, but not
this time. Most of the block search thead pools are stuck in log4j.

https://gist.github.com/brusic/fc12536d8e5706ec9c32

I do have a socket appender to logstash (elasticsearch logs in
elasticsearch!). Let me debug this connection.

--
Ivan

On Sun, Jul 6, 2014 at 1:55 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

Can be anything seen in a thread dump what looks like stray queries?
Maybe some facet queries hanged while resources went low and never
returned?

Jörg

On Sun, Jul 6, 2014 at 9:59 PM, Ivan Brusic ivan@brusic.com wrote:

Having an issue on one of my clusters running version 1.1.1 with 8
master/data nodes, unicast, connecting via the Java TransportClient. A few
REST queries are executed via monitoring services.

Currently there is almost no traffic on this cluster. The few queries
that are currently running are either small test queries or large facet
queries (which are infrequent; the longest runs for 16 seconds). What I
am noticing is that the active search thread count on some nodes never
decreases, and when it reaches the limit, the entire cluster will stop
accepting requests. The current max is the default (3 x 8).

http://search06:9200/_cat/thread_pool

search05 1.1.1.5 0 0 0 0 0 0 19 0 0
search07 1.1.1.7 0 0 0 0 0 0 0 0 0
search08 1.1.1.8 0 0 0 0 0 0 0 0 0
search09 1.1.1.9 0 0 0 0 0 0 0 0 0
search11 1.1.1.11 0 0 0 0 0 0 0 0 0
search06 1.1.1.6 0 0 0 0 0 0 2 0 0
search10 1.1.1.10 0 0 0 0 0 0 0 0 0
search12 1.1.1.12 0 0 0 0 0 0 0 0 0

In this case, both search05 and search06 have an active thread count
that does not change. If I run a query against search05, the search will
respond quickly and the total number of active search threads does not
increase.

So I have two related issues:

  1. the active thread count does not decrease
  2. the cluster will not accept requests if one node becomes unstable.

I have seen the issue intermittently in the past, but it has started
again, and cluster restarts do not fix the problem. At the log level,
there have been issues with the cluster state not propagating. Not every
node will acknowledge the cluster state ([discovery.zen.publish ]
received cluster state version NNN) and the master would log a timeout
(awaiting all nodes to process published state NNN timed out, timeout 30s).
The nodes themselves are fine and can ping each other with no issues. I am
currently not seeing any log errors related to the thread pool issue, so
perhaps it is a red herring.

Cheers,

Ivan

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCx91LEXP0NxbgC4-mVR27DX%2BuOxyor5cqiM6ie2JExBw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Hi, were you able to find a solution to this problem?

I have a 5 node cluster (2 masters, 3 members), and one of the nodes is pegged because of a search queue that won't go down. I think the problem is that I am making S3 repo calls to a repo that has an underscore in its name, and S3 doesn't like this.

The search queue on my pegged node is at 6257 and it won't drop.

I am going to change the repo setup and calls to remove the underscore, but I can't get the one node to drop its search queue, and hence the whole cluster has become completely unresponsive to search.

I tried to stop/start the node, but the search queue just moves to another node. I am on version 1.7.3 and I'm willing to upgrade the cluster, but I need to dislodge this search queue first so I can test the upgrade before rolling into production.

Thanks for any advice.