Having an issue on one of my clusters running version 1.1.1 with 8
master/data nodes, unicast, connecting via the Java TransportClient. A few
REST queries are executed via monitoring services.
Currently there is almost no traffic on this cluster. The few queries that are running are either small test queries or large facet queries (which are infrequent; the longest runs for 16 seconds). What I am noticing is that the active search thread count on some nodes never decreases, and when it reaches the limit, the entire cluster stops accepting requests. The current max is the default (3 x 8).
In this case, both search05 and search06 have an active thread count that
does not change. If I run a query against search05, the search will respond
quickly and the total number of active search threads does not increase.
So I have two related issues:
1. The active thread count does not decrease.
2. The cluster will not accept requests if one node becomes unstable.
I have seen the issue intermittently in the past, but it has started again and cluster restarts do not fix the problem. At the log level, there have been issues with the cluster state not propagating. Not every node will acknowledge the cluster state ([discovery.zen.publish] received cluster state version NNN) and the master logs a timeout (awaiting all nodes to process published state NNN timed out, timeout 30s). The nodes themselves are fine and can ping each other with no issues. Currently not seeing any log errors related to the thread pool issue, so perhaps it is a red herring.
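For reference, a minimal sketch of polling those per-node search pool counters from the Java client, assuming the 1.x Java API and an already-connected Client instance (the exact builder method names here are from memory and worth double-checking against your client version):

import org.elasticsearch.action.admin.cluster.node.stats.NodeStats;
import org.elasticsearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.threadpool.ThreadPoolStats;

class SearchPoolWatcher {
    // Print active/queue/rejected counts for the "search" pool on every node.
    static void dumpSearchPool(Client client) {
        NodesStatsResponse stats = client.admin().cluster().prepareNodesStats()
                .clear()                     // only request what we enable below
                .setThreadPool(true)         // include thread pool stats
                .execute().actionGet();
        for (NodeStats node : stats.getNodes()) {
            for (ThreadPoolStats.Stats pool : node.getThreadPool()) {
                if ("search".equals(pool.getName())) {
                    System.out.printf("%s search: active=%d queue=%d rejected=%d%n",
                            node.getNode().getName(),
                            pool.getActive(), pool.getQueue(), pool.getRejected());
                }
            }
        }
    }
}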
Can anything be seen in a thread dump that looks like stray queries? Maybe some facet queries hung while resources ran low and never returned?
Jörg
Still analyzing all the logs and dumps that I have accumulated so far, but
it looks like the blocking socket appender might be the issue. After that
node exhausts all of its search threads, the TransportClient will still
issue requests to it, although other nodes do not have issues. After a
while, the client application will also be blocked waiting for
Elasticsearch to return.
I removed logging for now and will re-implement it with a service that reads directly from the duplicate file-based log. Although I have a timeout set specifically on my query, my recollection of the search code is that it only applies to the Lucene LimitedCollector (it's been a while since I looked at that code). The next step should be to add an explicit timeout to actionGet(). Is the default basically no timeout?
It might be a challenge for the cluster engine not to delegate queries to overloaded servers.
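To illustrate the distinction, a rough sketch of that query-level timeout against the 1.x Java API (the index name and client instance are placeholders): the value set on the request only bounds shard-side collection and surfaces as isTimedOut() on the response, while the no-argument actionGet() on the client still blocks without limit.

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;

class QueryTimeoutExample {
    static SearchResponse searchWithQueryTimeout(Client client) {
        SearchResponse response = client.prepareSearch("my_index")   // placeholder index name
                .setQuery(QueryBuilders.matchAllQuery())
                .setTimeout("5s")            // per-shard search timeout, not a client-side limit
                .execute()
                .actionGet();                // still blocks indefinitely on the client
        if (response.isTimedOut()) {
            System.err.println("collection was cut short on one or more shards; hits may be partial");
        }
        return response;
    }
}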
Yes, the default actionGet() ends up in org.elasticsearch.common.util.concurrent.BaseFuture, which "waits" forever until interrupted. But there are twin methods, like actionGet(long millis), that time out.
Jörg
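A bounded wait on the client side might then look roughly like the sketch below (1.x API; the index name and the 10-second limit are arbitrary). If the node does not answer in time, the future throws ElasticsearchTimeoutException, although the search itself may keep running on the cluster.

import org.elasticsearch.ElasticsearchTimeoutException;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;

class ClientTimeoutExample {
    // Bounded wait on the client: actionGet(TimeValue) (or actionGet(long, TimeUnit))
    // instead of the no-argument actionGet(), which waits indefinitely.
    static SearchResponse searchWithClientTimeout(Client client) {
        try {
            return client.prepareSearch("my_index")                    // placeholder index name
                    .setQuery(QueryBuilders.matchAllQuery())
                    .execute()
                    .actionGet(TimeValue.timeValueSeconds(10));        // give up after 10s on the client
        } catch (ElasticsearchTimeoutException e) {
            // the future timed out; the search may still be running server-side
            return null;
        }
    }
}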
Yeah, already traced it back myself. I have been using Elasticsearch for years and have only ever set query timeouts. I need to re-architect a way to incorporate client-based timeouts.
Had two Elasticsearch meltdowns this weekend, after a long period of stability. Both of them different and unique!
Hi, were you able to find a solution to this problem?
I have a 5-node cluster (2 masters, 3 members) and one of the nodes is pegged because of a search queue that won't go down. I think the problem is that I am making S3 repo calls to a repo that has an underscore in the name, and S3 doesn't like this.
The search queue on my pegged node is at 6257 and it won't drop.
I am going to change the repo setup and calls to remove the underscore, but I can't get the one node to drop its search queue, and hence the whole cluster has become completely unresponsive to search.
I tried to stop/start the node, but the search queue just moves to another node. I am on version 1.7.3 and I'm willing to upgrade the cluster, but I need to dislodge this search queue first so I can test the upgrade before rolling into production.
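Not a fix for the stuck queue itself, but for the repository rename, something along these lines could unregister the old repo and register it again without the underscore from the Java client. This is only a sketch against the 1.7 API: the repository names, bucket, and region are placeholders, and it assumes the cloud-aws plugin is installed so the "s3" type is available.

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

class RepoRename {
    // Unregistering a repository only removes it from the cluster state;
    // the snapshot data in the S3 bucket is left untouched.
    static void reRegisterRepo(Client client) {
        client.admin().cluster().prepareDeleteRepository("my_old_repo")      // placeholder name
                .execute().actionGet();
        client.admin().cluster().preparePutRepository("mynewrepo")           // placeholder name
                .setType("s3")                                               // requires the cloud-aws plugin
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("bucket", "my-snapshot-bucket")                 // placeholder bucket
                        .put("region", "us-east-1")                          // placeholder region
                        .build())
                .execute().actionGet();
    }
}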