Timeouts on Node Stats API?


(Xiao Yu) #1

Hello,

We have a cluster that's still running 0.90.9, and it has recently
developed an interesting issue: random (data) nodes within our cluster will
occasionally stop responding to the node stats API, and we see errors like
the following in the master node's logs.

Mar 12 20:53:17 esm1.global.search.iad.wordpress.com [2014-03-12
20:53:17,945][DEBUG][action.admin.cluster.node.stats]
[esm1.global.search.iad.wordpress.com] failed to execute on node
[CBIR6UWfSvqPIHOSgJ3c2Q]
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[es3.global.search.iad.wordpress.com][inet[66.155.9.130/66.155.9.130:9300]][cluster/nodes/stats/n]
request_id [12395955] timed out after [15001ms]
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Mar 12 20:53:17 esm1.global.search.iad.wordpress.com at
java.lang.Thread.run(Thread.java:724)

While this is happening our cluster appears to function "normally": queries
sent to the problematic node are processed and returned normally, and
judging by the network traffic and load on the box it even appears to be
handling queries and rebalancing shards. The only way to resolve the
problem appears to be restarting the node, at which point it disconnects,
rejoins the cluster, and functions like any other node.
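For reference, the stats call that times out corresponds to something like
this on the REST side (the node ID is from the log above; host and the
--max-time value are placeholders):

```shell
# Query node stats for the affected node directly; with the node in this
# state the request presumably just hangs until the client-side timeout.
curl --max-time 20 "localhost:9200/_nodes/CBIR6UWfSvqPIHOSgJ3c2Q/stats?pretty"
```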

Mar 12 21:02:45 esm1.global.search.iad.wordpress.com [2014-03-12
21:02:45,991][INFO ][action.admin.cluster.node.shutdown]
[esm1.global.search.iad.wordpress.com] [partial_cluster_shutdown]:
requested, shutting down [[CBIR6UWfSvqPIHOSgJ3c2Q]] in [1s]
Mar 12 21:02:46 esm1.global.search.iad.wordpress.com [2014-03-12
21:02:46,995][INFO ][action.admin.cluster.node.shutdown]
[esm1.global.search.iad.wordpress.com] [partial_cluster_shutdown]: done
shutting down [[CBIR6UWfSvqPIHOSgJ3c2Q]]
...
Mar 12 21:03:40 esm1.global.search.iad.wordpress.com
org.elasticsearch.transport.NodeDisconnectedException:
[es3.global.search.iad.wordpress.com][inet[66.155.9.130/66.155.9.130:9300]][indices/stats/s]
disconnected
...
Mar 12 21:03:41 esm1.global.search.iad.wordpress.com [2014-03-12
21:03:41,045][INFO ][cluster.service ]
[esm1.global.search.iad.wordpress.com] removed
{[es3.global.search.iad.wordpress.com][CBIR6UWfSvqPIHOSgJ3c2Q][inet[66.155.9.130/66.155.9.130:9300]]{dc=iad,
parity=1, master=false},}, reason:
zen-disco-node_left([es3.global.search.iad.wordpress.com][CBIR6UWfSvqPIHOSgJ3c2Q][inet[/66.155.9.130:9300]]{dc=iad,
parity=1, master=false})
Mar 12 21:04:29 esm1.global.search.iad.wordpress.com [2014-03-12
21:04:29,077][INFO ][cluster.service ]
[esm1.global.search.iad.wordpress.com] added
{[es3.global.search.iad.wordpress.com][4Nj-a0SxRTuagLelneThQg][inet[/66.155.9.130:9300]]{dc=iad,
parity=1, master=false},}, reason: zen-disco-receive(join from
node[[es3.global.search.iad.wordpress.com][4Nj-a0SxRTuagLelneThQg][inet[/66.155.9.130:9300]]{dc=iad,
parity=1, master=false}])

All this sometimes causes shards to relocate or go into a recovery state
needlessly.

Nothing appears in the problematic node's logs aside from some slow queries
similar to those on all the other data nodes, and our monitoring systems
show no anomalies in CPU / memory / thread usage.

Any ideas as to what else I should check?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/888077be-cfbc-4682-ad55-dd051caec878%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Xiao Yu) #2

After restarting nodes I'm also getting a bunch of errors for calls to the
index stats API once the node has come back up. It seems like there's some
issue here where stats API calls fail without timing out, causing a backlog
of other calls until a thread pool fills up?

Mar 13 12:22:31 esm1.global.search.sat.wordpress.com [2014-03-13
12:22:31,063][DEBUG][action.admin.indices.stats]
[esm1.global.search.sat.wordpress.com] [test-lang-analyzers-0][18],
node[pnlbpsoZRlWbVfgIPSb9vg], [R], s[STARTED]: Failed to execute
[org.elasticsearch.action.admin.indices.stats.IndicesStatsRequest@35193b8f]
Mar 13 12:22:31 esm1.global.search.sat.wordpress.com
org.elasticsearch.transport.NodeDisconnectedException:
[es4.global.search.sat.wordpress.com][inet[es4.global.search.sat.wordpress.com/76.74.248.144:9300]][indices/stats/s]
disconnected

It looks like these requests get put into the MANAGEMENT thread pool
(https://github.com/elasticsearch/elasticsearch/search?q=ThreadPool.Names.MANAGEMENT&type=Code)
which we've left at the default configs.
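If that's the case, the management pool's counters should show it.
Something like the following ought to confirm a saturated pool (using the
flag-style metric selection, assuming 0.90.x syntax):

```shell
# Per-node thread pool stats; a backed-up "management" pool would show
# active threads at the cap with a non-empty queue or growing rejections.
curl "localhost:9200/_nodes/stats?thread_pool=true&pretty"
```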



(Binh Ly-2) #3

Can you do a hot_threads while this is happening?

curl "localhost:9200/_nodes/hot_threads"



(Xiao Yu) #4

Can you do a hot_threads while this is happening?

I took a couple samples from 2 nodes that were experiencing this issue:

It seems like the problematic nodes are just doing normal searching
operations?



(Xiao Yu) #5

After restarting the node I see logs like the following gist, which seems
to suggest there may be some internal networking issue:

https://gist.github.com/xyu/9541662



(Xiao Yu) #6

Can you do a hot_threads while this is happening?

Just for good measure I also checked hot threads for blocking and waiting;
nothing interesting there either.
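(For reference, block/wait samples like the ones below can be requested
with the type parameter of the hot threads API:)

```shell
# Sample threads by time spent blocked on monitors, then by time spent
# in waiting state, instead of the default CPU time.
curl "localhost:9200/_nodes/hot_threads?type=block"
curl "localhost:9200/_nodes/hot_threads?type=wait"
```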

:::
[es1.global.search.sat.wordpress.com][7fiNB_thTk-GRDKe4yQITA][inet[/76.74.248.134:9300]]{dc=sat,
parity=1, master=false}

0.0% (0s out of 500ms) wait usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

0.0% (0s out of 500ms) wait usage by thread 'Signal Dispatcher'
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot

0.0% (0s out of 500ms) wait usage by thread 'elasticsearch[es1.global.search.sat.wordpress.com][[timer]]'
 10/10 snapshots sharing following 2 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run(ThreadPool.java:511)

:::
[es1.global.search.sat.wordpress.com][7fiNB_thTk-GRDKe4yQITA][inet[/76.74.248.134:9300]]{dc=sat,
parity=1, master=false}

0.0% (0s out of 500ms) block usage by thread 'Finalizer'
 10/10 snapshots sharing following 4 elements
   java.lang.Object.wait(Native Method)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
   java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
   java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:189)

0.0% (0s out of 500ms) block usage by thread 'Signal Dispatcher'
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot
 unique snapshot

0.0% (0s out of 500ms) block usage by thread 'elasticsearch[es1.global.search.sat.wordpress.com][[timer]]'
 10/10 snapshots sharing following 2 elements
   java.lang.Thread.sleep(Native Method)
   org.elasticsearch.threadpool.ThreadPool$EstimatedTimeThread.run(ThreadPool.java:511)

While all this is happening indexing operations start to see curl timeouts
in our application logs.



(Clinton Gormley) #7

Anything in the logs or slow logs? You're sure slow GCs aren't impacting
performance?



(Xiao Yu) #8

Anything in the logs or slow logs? You're sure slow GCs aren't impacting
performance?

There's nothing in the logs on the node other than slow queries, before or
during the problematic period. The slow query logs show the same types of
queries that we know to be slow (MLT queries with function rescores), and
as a percentage of queries executed there is no bump in the number of slow
queries before or during the problematic period. (Yes, during the
problematic period the broken node actually executes and returns query
requests; it's not clear to me whether it's simply routing queries to other
nodes or whether its shards are actually executing queries as well.)

In addition, before the problem occurs there is no increase in ES threads,
heap, or non-heap memory use, and the number of GC cycles remains consistent
at about one every 2 seconds on the node. There are no long GC cycles and
the node never drops from the cluster. During the problematic period the
cluster reports that it's in a green state; however, all of our logging
indicates that no indexing operations complete cluster-wide (we should be
seeing 100-500 / sec under normal load).
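One way to double-check the stall independently of our application logs
would be to sample the indexing counters twice and compare (assuming the
stats APIs themselves still respond at that point):

```shell
# If the cluster-wide index_total counters don't advance between samples,
# indexing has genuinely stalled rather than just being slow to log.
curl -s "localhost:9200/_stats?pretty" | grep '"index_total"'
sleep 10
curl -s "localhost:9200/_stats?pretty" | grep '"index_total"'
```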



(Xiao Yu) #9

I still don't have any definitive logs or traces that point to the exact
cause of this situation, but it appears to be some weird scheduling bug
related to hyper-threading. Our nodes run OpenJDK 7u25 on hyper-threaded
CPUs, which caused ES to report an "available_processors" count of 2x the
number of physical cores. After setting the "processors" setting to the
physical core count we are no longer seeing this issue.

We've been running in this configuration all weekend with no recurrence,
whereas we were seeing the issue every couple of hours before.
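For anyone hitting the same thing, the change was just the "processors"
setting in elasticsearch.yml; a minimal sketch (the core count here is an
example value, use your machine's physical core count):

```yaml
# elasticsearch.yml
# Override the detected processor count so thread pool sizes are derived
# from physical cores, not the 2x count reported with hyper-threading.
processors: 16
```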

--Xiao



(Clinton Gormley) #10

Good to hear! Thanks for reporting back



(system) #11