Stats API is slow sometimes


(Bdharrington7) #1

Hi, we've been seeing strange issues in our cluster for a while now, and I'm hoping this is a clue: sometimes loading the stats API (localhost:9200/_nodes/stats?all=true) takes anywhere from 5 seconds to 5 minutes to return. I thought it might be a cold-JVM issue, but in testing I can let the cluster sit for 10 seconds or 10 minutes and the API is sometimes slow and sometimes very responsive. I've checked top for high CPU usage, but nothing looks unusual there. Here's my setup:

Elasticsearch 1.7.4
Java 1.8.0_45
Running on CentOS 6.4 machines with 15 GB RAM, hosted on OpenStack
6 data nodes (also acting as client nodes), 3 master nodes

The cluster is barely used: it holds about 100 docs and no one is querying it. Has anyone else experienced this?
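When reproducing this, it helps to record the wall-clock time curl sees for each call; a minimal sketch (host and port taken from the example above; the loop count and sleep are arbitrary):

```shell
# Time the node stats endpoint a few times in a row; time_total is in seconds.
# Host/port match the original example; adjust as needed for your cluster.
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w "attempt $i: HTTP %{response_code} in %{time_total}s\n" \
    'http://localhost:9200/_nodes/stats?all=true'
  sleep 2
done
```

Seeing a mix of sub-second and multi-second `time_total` values across back-to-back attempts would confirm the intermittent behavior rather than a consistently slow endpoint.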


(Jason Tedor) #2

When it's hanging, are you able to get a report from the hot threads API, or can you take a stack dump (with jstack) of all of the nodes on your cluster?
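A sketch of capturing a jstack dump on one node while the call is hanging (assumes the JDK's jstack is on the PATH; the pgrep pattern matches the standard Elasticsearch main class and may need adjusting for your install):

```shell
# Find the Elasticsearch JVM's PID and write a full thread dump to /tmp.
# The pgrep pattern is an assumption; verify with `ps aux | grep elasticsearch`.
ES_PID=$(pgrep -f 'org.elasticsearch.bootstrap.Elasticsearch' | head -n 1)
jstack "$ES_PID" > "/tmp/es-stack-$(hostname)-$(date +%Y%m%dT%H%M%S).txt"
```

Taking two or three dumps a few seconds apart during a slow request makes it easier to spot a thread that stays blocked in the same place.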

How much heap have you allocated to the Elasticsearch process?


(Bdharrington7) #3

Each node has about half the machine's memory allocated to the Elasticsearch heap.
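You can confirm what each JVM actually received (rather than what was configured) from the node stats; a quick check, assuming the same localhost:9200 endpoint as above:

```shell
# Print each node's name plus current and maximum heap from the JVM stats.
# Field names are from the 1.x node stats API; grep is a rough filter.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' \
  | grep -E '"(name|heap_used_in_bytes|heap_max_in_bytes)"'
```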


(Bdharrington7) #4

Here are a couple of queries to hot_threads during a period when it hangs:

[brian@data-04 ~]$ curl -XGET localhost:9200/_nodes/_local/hot_threads
::: [data-04][ascqiAteQRG65CIk78FMbg][data-04.novalocal][inet[/10.xxx.xx.191:9300]]{http.enabled=true, master=false}
   Hot threads at 2016-01-29T00:22:47.680Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (167.3micros out of 500ms) cpu usage by thread 'elasticsearch[data-04][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

[brian@data-04 ~]$ curl -XGET localhost:9200/_nodes/_local/hot_threads
::: [data-04][ascqiAteQRG65CIk78FMbg][data-04.novalocal][inet[/10.xxx.xx.191:9300]]{http.enabled=true, master=false}
   Hot threads at 2016-01-29T00:23:00.920Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (205.8micros out of 500ms) cpu usage by thread 'elasticsearch[data-04][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

[brian@data-04 ~]$ curl -XGET localhost:9200/_nodes/_local/hot_threads
::: [data-04][ascqiAteQRG65CIk78FMbg][data-04.novalocal][inet[/10.xxx.xx.191:9300]]{http.enabled=true, master=false}
   Hot threads at 2016-01-29T00:23:08.391Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (161.8micros out of 500ms) cpu usage by thread 'elasticsearch[data-04][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

[brian@data-04 ~]$ curl -XGET localhost:9200/_nodes/_local/hot_threads
::: [data-04][ascqiAteQRG65CIk78FMbg][data-04.novalocal][inet[/10.xxx.xx.191:9300]]{http.enabled=true, master=false}
   Hot threads at 2016-01-29T00:23:41.041Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

[brian@data-04 ~]$ curl -XGET localhost:9200/_nodes/_local/hot_threads
::: [data-04][ascqiAteQRG65CIk78FMbg][data-04.novalocal][inet[/10.xxx.xx.191:9300]]{http.enabled=true, master=false}
   Hot threads at 2016-01-29T00:23:45.638Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (186.9micros out of 500ms) cpu usage by thread 'elasticsearch[data-04][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

[brian@data-04 ~]$ curl -XGET localhost:9200/_nodes/_local/hot_threads
::: [data-04][ascqiAteQRG65CIk78FMbg][data-04.novalocal][inet[/10.xxx.xx.191:9300]]{http.enabled=true, master=false}
   Hot threads at 2016-01-29T00:23:50.133Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (188.4micros out of 500ms) cpu usage by thread 'elasticsearch[data-04][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

(Jason Tedor) #5

Okay, that's fine. Any messages in the logs about GC durations? Are you able to get a stack dump?
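For the GC check: Elasticsearch 1.x logs slow collections through the JVM monitor service, so a rough scan of the logs would look like this (the log path assumes a default RPM-style layout; adjust to your install):

```shell
# Look for JvmMonitorService warnings about long GC pauses.
# "[monitor.jvm]" is the logger name Elasticsearch 1.x uses for these messages.
grep -h 'monitor.jvm' /var/log/elasticsearch/*.log | tail -n 20
```

No matches during the slow periods would support ruling out GC pauses as the cause.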


(Bdharrington7) #6

The hot_threads output is above. There are no problems with long GC times; additionally, there isn't much activity on the cluster, so there wouldn't be any large GCs.

