Kibana: random error "Request Timeout after 30000ms"

Hello.
I run kibana + elasticsearch in a small production environment, and I get the error "Request Timeout after 30000ms" every few queries.
I'm opening a thread here in the elasticsearch forum because my guess is that the problem is more an elasticsearch issue than a kibana issue:
from my investigations, elasticsearch randomly stops replying even to simple requests like
curl -XGET http://localhost:9200/_cluster/health?pretty , which hangs forever.

To clarify:

  • I open a terminal on node2
  • after a few requests like curl -XGET http://localhost:9200/_cluster/health?pretty , curl hangs forever
  • I open another terminal on the same node and run the curl request again while the first one is still running: sometimes I get a reply, sometimes curl hangs

I found the same behaviour on all 3 nodes.
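A small loop like the one below can help quantify how often the request hangs (just a rough sketch; it relies on curl's --max-time flag to abort a hung request, and the 10-second cutoff is arbitrary):

# Fire 50 health checks and report which ones hang for more than 10 seconds
for i in $(seq 1 50); do
  if curl -s --max-time 10 -o /dev/null http://localhost:9200/_cluster/health; then
    echo "$i: ok"
  else
    echo "$i: timed out or failed"
  fi
  sleep 1
done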

My setup:

servers: ubuntu 14.04 running on Amazon EC2 m3.large instances (2 CPUs, 7.5 GB RAM)
elasticsearch 5.1.2
kibana 5.1.2

kibana runs on a dedicated server, while elasticsearch runs on 3 other servers.
Each node has 6 GB allocated for the heap using -Xms6g -Xmx6g.

Here are the stats for the cluster's nodes:

> $ curl -XGET 'localhost:9200/_cat/nodes?v'
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.31.11.125           22          99   7    0.37    0.32     0.32 mdi       -      node2.tr.elasticsearch.local
172.31.14.208           58          98  16    0.50    0.33     0.32 mdi       *      node3.tr.elasticsearch.local
172.31.14.207           10          99   1    0.17    0.20     0.25 mdi       -      node4.tr.elasticsearch.local

kibana is configured to use node2, using elasticsearch.url: "http://172.31.11.125:9200" in kibana.yml
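As a side note, the 30000ms in the Kibana error corresponds to Kibana's request timeout, which defaults to 30 seconds. Raising it in kibana.yml is only a stopgap and does not fix the underlying problem, but it can keep dashboards usable while investigating (setting name as I recall it for 5.x, worth double-checking against the docs):

# kibana.yml: give elasticsearch more time to answer before Kibana gives up
elasticsearch.requestTimeout: 60000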

ping time between nodes is around 0.3 ms

example:
> $ ping 172.31.14.208
PING 172.31.14.208 (172.31.14.208) 56(84) bytes of data.
64 bytes from 172.31.14.208: icmp_seq=1 ttl=64 time=0.325 ms
64 bytes from 172.31.14.208: icmp_seq=2 ttl=64 time=0.323 ms
64 bytes from 172.31.14.208: icmp_seq=3 ttl=64 time=0.327 ms

The elasticsearch.log file doesn't show anything.
No CPU spikes show up when the curl hangs.
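Since the logs are quiet at the default level, one thing that can be done is to temporarily raise transport logging on the cluster and watch for connect/disconnect events while reproducing the hang (a sketch based on the 5.x dynamic logger settings; revert it afterwards to avoid noisy logs):

# Turn up transport-layer logging on all nodes
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '{
  "transient": { "logger.org.elasticsearch.transport": "DEBUG" }
}'
# Put it back to the default when done
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '{
  "transient": { "logger.org.elasticsearch.transport": null }
}'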

Any advice on what to investigate further would be appreciated.

SLL

I'm still trying to solve this.

Running curl -XGET 'localhost:9200/_tasks?actions=*search&detailed&pretty' on node2, I can see a lot of simple queries that have been running for hundreds of seconds.

You can find the output in this gist:

search tasks
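As a stopgap, the stuck searches can also be cancelled through the task management API so they stop piling up while the root cause is investigated (a sketch; whether a given search task is actually cancellable depends on the version and the query):

# Ask all nodes to cancel their running search tasks
curl -XPOST 'localhost:9200/_tasks/_cancel?actions=*search&pretty'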

After about 500 seconds, I finally got an error:

$ curl -XGET 'localhost:9200/_tasks?actions=*search&detailed&pretty' 
    {
      "node_failures" : [
        {
          "type" : "failed_node_exception",
          "reason" : "Failed node [D0csBa1rTBuzTbkRj1eD-g]",
          "caused_by" : {
            "type" : "node_not_connected_exception",
            "reason" : "[node3.tr.elasticsearch.local][172.31.14.208:9300] Node not connected"
          }
        }
      ],
      "nodes" : { }
    }

but according to the cluster health API, the cluster status is green. :confused:

$ curl -XGET 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 656,
  "active_shards" : 1312,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
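Given the mismatch between the green health status and the node_not_connected_exception, it seems worth checking the transport layer (port 9300) separately from the HTTP layer (port 9200), for example with something like this from node2 (assuming nc is installed):

# Is node3's transport port reachable from node2?
nc -zv 172.31.14.208 9300
# Does node3 itself still see all three nodes?
curl -XGET 'http://172.31.14.208:9200/_cat/nodes?v'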

The elasticsearch.log on node2 shows this:

[2017-01-30T11:59:20,261][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [node2.tr.elasticsearch.local] failed to execute on node [D0csBa1rTBuzTbkRj1eD-g]
org.elasticsearch.transport.SendRequestTransportException: [node3.tr.elasticsearch.local][172.31.14.208:9300][cluster:monitor/nodes/info[n]]
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:531) ~[elasticsearch-5.1.2.jar:5.1.2]
        at org.elasticsearch.transport.TransportService$$Lambda$1049/733357076.sendRequest(Unknown Source) ~[?:?]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:465) ~[elasticsearch-5.1.2.jar:5.1.2]
 .....
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node3.tr.elasticsearch.local][172.31.14.208:9300] Node not connected
        at org.elasticsearch.transport.TcpTransport.nodeChannel(TcpTransport.java:491) ~[elasticsearch-5.1.2.jar:5.1.2]
        at org.elasticsearch.transport.TcpTransport.sendRequest(TcpTransport.java:921) ~[elasticsearch-5.1.2.jar:5.1.2]
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:520) ~[elasticsearch-5.1.2.jar:5.1.2]
 .....

but there is nothing in the logs on node3 or node4 :confused:

Any suggestion on what to investigate further is appreciated.

SLL

The recommendation is to give at most 50% of the available memory to the Java heap, and you have assigned over 75%, which leaves very little for the operating system file cache; that cache is important for good performance. That said, it looks like you have quite a large number of shards for a cluster of that size. How much data do you hold in the cluster? What is the average shard size? Having a large number of small shards can be very inefficient, as each shard carries some overhead, and querying across such a large number of shards can be slow, especially if it results in disk I/O against slow storage because the page cache is small.
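For example, the _cat APIs give a quick overview of the data volume and per-shard sizes, and the heap is set in config/jvm.options; on an m3.large with 7.5 GB of RAM the 50% guideline would put the heap at roughly 3 to 4 GB rather than 6 GB:

# Disk used and shard count per node
curl -XGET 'localhost:9200/_cat/allocation?v'
# Store size of every shard; with 1312 shards this shows the average shard size
curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node'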

Look at disk I/O (e.g. using iostat) as well as CPU usage while you are querying, and see whether you can identify what is limiting performance. Also look in the Elasticsearch logs for signs of long GC pauses or other errors.
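For instance, while one of the slow queries is running (the log path assumes a default package install with the cluster name "elasticsearch"):

# Per-device I/O utilisation, refreshed every 5 seconds
iostat -x 5
# Recent GC-related lines in the node's log
grep -i "gc" /var/log/elasticsearch/elasticsearch.log | tail -n 50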

Hello.

In the end I solved the connectivity issues by following the instructions in this link: http://logz.io/blog/elasticsearch-cluster-disconnects/

To give a short recap: there is a bug in the ubuntu network driver that causes random packet loss. In the syslog of the server I got messages like this:

xen_netfront: xennet: skb rides the rocket: 19 slots
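A quick way to check whether a node is affected is to grep the syslog for that message:

# Count how often the driver has dropped oversized skbs
grep -c "skb rides the rocket" /var/log/syslog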

The workaround is to disable the scatter-gather capability on the network interface.

# To see if scatter-gather is ON on eth0:
ethtool -k eth0 | grep sca
# To set it to off:
ethtool -K eth0 sg off
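Note that ethtool settings do not survive a reboot. A sketch of one way to make the change persistent, assuming the stock /etc/network/interfaces setup on ubuntu 14.04:

# /etc/network/interfaces (snippet): reapply the setting whenever eth0 comes up
auto eth0
iface eth0 inet dhcp
    post-up /sbin/ethtool -K eth0 sg off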

@Christian_Dahlqvist: I will follow the recommendation regarding the allocated java heap, thank you for noting it.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.