Elasticsearch timeouts and Kibana Discover: socket hang up


(James Tighe) #1

Hi Guys,

I am hoping someone can help me with my issue.

I have a 2 node production cluster of ES 5.0 and a standalone Kibana 5.0 server to access the data.

ES itself seems to be running well and the indexes are being created from my Syslog listener in Logstash.

However, Kibana returns "Discover: Timeout exceeded 30000ms" whenever I query a time range longer than about 4 hours.

I have increased the elasticsearch.timeout to a much higher amount, and now instead of "Discover: Timeout exceeded" I am getting "Discover: Socket hang up" and no results.

As stated, ES is happily indexing and the monitoring shows this to be the case.

The only way to resolve this issue is to reboot the cluster, which is obviously not something I want to do all the time.

For reference I have very large indexes (around 8GB for each node). And currently I have an index pattern for billinglog-* (so we can search all dates easily).

If I instead use a daily index pattern, such as billinglog-2016.11.16 I can seemingly search the full index without the error but only for some of the indexes. However some indexes still fail with the socket hang up error.

Is this simply an issue with the amount of data in the index? Or are there settings in ES that need to be set to allow faster querying of the indexes?

Each of my ES servers has 4 CPUs and 16GB of RAM, with ES_HEAP set to 8GB (half the available memory, as advised).

Marvel isn't showing any memory issues, so I don't think it is necessarily the machine spec.

Any help would be appreciated as we can't use the cluster if we can't view the data.

Thanks

James


(Ed) #2

Increase your timeout setting in Kibana:

elasticsearch.requestTimeout:

https://www.elastic.co/guide/en/kibana/5.0/settings.html


(James Tighe) #3

I have already done that.

Once I did that I no longer got the Timeout exceeded error.

This is when the Discover: Socket hang up started happening.

It's really annoying as I can't see why it is failing to get a response.


(Ed) #4

Have to go to the basics first

  • Any errors in Elasticsearch during the problem
  • Any errors in Kibana

If you cannot search some indexes, it sounds like the cluster is unhealthy.

When you're having problems, do you have unallocated shards? Is the cluster in a Green/OK state at the time?

If you just restart Kibana, does that fix the problem?


(Ed) #5

How is your disk space?


(James Tighe) #6

Hi,

My cluster is healthy and all shards are assigned.

Disk available 861GB / 982GB
Indices: 15
Disk Usage: 69GB

19 Shards on each node

I can't see any errors in the Elasticsearch log file, but the Kibana log shows the following when searching a 24 hour time period for billing-2016.11.17:

{"type":"response","@timestamp":"2016-11-17T13:14:39Z","tags":[],"pid":30847,"method":"post","statusCode":502,"req":{"url":"/elasticsearch/_msearch","method":"post","headers":{"host":"################:5601","connection":"keep-alive","content-length":"737","accept":"application/json, text/plain, */*","origin":"http://##########:5601","kbn-version":"5.0.0","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.44 Safari/537.36","content-type":"application/x-ldjson","referer":"http://##########:5601/app/kibana","accept-encoding":"gzip, deflate","accept-language":"en-US,en;q=0.8"},"remoteAddress":"##########","userAgent":"##########","referer":"http://##########:5601/app/kibana"},"res":{"statusCode":502,"responseTime":31955,"contentLength":9},"message":"POST /elasticsearch/_msearch 502 31955ms - 9.0B"}

Therefore it looks like a 502 bad gateway response from the ES cluster.

If I search the same index for only a 4 hour period it succeeds and I see the following

> {"type":"response","@timestamp":"2016-11-17T13:27:36Z","tags":[],"pid":30847,"method":"post","statusCode":200,"req":{"url":"/elasticsearch/_msearch","method":"post","headers":{"host":"thmanpemkib01.cobwebmanage.local:5601","connection":"keep-alive","content-length":"737","accept":"application/json, text/plain, */*","origin":"http://thmanpemkib01.cobwebmanage.local:5601","kbn-version":"5.0.0","user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.44 Safari/537.36","content-type":"application/x-ldjson","referer":"http://thmanpemkib01.cobwebmanage.local:5601/app/kibana","accept-encoding":"gzip, deflate","accept-language":"en-US,en;q=0.8"},"remoteAddress":"10.0.49.105","userAgent":"10.0.49.105","referer":"http://thmanpemkib01.cobwebmanage.local:5601/app/kibana"},"res":{"statusCode":200,"responseTime":31260,"contentLength":9},"message":"POST /elasticsearch/_msearch 200 31260ms - 9.0B"}

So it seems that I can only search the live index back 4 hours. Any more and the error happens.

If I manually load the index pattern for the day before I can fully search without issues.
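For anyone comparing log entries like the two above, here is a minimal Python sketch that pulls out just the status code and response time from a Kibana response log line. The function name `summarize_response` and the shortened `sample` line are made up for illustration; the field names (`res.statusCode`, `res.responseTime`, `req.url`) are the ones visible in the log output above. A failure that lands just past a round figure such as 30 seconds may hint at a fixed timeout somewhere in the request path.

```python
import json


def summarize_response(log_line: str) -> dict:
    """Extract URL, status code and response time from one Kibana response log line."""
    line = log_line.strip()
    # Tolerate a leading "> " quote marker left over from copy and paste.
    if line.startswith(">"):
        line = line[1:].lstrip()
    entry = json.loads(line)
    res = entry.get("res", {})
    return {
        "url": entry.get("req", {}).get("url"),
        "status": res.get("statusCode"),
        "response_ms": res.get("responseTime"),
    }


# Shortened, illustrative version of the 502 entry above (most headers removed).
sample = (
    '{"type":"response","statusCode":502,'
    '"req":{"url":"/elasticsearch/_msearch","method":"post"},'
    '"res":{"statusCode":502,"responseTime":31955,"contentLength":9}}'
)

print(summarize_response(sample))
# {'url': '/elasticsearch/_msearch', 'status': 502, 'response_ms': 31955}
```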


(Ed) #7

OK, interesting. Let's rule out Kibana. The 30000 figure sounds like the Kibana timeout, but since you have already raised it, let's go down to the low level.

Can you issue the same query from curl?

curl http://localhost:9200/_search?timeout=60000 -d '{INSERT QUERY HERE}'

Try it without the timeout setting first, so we just test the defaults. Then, if that still times out, try higher numbers.

I would also check your IO stats while the query runs; iotop works well.

Oh, and you are really over-sharded IMO: with only 2 nodes (4 CPUs each), 19 shards is a lot of threading and memory overhead. I would keep your sharding down around 5. You're probably having problems map-reducing across all the shards (i.e. the CPU can't keep up).

What is your load average during any query longer than 4 hours?


(James Tighe) #8

Sorry I meant 19 shards in total for all indices . . . my bad.

As it is a 2 node cluster I have gone with 1 primary and 1 replica for now (to keep storage down). As this is not live yet I can always increase the shards later on.

So the actual shard amount is 2 per index.

I will have a look at the IO during the query in case it is the disks. The difference in response time between the 200 and 502 response is quite large so something is amiss.

I have run the query locally on an ES node and get the following.

{
  "took" : 739,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 70512674,
    "max_score" : 1.0,
    "hits" : [

That was from running the query below in Sense (curl gives the same results):

GET _search 
{
  "query": {
    "range" : {
      "@timestamp": {
        "gte": "now-24h",
        "lte": "now"
      }
    }
  }
}

Now I am even more confused, as ES seems to handle the query fine and has 81302299 hits.

Therefore it looks like Kibana may be the culprit.

I am running Kibana behind an NGINX reverse proxy, if this would make any difference.
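To repeat this direct-to-ES check from a script rather than Sense, something along these lines might work. It is only a sketch: `build_range_query` and `time_search` are hypothetical helper names, the base URL is assumed to be the `http://localhost:9200` address from the curl example earlier in the thread, and the 120 second client timeout is an arbitrary choice. The query body is the same `@timestamp` range query shown above.

```python
import json
import time
import urllib.request


def build_range_query(hours: int) -> dict:
    """Build the same @timestamp range query as above for the last `hours` hours."""
    return {
        "query": {
            "range": {
                "@timestamp": {"gte": f"now-{hours}h", "lte": "now"}
            }
        }
    }


def time_search(base_url: str, hours: int, client_timeout_s: float = 120.0) -> float:
    """POST the range query to <base_url>/_search and return the elapsed seconds."""
    body = json.dumps(build_range_query(hours)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.monotonic()
    with urllib.request.urlopen(request, timeout=client_timeout_s) as response:
        response.read()  # read the full body so transfer time is included
    return time.monotonic() - start
```

Calling `time_search("http://localhost:9200", 24)` on an ES node and comparing the result with the response time in the Kibana log gives a rough idea of whether the delay is in Elasticsearch or in something sitting in front of it.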


(James Tighe) #9

Also for reference the Load Average went up to 2.01

and CPU rose to 18.67%

My CPU usage is
38.33% max 6% min - NODE 1 (PRIMARY)
50.33% max 5% min - NODE 2

JVM Memory is
27% max 24% min - NODE 1
22% max 18% min - NODE 2


(Ed) #10

Well, I run Apache in front of Kibana, and its mod_proxy has a 30s timeout. Something similar could be breaking your connection.

I had to sync that up with the Kibana setting of 5 minutes. I have 20B records in my cluster.


(James Tighe) #11

You sir are a genius.

I had completely forgotten the keepalive_timeout option in my nginx.conf.

I have upped it to match the elasticsearch.requestTimeout and it is now giving me the results.

I probably wouldn't have thought of that either

Thanks
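For anyone landing here later, the general idea is that every hop between the browser and Elasticsearch has its own timeout, and the shortest one decides when the request gets cut off. A minimal Python sketch of that check follows. The helper names and the layer labels are made up for illustration, and the values are just the 30 second proxy timeout and 5 minute Kibana timeout mentioned in post #10, not measured settings from this cluster.

```python
def first_to_give_up(timeouts_ms: dict) -> tuple:
    """Return (layer, timeout_ms) for the layer with the shortest timeout."""
    layer = min(timeouts_ms, key=timeouts_ms.get)
    return layer, timeouts_ms[layer]


def layers_shorter_than(timeouts_ms: dict, reference_layer: str) -> list:
    """List the layers whose timeout is shorter than the reference layer's."""
    limit = timeouts_ms[reference_layer]
    return sorted(name for name, ms in timeouts_ms.items() if ms < limit)


# Illustrative values only: a 30 s proxy timeout in front of a 5 minute Kibana timeout.
chain = {
    "reverse_proxy": 30_000,
    "kibana_request_timeout": 300_000,
}

print(first_to_give_up(chain))
# ('reverse_proxy', 30000)
print(layers_shorter_than(chain, "kibana_request_timeout"))
# ['reverse_proxy']
```

Any layer that shows up in the second list would need raising to at least the Kibana value before long-running Discover queries can complete.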

