Kibana queries fail about 80% of the time. Server says upstream timed out

I inherited an ELK system, and I am not sure if it ever worked properly, but I know it is not working now.

My system has 4 clusters, for 4 different customers. Each cluster is fronted by an F5 load balancer, so that data from the logstash agents gets routed to a member of the appropriate cluster. I have run queries using curl and they appear to return correct results, so I believe that ingestion and querying are working properly at the Elasticsearch level.
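
(A rough way to check each cluster directly, using the same data node hostname that appears in the curl example below; the commands would need to be repeated against a member of each of the 4 clusters, and assume port 9200 is reachable from wherever you run them.)

# overall health of one cluster; repeat against a node in each of the 4 clusters
curl 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_cluster/health?pretty'

# quick node-level overview if the health comes back yellow or red
curl 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_cat/nodes?v'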

However, queries from Kibana fail about 80% of the time. A failed query shows a message in white letters on a red background: "Discover: An error occurred with your request. Reset your inputs and try again".
Looking at the HTTP traffic between the Kibana server and the browser, I see that the server is returning an HTTP status code of 504, gateway timeout.
When I look in /var/log/nginx/error.log, I see many errors of the form:

[root@n7-z01-0a2a2723 ~]# fgrep semfs-log.starwave.com /var/log/nginx/error.log
2015/05/22 09:22:39 [error] 20338#0: *973501 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.199.242.119, server: , request: "POST /elasticsearch/_msearch?timeout=0&ignore_unavailable=true&preference=1432311692251 HTTP/1.1", upstream: "http://127.0.0.1:5601/elasticsearch/_msearch?timeout=0&ignore_unavailable=true&preference=1432311692251", host: "semfs-log.starwave.com", referrer: "http://semfs-log.starwave.com/"
[root@n7-z01-0a2a2723 ~]# 
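
Since the 504 comes from nginx waiting on its upstream (Kibana on 127.0.0.1:5601, per the log above), it seems worth comparing the timeouts along the path. A rough check, assuming the usual config locations; note that nginx's default proxy_read_timeout is 60 seconds, much shorter than Kibana's 5-minute request_timeout:

# list any timeout directives nginx actually has configured (paths are a guess; adjust to this install)
grep -rn 'timeout' /etc/nginx/nginx.conf /etc/nginx/conf.d/ 2>/dev/null
# if proxy_read_timeout is not set anywhere, nginx falls back to 60s for the proxy to Kibana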

By way of contrast, if I query Elasticsearch directly via curl, it always works:

[MGMTPROD\silvj170@n7mmadm02 se-mfs]$ curl -XPOST 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_all/_search?pretty' -d '{ "query": { "match": { "host":"n7smtpinadm02.starwave.com" } } }'
{
  "took" : 51197,
  "timed_out" : false,
  "_shards" : {
    "total" : 1540,
    "successful" : 1540,
    "failed" : 0
  },
  "hits" : {
    "total" : 5389758568,
    "max_score" : 5.3979974,
    "hits" : [ {
      "_index" : "msg-logstash-2015.04.23",
      "_type" : "logs",
      "_id" : "AUzjoo87NG6Sot6lPwYa",
      "_score" : 5.3979974,
      "_source":{"message":"mtaqid=t3N0A0p3019620, engine=ctengine, from=<servico@solucaodesaude.com.br>, recipients=<erasmo@disney.com.br>, relay=ibm9.revendaversa.com.br [201.33.22.9], size=10555, spam_class=bulk","@version":"1","@timestamp":"2015-04-23T00:14:13.000Z","host":"n7smtpinadm02.starwave.com","program":"MM","thread":"Jilter Processor 29 - Async Jilter Worker 20 - 127.0.0.1:20518-t3N0A0p3019620","loglevel":"INFO","class":"user.log"}
    }, {
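
Note that even this "successful" direct query reports "took" : 51197, i.e. roughly 51 seconds across 1540 shards. A rough way to measure the wall-clock time of the same query for comparison against the proxy timeouts (curl's -w option just prints its own timing; nothing Elasticsearch-specific is assumed here):

# repeat the direct query and print total elapsed time
curl -s -o /dev/null -w 'total: %{time_total}s\n' \
  -XPOST 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_all/_search' \
  -d '{ "query": { "match": { "host":"n7smtpinadm02.starwave.com" } } }'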

So I assume that the problem is somewhere in how Kibana translates its queries into Elasticsearch requests. One hypothesis is that Kibana doesn't know about the 4 different clusters, but I don't know how to test this. The request_timeout in kibana.yml is 300000 ms, i.e. 5 minutes, and I am getting failures much faster than that.
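
One way to at least confirm what the running Kibana has configured is a sketch like the following; the install path is an assumption based on the path that shows up in the Kibana logs later in this thread, and shard_timeout may or may not be present in this version:

# confirm the timeout-related settings the running Kibana is actually using
grep -E 'request_timeout|shard_timeout|elasticsearch_url' /data/apps/prod/kibana-4.0.0-linux-x64/config/kibana.yml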

Another hypothesis is that the Elasticsearch daemons need to be locked in memory. I have been trying to set the mlockall parameter with no success, but it may be a moot point in my case, since top says that I'm not paging:

top - 14:48:39 up 15 days,  2:34,  1 user,  load average: 0.08, 0.05, 0.01
Tasks: 102 total,   1 running, 101 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.2%sy,  0.3%ni, 99.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8191792k total,  4241604k used,  3950188k free,   203892k buffers
Swap:        0k total,        0k used,        0k free,  2781356k cached
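
Rather than guessing, the standard nodes-info API in 1.x reports whether mlockall actually took effect (same data node hostname as in the curl example above):

# each node's "process" section includes "mlockall" : true/false
curl -s 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_nodes/process?pretty' | grep -i mlockall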

The elasticsearch servers are running version 1.4.4. I don’t know how to determine the version numbers of the logstash agents. I am running Kibana version 4.0.0. The kibana server and the elasticsearch servers are running on RHEL 6.6.

Thank you

Jeff Silverman

I have continued to read the documentation. The kibana server is a tribe node, and it has the tribes and their members enumerated in elasticsearch.yml.
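
Two rough sanity checks on the tribe setup; the elasticsearch.yml path and the localhost:9200 address are assumptions about where the tribe node lives and listens:

# the tribe stanza that enumerates the 4 clusters and their members
grep -n -A3 '^tribe' /etc/elasticsearch/elasticsearch.yml

# queried through the tribe node itself, the merged view should list indices from all 4 clusters
curl -s 'localhost:9200/_cat/indices?v' | head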

It could be the F5s timing out. What timeout values are set there?

I tried bypassing the LTMs and connecting directly to the tribe node. It made little difference.

I have been doing more reading - ELK is very complicated under the hood - and I have discovered that Kibana generates a list of fields. There is a checkbox labeled "hide missing fields". I unchecked that box, and behold! I get a lengthy list of fields that are missing. I see this list both when going through the LTM and when bypassing it. I understand that these fields are present in the Elasticsearch mapping but not in any documents in the search results. However, I am confused: if I run an Elasticsearch query looking for those fields, I find documents that contain them.
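
For anyone who wants to reproduce that check, a query along these lines works against ES 1.x (the index and the field "program" are just examples taken from the sample document above; it uses a filtered query with an exists filter):

# hits.total counts documents that actually contain the field
curl -XPOST 'n7-z01-0a2a27c8.iaas.starwave.com:9200/msg-logstash-2015.04.23/_search?pretty' -d '{
  "size": 0,
  "query": { "filtered": { "filter": { "exists": { "field": "program" } } } }
}'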

Thank you

Jeff

I thought that maybe the elasticsearch servers were getting overloaded, even though you wouldn't know it by looking at top. So I turned off ingestion by disabling the F5 LTM VIPs that the logstash agents point to. I also changed the F5 LTM health check monitoring interval from 5 seconds to 60 seconds. That slowed down the rate at which the file /data/logs/prod/kibana/kibana.out.log was written to, so I could see the following error messages:

{"@timestamp":"2015-05-27T22:31:37.681Z","level":"error","message":"ESOCKETTIMEDOUT","node_env":"production","error":{"message":"ESOCKETTIMEDOUT","name":"Error","stack":"Error: ESOCKETTIMEDOUT\n  at ClientRequest.<anonymous> (/data/apps/prod/kibana-4.0.0-linux-x64/src/node_modules/request/request.js:920:19)\n  at ClientRequest.g (events.js:180:16)\n  at ClientRequest.emit (events.js:92:17)\n  at Socket.emitTimeout (http.js:1802:10)\n  at Socket.g (events.js:180:16)\n  at Socket.emit (events.js:92:17)\n  at Socket._onTimeout (net.js:327:8)\n  at _makeTimerTimeout (timers.js:429:11)\n  at Timer.unrefTimeout [as ontimeout] (timers.js:493:5)\n","code":"ESOCKETTIMEDOUT"}}
{"@timestamp":"2015-05-27T22:31:37.682Z","level":"info","message":"POST /_msearch?timeout=0&ignore_unavailable=true&preference=1432764503773 502 - 300005ms","node_env":"production","request":{"method":"POST","url":"/elasticsearch/_msearch?timeout=0&ignore_unavailable=true&preference=1432764503773","headers":{"host":"semfs-log.starwave.com","x-forwarded-for":"10.199.233.210, 10.199.242.113","x-forwarded-proto":"http","connection":"close","content-length":"708","accept":"application/json, text/plain, */*","origin":"http://semfs-log.starwave.com","user-agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36","content-type":"application/json;charset=UTF-8","dnt":"1","referer":"http://semfs-log.starwave.com/?","accept-encoding":"gzip, deflate","accept-language":"en-US,en;q=0.8"},"remoteAddress":"127.0.0.1","remotePort":34839},"response":{"statusCode":502,"responseTime":300005,"contentLength":49}}
    {"@timestamp":"2015-05-27T22:13:24.751Z","level":"error","message":"ETIMEDOUT","node_env":"production","error":{"message":"ETIMEDOUT","name":"Error","stack":"Error: ETIMEDOUT\n  at [object Object]._onTimeout (/data/apps/prod/kibana-4.0.0-linux-x64/src/node_modules/request/request.js:909:15)\n  at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)\n","code":"ETIMEDOUT"}}

I did a little googling and found only one hit, https://github.com/elastic/kibana/issues/3112, which I think is irrelevant. In my case, Kibana is getting something back from Elasticsearch; otherwise it couldn't know about the missing fields, could it?
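
To separate "the query itself is slow" from "Kibana is doing something odd", the same kind of _msearch that Kibana sends can be replayed by hand against Elasticsearch. The body below is a simplified stand-in for Kibana's request, not a capture of the real one, and the index name is just one daily index from this system:

# _msearch takes alternating header/body lines; bash $'...' quoting supplies the required newlines
curl -XPOST 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_msearch?ignore_unavailable=true&pretty' \
  --data-binary $'{"index":"msg-logstash-2015.05.27"}\n{"size":500,"query":{"match_all":{}}}\n'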

Jeff

I learned something. I had been using very broad index patterns, such as *-logstash-*. I narrowed the pattern to a specific index, msg-logstash-2015.05.27, and that seems to have solved the problem for most queries. I am now getting a pretty consistent 90% success rate.
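
A rough way to quantify why the broad pattern hurts, using the _cat API (the wildcard below approximates the old broad pattern; hostname as before):

# number of indices the broad pattern expands to; each one adds its shards to every query
curl -s 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_cat/indices/*logstash*' | wc -l

# versus the single daily index the narrowed pattern resolves to
curl -s 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_cat/indices/msg-logstash-2015.05.27?v'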

Thanks for the research. I ran into the exact same issue and the fix turned out to be changing the way index names are resolved in Kibana.
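
Resolving index names from event times matters because Kibana then only queries the daily indices that fall inside the selected time range, instead of every index matching a broad pattern. It is roughly the difference between these two (localhost:9200 and the index names are placeholders following the usual daily logstash convention; compare "took" and "_shards" in the responses):

# what a broad, non-time-based pattern effectively searches
curl -s 'localhost:9200/logstash-*/_search?size=0&pretty'

# only the day (or days) covered by the time picker once index names are resolved from event times
curl -s 'localhost:9200/logstash-2015.05.27/_search?size=0&pretty'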

Earlier, I was using "logstash-*". Under "Settings", what you have to do is check both "Index contains time-based events" and "Use event times to create index names", then select the field that holds the timestamp (by default, @timestamp). Finally, make sure you set this index pattern as the default, and if you still have an index pattern for "logstash-*", delete it.