I inherited an ELK system, and I am not sure if it ever worked properly, but I know it is not working now.
My system has 4 clusters, for 4 different customers. Each cluster is fronted by an F5 load balancer, so that data from logstash agents gets routed to a member of the appropriate cluster. I have run queries using curl and they return correct results, so I believe that ingestion and querying at the elasticsearch level are working.
However, queries from kibana fail about 80% of the time. A failed query shows up as a message in white letters on a red background that says “Discover: an error occurred with your request. Reset your inputs and try again”.
Looking at the HTTP traffic between the kibana server and the browser, I see that the server is returning an HTTP status code of 504, Gateway Timeout.
When I look in the /var/log/nginx/error.log file, I see many errors of the form
[root@n7-z01-0a2a2723 ~]# fgrep semfs-log.starwave.com /var/log/nginx/error.log
2015/05/22 09:22:39 [error] 20338#0: *973501 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.199.242.119, server: , request: "POST /elasticsearch/_msearch?timeout=0&ignore_unavailable=true&preference=1432311692251 HTTP/1.1", upstream: "http://127.0.0.1:5601/elasticsearch/_msearch?timeout=0&ignore_unavailable=true&preference=1432311692251", host: "semfs-log.starwave.com", referrer: "http://semfs-log.starwave.com/"
[root@n7-z01-0a2a2723 ~]#
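If I am reading that right, it is nginx that is timing out while proxying to kibana on 127.0.0.1:5601, not kibana or elasticsearch themselves. As I understand it, nginx’s proxy_read_timeout defaults to 60 seconds, which is much shorter than kibana’s own request_timeout. I assume our proxy block looks something like the following sketch (I have not confirmed our actual nginx config, so treat this as an assumption):

location / {
    proxy_pass http://127.0.0.1:5601;
    # proxy_read_timeout defaults to 60s; raising it to match kibana's
    # request_timeout (300000 ms) is one thing I could try
    proxy_read_timeout 300s;
}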
By way of contrast, if I query elasticsearch directly via curl, it always returns results successfully, although note that the query below took 51 seconds:
[MGMTPROD\silvj170@n7mmadm02 se-mfs]$ curl -XPOST 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_all/_search?pretty' -d '{ "query": { "match": { "host":"n7smtpinadm02.starwave.com" } } }'
{
  "took" : 51197,
  "timed_out" : false,
  "_shards" : {
    "total" : 1540,
    "successful" : 1540,
    "failed" : 0
  },
  "hits" : {
    "total" : 5389758568,
    "max_score" : 5.3979974,
    "hits" : [ {
      "_index" : "msg-logstash-2015.04.23",
      "_type" : "logs",
      "_id" : "AUzjoo87NG6Sot6lPwYa",
      "_score" : 5.3979974,
      "_source":{"message":"mtaqid=t3N0A0p3019620, engine=ctengine, from=<servico@solucaodesaude.com.br>, recipients=<erasmo@disney.com.br>, relay=ibm9.revendaversa.com.br [201.33.22.9], size=10555, spam_class=bulk","@version":"1","@timestamp":"2015-04-23T00:14:13.000Z","host":"n7smtpinadm02.starwave.com","program":"MM","thread":"Jilter Processor 29 - Async Jilter Worker 20 - 127.0.0.1:20518-t3N0A0p3019620","loglevel":"INFO","class":"user.log"}
    }, {
So I assume that the problem is somewhere in how kibana translates its queries into elasticsearch queries. One hypothesis I have is that there is a problem because kibana doesn’t know about the 4 different clusters, but I don’t know how to test this. The request_timeout in kibana.yml is 300000 ms, i.e. 5 minutes, and I am getting failures much faster than that.
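The only test I can think of for the cluster hypothesis is to check which elasticsearch endpoint kibana is configured to use and then ask that endpoint which cluster it actually belongs to. A sketch (the kibana.yml path is a guess on my part; elasticsearch_url is the key name my Kibana 4.0 config uses):

# which endpoint is kibana pointed at?
fgrep elasticsearch_url /opt/kibana/config/kibana.yml

# which of the 4 clusters is that endpoint a member of?
curl 'http://127.0.0.1:9200/_cluster/health?pretty'

I could also replay the same kind of _msearch call that shows up in the nginx log directly against elasticsearch, to time a kibana-style query with the proxies out of the way. Something like this (the index pattern is a guess, and every line in the body file must end with a newline):

$ cat msearch.json
{"index":"msg-logstash-*"}
{"query":{"match_all":{}},"size":1}
$ time curl -XPOST 'n7-z01-0a2a27c8.iaas.starwave.com:9200/_msearch?ignore_unavailable=true&pretty' --data-binary @msearch.json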
Another hypothesis that I have is that the elasticsearch daemons need to be locked in memory. I have been trying to set the mlockall parameter with no success (what I have tried is sketched after the top output below), but it might be a moot point in my case, since top says that I’m not paging:
top - 14:48:39 up 15 days, 2:34, 1 user, load average: 0.08, 0.05, 0.01
Tasks: 102 total, 1 running, 101 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.2%sy, 0.3%ni, 99.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8191792k total, 4241604k used, 3950188k free, 203892k buffers
Swap: 0k total, 0k used, 0k free, 2781356k cached
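For reference, here is roughly what I have been trying for mlockall. The file paths assume the stock elasticsearch RPM layout, which may not match this install:

# in /etc/elasticsearch/elasticsearch.yml:
bootstrap.mlockall: true

# in /etc/sysconfig/elasticsearch, so the init script raises the
# memlock ulimit before starting the JVM:
MAX_LOCKED_MEMORY=unlimited

# after restarting, check whether each node actually locked its memory:
curl 'http://localhost:9200/_nodes/process?pretty' | fgrep mlockall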
The elasticsearch servers are running version 1.4.4. I don’t know how to determine the version numbers of the logstash agents. I am running Kibana version 4.0.0. The kibana server and the elasticsearch servers are running on RHEL 6.6.
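The best idea I have for finding the logstash versions is to ask the package manager on each agent host, assuming the agents were installed from the official RPMs (the tarball path below is a guess on my part):

rpm -q logstash
# for a tarball install, the version is usually in the directory name,
# e.g. /opt/logstash-1.4.2/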
Thank you
Jeff Silverman