Unable to resolve Gateway Timeout errors

At our organization, we have seven Elasticsearch servers holding our primary shards, serving a single Kibana frontend.

I have set request_timeout in the Kibana settings to 600000, and yet I still get these "Gateway Timeout" exceptions from Kibana. The stack trace I see in my browser is:

Error: Gateway Timeout
    at respond (https://hostname.com/index.js?_b=7489:85288:15)
    at checkRespForFailure (https://hostname.com/index.js?_b=7489:85256:7)
    at https://hostname.com/index.js?_b=7489:83894:7
    at wrappedErrback (https://hostname.com/index.js?_b=7489:20902:78)
    at wrappedErrback (https://hostname.com/index.js?_b=7489:20902:78)
    at wrappedErrback (https://hostname.com/index.js?_b=7489:20902:78)
    at https://hostname.com/index.js?_b=7489:21035:76
    at Scope.$eval (https://hostname.com/index.js?_b=7489:22022:28)
    at Scope.$digest (https://hostname.com/index.js?_b=7489:21834:31)
    at Scope.$apply (https://hostname.com/index.js?_b=7489:22126:24)
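
For reference, the relevant part of our kibana.yml looks roughly like this (the Elasticsearch URL is a placeholder for our ELB; the setting names are the Kibana 4.x ones):

    # kibana.yml (Kibana 4.x)
    # The Elasticsearch instance (in our case the ELB) that Kibana queries.
    elasticsearch_url: "http://internal-es-elb.example.com:9200"
    # Time in milliseconds to wait for responses from the back end or Elasticsearch.
    request_timeout: 600000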

We can't seem to get rid of this issue. We are running all of these instances on EC2 in a VPC on Amazon Web Services. All seven Elasticsearch servers are behind an Elastic Load Balancer that forwards traffic to Elasticsearch on port 9200. If I repeatedly curl the ELB, I can see that I'm taken round-robin through the list of hosts.
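
The quick check looks something like this (the ELB DNS name is a placeholder):

    # Hit the ELB a few times; the "name" field in the Elasticsearch banner
    # changes on each request as the ELB round-robins across the nodes.
    for i in $(seq 1 10); do
      curl -s http://internal-es-elb.example.com:9200/ | grep '"name"'
    done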

All of the Elasticsearch servers respond to requests, so I don't think a single failing server is causing Kibana to fail hard.

Why am I getting this Gateway Timeout exception and what can I do to fix the problem?

What versions of Kibana and Elasticsearch are you running?

Kibana is the latest, 4.1.1. Elasticsearch is 1.5.0; we haven't upgraded to 1.6.0 yet.

Internally, NGINX logs a 499, which is defined as:

499 Client Closed Request (Nginx)
Used in Nginx logs to indicate that the connection was closed by the client while the server was still processing its request, leaving the server unable to send a status code back.

The client sees a 504, though this is internally a 499 in NGINX.

I'm not sure who the client is in this case. Is NGINX closing the connection between itself and the local Kibana? Is the browser terminating its connection to NGINX? Is Kibana's connection to the ELB being terminated? Kibana's own logs are pretty unhelpful here; nothing is logged.
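
One thing we still need to double-check is the proxy timeouts in the NGINX config sitting in front of Kibana; a minimal sketch, assuming a standard reverse-proxy location block (the upstream address and values are placeholders, not a confirmed fix):

    # NGINX reverse proxy in front of Kibana: raise the proxy timeouts so
    # NGINX itself doesn't give up on long-running dashboard queries.
    location / {
        proxy_pass            http://127.0.0.1:5601;
        proxy_connect_timeout 60s;
        proxy_send_timeout    600s;
        proxy_read_timeout    600s;
    }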

We set the connection idle timeout and the connection draining policy on the ELB to 300 seconds (5 minutes), and the request fails well before that.
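
To confirm that the new value actually took effect, we check it with the AWS CLI (the load balancer name is a placeholder):

    # Show the current attributes of the ELB, including the idle timeout.
    aws elb describe-load-balancer-attributes \
        --load-balancer-name kibana-es-elb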

The client that is logging this error is the elasticsearch.js client. It is receiving a 504 status code from the browser, so I really don't think there is anything you can do in Kibana to make this work.

What browser are you using? Have you checked the network traffic reported by the browser (Chrome, for instance, exposes a network debugging panel)? Sometimes the text of the response can provide a clue about which part of the stack terminated the response.

When the Kibana server (the one shipping with 4.1.1, at least) receives an error from Elasticsearch, it may respond with a 502, but since you're getting a 504 it must be coming from AWS or from something sitting between the Kibana server and AWS.

Hi,

We are currently using the Amazon Elasticsearch service and Kibana 4.1.0 on an EC2 instance.

I set request_timeout in the Kibana config settings to a large number, 41100000, and I am still getting the Gateway Timeout exception shown below:

Error: Gateway Timeout
    at respond (http://hostname:5601/index.js?_b=7616:86367:15)
    at checkRespForFailure (http://hostname:5601/index.js?_b=7616:86335:7)
    at http://hostname:5601/index.js?_b=7616:84973:7
    at wrappedErrback (http://hostname:5601/index.js?_b=7616:20902:78)
    at wrappedErrback (http://hostname:5601/index.js?_b=7616:20902:78)
    at wrappedErrback (http://hostname:5601/index.js?_b=7616:20902:78)
    at http://hostname:5601/index.js?_b=7616:21035:76
    at Scope.$eval (http://hostname:5601/index.js?_b=7616:22022:28)
    at Scope.$digest (http://hostname:5601/index.js?_b=7616:21834:31)
    at Scope.$apply (http://hostname:5601/index.js?_b=7616:22126:24)

We noticed that this error occurs when we have too many aggregations on the dashboards over large document sets. For our current requirement, we want to get the results even if they take more time. Could you please tell us why we have this issue and how to resolve it?
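
For reference, here is roughly what we have in kibana.yml (Kibana 4.x setting names; shard_timeout is shown at its default and is included only for completeness):

    # kibana.yml (Kibana 4.x)
    # Time in milliseconds to wait for responses from the back end or Elasticsearch.
    request_timeout: 41100000
    # Time in milliseconds for Elasticsearch to wait for responses from shards.
    # Set to 0 to disable.
    shard_timeout: 0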

Did you get an answer to your situation? It seems I have the same problem as well.

Thank you

Hi gh0stid,

Yes. Elasticsearch needs a lot of memory to perform the aggregations, so even when the cluster health is green, we were hitting the same shard failures on our end in Kibana. We worked around it by upgrading our EC2 instance, but this is only a temporary solution, as we cannot keep upgrading the cluster forever (and the cost would be high too).
I then took a snapshot of our cluster, stored the backup on S3, and purged the data older than two months. This resolved the shard failures and also preserved our data. Whenever we need the older data from S3, we can restore the snapshot into a new cluster and perform the aggregations as usual.
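
Roughly, the snapshot and restore steps look like this (repository, bucket, and snapshot names are placeholders; on a self-managed Elasticsearch 1.x cluster the s3 repository type comes from the AWS cloud plugin, and the Amazon Elasticsearch service has its own requirements for registering the repository):

    # Register an S3 repository for snapshots.
    curl -XPUT 'http://localhost:9200/_snapshot/s3_backup' -d '{
      "type": "s3",
      "settings": { "bucket": "my-es-backups", "region": "us-east-1" }
    }'

    # Take a snapshot of the whole cluster and wait for it to finish.
    curl -XPUT 'http://localhost:9200/_snapshot/s3_backup/snapshot_1?wait_for_completion=true'

    # Later, restore the snapshot into a cluster that has the same repository registered.
    curl -XPOST 'http://localhost:9200/_snapshot/s3_backup/snapshot_1/_restore'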

I had the same issue. The solution is to increase the ELB idle timeout to something above the default of 60 seconds. The instructions on how to do that are here: https://aws.amazon.com/blogs/aws/elb-idle-timeout-control/
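
If you prefer the CLI to the console, the change looks something like this (the load balancer name is a placeholder and 300 seconds is just an example value):

    # Raise the ELB idle timeout from the 60-second default to 300 seconds.
    aws elb modify-load-balancer-attributes \
        --load-balancer-name kibana-es-elb \
        --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":300}}"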

When the ELB times out, upstream from the ELB you will see a 499 error (Client Closed Request) and downstream you will see a 504 error (Gateway Timeout).
