Saved Objects API often results in 503 Timeout

Elasticsearch version: 7.10.0
Kibana version: 7.10.0

All POST API calls to the endpoint https://kibana-our-host.cisco.com/api/saved_objects/query/Example%20Query
result in the following 503 Error, no matter the content of the query:

error: "Service Unavailable"
message: "Request timed out"
statusCode: 503

As soon as we get a 503 error, clicking the "Save Query" button again returns the 409 (Conflict) error below. This tells us that the previous attempt to save our query actually worked, despite returning a 503. Viewing the Saved Queries list also confirms that the first attempt to save worked fine.

error: "Conflict"
message: "[query:Example Query]: version conflict, document already exists (current version [1]): version_conflict_engine_exception"
statusCode: 409

Elasticsearch does not show any signs of memory/CPU pressure, and it does not have any search or index times in excess of a couple of seconds. Yet Kibana shows client response times regularly shooting up to 30000ms (i.e. a 503 Timeout error), which is how long our Kibana timeout is.

Other Kibana activities, such as querying data, work totally fine. It is really just POST calls to /api/saved_objects that are exhibiting this behavior.

I have been unable to debug this, and am wondering if there are any known bugs that could explain it. My thoughts were that maybe the .kibana index is corrupted in some way, that it is related to the version, or that our ingress controller in front of Kibana is causing the timeout (although I looked at our performance metrics and that does not appear to be the case).

Edit: This cluster is deployed and managed via the ECK Operator (image docker.elastic.co/eck/eck-operator:1.2.1).

I'll move this to the Kibana category, as you're calling that API and that's where the issue seems to be.

However, is there a reason you are calling that for your query and not talking to Elasticsearch directly?

Thanks warkolm. I am not making external calls to that endpoint. All of these calls are made directly via the Kibana UI. I have attached and circled in Red the button that facilitates the "Save Query" flow.

Another thing perhaps worth noting: these are all of our .kibana indices. We created our initial Elasticsearch cluster/Kibana instances at version 7.9.0 and have since upgraded them to 7.10.0. I believe that having two .kibana indices is due to the upgrade itself. Do I need to perform some manual cleanup on the old Kibana data?

I was able to solve the issue. I changed my cluster's refresh_interval to 60s early on. I didn't think much of it since 60s worked well for the data I was storing. It turns out that the Saved Objects APIs are highly reliant on this setting. I added a new index template with a higher priority that sets the refresh_interval to 1s for the pattern .kibana*, and the Saved Objects APIs are responsive again.
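For anyone wanting to try the same fix, here is a minimal sketch of what such a template request body could look like. It assumes a composable index template (PUT /_index_template/<name>). The template name kibana-refresh-interval and the priority value 500 are made up for illustration and are not taken from the original cluster. The code only builds and prints the JSON, it does not contact Elasticsearch.

```python
import json


def kibana_refresh_template(priority=500, refresh_interval="1s"):
    """Build a body for PUT /_index_template/<name>.

    The priority value is an assumption: it only needs to be higher than
    that of any other template matching the same indices.
    """
    return {
        "index_patterns": [".kibana*"],
        "priority": priority,
        "template": {
            "settings": {"index": {"refresh_interval": refresh_interval}}
        },
    }


def refresh_settings_update(refresh_interval="1s"):
    """Build a body for PUT /<index>/_settings on an index that already exists."""
    return {"index": {"refresh_interval": refresh_interval}}


if __name__ == "__main__":
    # Hypothetical template name, chosen for this example only.
    print("PUT /_index_template/kibana-refresh-interval")
    print(json.dumps(kibana_refresh_template(), indent=2))
```

Index templates are applied when an index is created, so an already existing .kibana index may also need the setting changed directly, which is what the second helper is for.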

I would recommend Elasticsearch & Kibana ship with a reasonable default setting to address this. It is common practice to override the cluster's default refresh_interval, but most people do not want to sacrifice the performance of Kibana.

Edit: Another solution could be for the Kibana UI to force a refresh during document indexing.

The default is actually 1s.

I understand that, but it is such common practice to override the cluster-wide refresh_interval that I think Elasticsearch should ship with an explicit setting for all .kibana* indices. If a user chose to change the explicit setting or override it with a higher-order index template, then the user would in theory at least have an idea of what is happening in their cluster. Elastic's own documentation suggests increasing this value for a whole array of reasons, without ever mentioning this side effect. A minimalist solution would be for the Kibana/Saved Objects API documentation to call this out.

Like I mentioned, forcing a refresh of the .kibana index on every index operation also sounds like a reasonable solution, given how often the Saved Objects API is used.
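To make the idea concrete: the Elasticsearch index API accepts a refresh query parameter (true, false, or wait_for). Below is a rough sketch of building such a request URL. The base URL http://localhost:9200 is a placeholder, the document ID is the one from the 409 error earlier in this thread, and nothing here reflects how Kibana actually issues its writes.

```python
from urllib.parse import quote, urlencode

# Values accepted by the `refresh` query parameter of the index API.
REFRESH_VALUES = {"true", "false", "wait_for"}


def index_doc_url(base_url, index, doc_id, refresh="true"):
    """Return the URL for indexing one document with an explicit refresh policy."""
    if refresh not in REFRESH_VALUES:
        raise ValueError(f"unsupported refresh value: {refresh!r}")
    path = f"{quote(index, safe='')}/_doc/{quote(doc_id, safe='')}"
    return f"{base_url}/{path}?{urlencode({'refresh': refresh})}"


if __name__ == "__main__":
    # Placeholder host; the document ID comes from the conflict message above.
    print(index_doc_url("http://localhost:9200", ".kibana", "query:Example Query"))
```

With refresh=true the write does not depend on the index's refresh_interval at all, at the cost of an extra refresh per write.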

But regardless, this is not how create APIs should work in my opinion. Why would creating an object ever result in a 503 Timeout if the create operation is successful? This implies that the Saved Objects API logic is simply a wrapper around Elasticsearch queries: an index operation followed by a get operation. That series of operations will always require a refresh_interval of ~1s in order to have a responsive Kibana interface. Instead of doing this, return the 2xx status code for a successful create when it happens, or come up with a better way to keep Kibana running smoothly.

I do not believe this dependency was intentional.

Ok, yes that's a fair point :slight_smile:

Are you willing to raise a GitHub issue for this?

I could do that. I just don't want others to run into this behavior if at all possible since Kibana is used by so many teams.