ES 6.x - Timeouts on REST API operations

This issue appears on ES 6.7

I've been trying to create an ILM policy to delete old indices. I first used the Kibana Console and got a 500 Internal Server Error back.

So I tried the following curl call on the server:

curl -X PUT "localhost:9200/_ilm/policy/delete_old_indices_policy" -H 'Content-Type: application/json' -d'
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "set_priority": {
                        "priority": 100
                    }
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}
'

I got the following error:

{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-lifecycle-delete_old_indices_policy) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-lifecycle-delete_old_indices_policy) within 30s"},"status":503}

What can I do to create the policy?

EDIT: This issue seems to appear for any kind of curl operation I try, not only creating the policy.

This part of the error:

"failed to process cluster event (put-lifecycle-delete_old_indices_policy) within 30s"

indicates that, within 30 seconds, the master node wasn't able to process the cluster state update when the policy was added. This usually means that your cluster is overloaded and unable to process the request.
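
If the master is only intermittently busy, you can also give the request more time to wait before it is rejected. As far as I remember, the put-policy endpoint accepts the usual master_timeout and timeout query parameters, so something along these lines (body shortened, same policy as above) lets it wait longer than the default 30s:

curl -X PUT "localhost:9200/_ilm/policy/delete_old_indices_policy?master_timeout=120s&timeout=120s" -H 'Content-Type: application/json' -d'
{
    "policy": {
        "phases": {
            "delete": {
                "min_age": "30d",
                "actions": { "delete": {} }
            }
        }
    }
}
'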

I increased the resources of the server, but I still get the error... I will have a look at activating Monitoring/Uptime to see if it helps get more information on why it fails.

How many indices and shards do you have in the cluster? If you have a lot, this can lead to a large cluster state and slow updates.
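
A quick way to see whether cluster state updates are piling up on the master is the pending tasks API, for example:

# JSON view of the queued cluster state update tasks
curl -s "localhost:9200/_cluster/pending_tasks?pretty"
# the same information as a compact table
curl -s "localhost:9200/_cat/pending_tasks?v"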

I've got about 30 indices for the data I ingest, each with 5 primary shards and 5 replica shards (now that I've set my second node to be a data node too), plus about 20 assorted monitoring and Kibana indices, each with 1 primary and 1 replica. (Which I hope won't get touched by ILM, now that I think about it.)

How many indices and shards in total in the cluster? Have you read this blog post?

Using the Shards and Indices APIs, I currently count 54 indices and 380 shards (190 primaries and 190 replicas). All shards are below 10GB (the largest is 8.8GB). The cluster contains 2 nodes (1 master/data/ingest, 1 data/ingest).
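
For reference, the counts above came from calls roughly like these (the exact columns/flags are from memory):

# list indices; ?v adds a header row, so subtract it from the count
curl -s "localhost:9200/_cat/indices?v" | tail -n +2 | wc -l
# count shards split into primaries (p) and replicas (r)
curl -s "localhost:9200/_cat/shards?h=prirep" | sort | uniq -c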

I just read the blog post, but I'll probably need to re-read it, as it is a lot of information to digest.

If that is the total number of shards and indices, and not what is created daily, it does not sound excessive.

Yes, it is the total number at the moment, including the indices created in the last 30 days as well as the .security, .kibana*, .elastalert_status, labels and ingest indices.

After reading the blog post, though, I'm considering reducing the number of shards in the daily indices so they are a bit bigger, maybe using the Shrink API.
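
As a sketch of what that would look like for one of the daily indices (the index and node names here are placeholders; the source index needs all its shards on one node and a write block before it can be shrunk):

# 1. move all shards of the source index onto one node and block writes
curl -X PUT "localhost:9200/logs-2019.06.01/_settings" -H 'Content-Type: application/json' -d'
{
    "index.routing.allocation.require._name": "es1",
    "index.blocks.write": true
}
'
# 2. shrink the 5-shard index into a new 1-shard index
curl -X POST "localhost:9200/logs-2019.06.01/_shrink/logs-2019.06.01-shrunk" -H 'Content-Type: application/json' -d'
{
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
    }
}
'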

Still having the issue, but I've been working on another issue about bulk failures not being handled, and my hope is that fixing it will free up some compute resources which are currently taken by the ingest/bulk operations.

Well, I can't make any changes to pipelines either, as those requests time out just like the policy ones, so I'm back to square one.

I've watched the webinar from @davemoore on sizing and scaling, and from my limited understanding it seems the bottleneck is the compute resource. Simplifying things, that would mean I could throw more CPUs at the problem, but one would think that 8 CPUs per node in a 2-node cluster should be enough... I have the feeling I am oversimplifying the issue there.

@Christian_Dahlqvist any further idea I could follow or things I should investigate to have a clearer picture of the issue?

So I can attack the problem from different angles:

  • Give more resources to the nodes:

This is something I already did relatively recently: as I had performance issues with querying and ingesting the logs, I increased the CPU and RAM resources on both nodes. Now both nodes have 8 CPUs, and the main and second node have 64 and 14GB of RAM respectively. I also increased the heap size for the elasticsearch daemon to the maximum of 31GB. Storage is far from an issue, as I have more than half the space available on the volumes I mounted to store the elasticsearch data, but maybe I have to consider the networking aspect because of its configuration.

  • Add node/redistribute the roles:

I have two nodes, a big one and a little one. Both have all the roles, although the little one didn't use to have the data role; giving it that role helped with the performance issues mentioned above. So now, maybe I need to add a node with a dedicated role, or remove one of the roles from an existing node? (A quick snapshot of the current roles is in the cat nodes sketch after this list.)

  • Update to 7.x

I am currently on 6.7.2, recently upgraded from 6.3, but maybe I should consider upgrading further?

  • More monitoring/benchmarking

I could try to set up tools such as Rally, Uptime or Monitoring to get more data and identify the bottleneck.
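
Before going down any of these routes, a quick snapshot of the current roles and resource usage per node can be pulled with the cat nodes API (the exact column list here is from memory, so treat it as an assumption):

curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,ram.percent,cpu,load_1m"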

@Christian_Dahlqvist I've just read about the _nodes/hot_threads API and thought it could help find out where the resources are being used and so pin down the bottleneck. However, I have difficulties interpreting the results. Here is the summary of the threads for the main node:

74.5% (372.4ms out of 500ms) cpu usage by thread 'elasticsearch[es1][masterService#updateTask][T#1]'
21.7% (108.3ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#3]'
17.4% (87.1ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#4]'
15.5% (77.5ms out of 500ms) cpu usage by thread 'elasticsearch[es1][http_server_worker][T#6]'
15.5% (77.4ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#2]'
13.4% (67.2ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#5]'
10.0% (50ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#6]'
7.8% (38.8ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#1]'
5.7% (28.7ms out of 500ms) cpu usage by thread 'elasticsearch[es1][refresh][T#2]'
5.6% (28.2ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#8]'
4.8% (23.8ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#7]'
4.4% (21.9ms out of 500ms) cpu usage by thread 'elasticsearch[es1][http_server_worker][T#10]'
1.1% (5.2ms out of 500ms) cpu usage by thread 'elasticsearch[es1][management][T#2]'
0.0% (117.5micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
0.0% (0s out of 500ms) cpu usage by thread 'ml-cpp-log-tail-thread'
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/6.7.2]'
0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
0.0% (0s out of 500ms) cpu usage by thread 'process reaper'
0.0% (0s out of 500ms) cpu usage by thread 'Connection evictor'

So it seems there are a lot of write threads. After I try to update a pipeline (which times out), a new thread appears named 'elasticsearch[es1][clusterApplierService#updateTask][T#1]'. After trying to add an ILM policy, the 'elasticsearch[es1][http_server_worker][T#6]' thread takes most of the time.

I've saved the whole stack trace in case details for specific threads are needed.
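
For reference, the summary above came from a call roughly like this one (the default interval is 500ms, which matches the numbers; the node name and the other parameters are my additions):

curl -s "localhost:9200/_nodes/es1/hot_threads?threads=9999&ignore_idle_threads=false"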

Anyway, as I suspect the bulk and ingest operations are hogging the resources (without any concrete proof, relying only on my observation that data seems to be imported into Elasticsearch just fine), I plan to schedule a maintenance window and disable the data and ingest roles on the cluster for a limited time while I try a few operations which I hope will help (especially when it comes to the missing failure handling).
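
For the maintenance window, the settings I'd be toggling are the 6.x node role flags in elasticsearch.yml (just a sketch for the second node; a restart is needed for them to take effect):

# elasticsearch.yml (6.x syntax) - temporarily drop the data and ingest roles
node.data: false    # stop holding shards; anything on this node has to reallocate to the other node
node.ingest: false  # stop running ingest pipelines on this node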

Even after I tried to disable the roles, I couldn't execute any REST API call without it timing out. It is slowly becoming quite critical that I'm able to make some changes, but I don't know what to do.

Can anybody help me?
