ES 6.x - Timeout by REST API operations

EldrosKandar · May 13, 2019, 9:29am

This issue appears on ES 6.7

I've been trying to create an ILM policy to delete old indices. I first used the Kibana Console and got a 500 Internal Server Error back.

So I tried the following curl call on the server:

curl -X PUT "localhost:9200/_ilm/policy/delete_old_indices_policy" -H 'Content-Type: application/json' -d'
{
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "set_priority": {
                        "priority": 100
                    }
                }
            },
            "delete": {
                "min_age": "30d",
                "actions": {
                    "delete": {}
                }
            }
        }
    }
}'

I got the following error

{"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-lifecycle-delete_old_indices_policy) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-lifecycle-delete_old_indices_policy) within 30s"},"status":503}

What can I do to create the policy?

EDIT: This issue seems to appear for any kind of curl operation I try to make, not only creating policy.

dakrone · May 13, 2019, 9:25pm

This part of the error:

failed to process cluster event (put-lifecycle-delete_old_indices_policy) within 30s

indicates that within 30 seconds, the master node wasn't able to process the cluster state update when the policy was added. This usually means that your cluster is overloaded and unable to process the request.
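
One way to see what the master node is busy with is to look at its queue of pending cluster state updates. The following is a minimal sketch rather than a definitive procedure: it assumes an unsecured node on localhost:9200 as in the curl call above, that the ILM endpoint accepts the standard master_timeout query parameter, and a hypothetical policy.json file holding the request body. The function names are made up for illustration.

```shell
# Assumption: Elasticsearch listens on localhost:9200 without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Fetch the queue of cluster state updates waiting on the master node.
# Columns of _cat/pending_tasks: insertOrder timeInQueue priority source
fetch_pending_tasks() {
    curl -s "${ES_URL}/_cat/pending_tasks"
}

# Count queued tasks per priority, reading _cat/pending_tasks output on stdin.
count_by_priority() {
    awk 'NF >= 3 { n[$3]++ } END { for (p in n) print p, n[p] }' | LC_ALL=C sort
}

# Retry the policy update while waiting longer than the default 30s for the master.
# policy.json is a hypothetical file containing the policy body.
put_policy_with_timeout() {
    curl -s -X PUT "${ES_URL}/_ilm/policy/delete_old_indices_policy?master_timeout=120s" \
        -H 'Content-Type: application/json' -d @policy.json
}
```

Running fetch_pending_tasks | count_by_priority a few times shows whether the queue drains at all. Raising the timeout only gives the master more time to reach the request; if the queue never empties, the update will still time out.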

EldrosKandar · May 20, 2019, 7:56am

I increased the resources of the server, but I still get the error... I will have a look at activating Monitoring/Uptime to see if it helps get more information on why it fails.

Christian_Dahlqvist · May 20, 2019, 8:41am

How many indices and shards do you have in the cluster? If you have a lot this can lead to a large cluster state and slow updates.

EldrosKandar · May 20, 2019, 8:50am

I've got about 30 indices for the data I ingest, each with 5 primary shards and 5 replica shards (now that I set my second node to be a data node too), plus about 20 assorted monitoring and Kibana indices, each with 1 primary and 1 replica. (Which, now that I think about it, I hope won't get touched by ILM.)

Christian_Dahlqvist · May 20, 2019, 8:52am

How many indices and shards in total in the cluster? Have you read this blog post?

EldrosKandar · May 20, 2019, 9:09am

Using the Shards and Indices APIs, I currently count 54 indices and 380 shards (190 primaries and 190 replicas). All shards are below 10GB (the maximum is 8.8GB). The cluster contains 2 nodes (1 master/data/ingest, 1 data/ingest).
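
For reference, totals like these can be derived from the cat shards API. A minimal sketch, assuming an unsecured node on localhost:9200; the function names are hypothetical:

```shell
# Assumption: Elasticsearch listens on localhost:9200 without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# List one line per shard copy: index name, shard number, p(rimary)/r(eplica), state.
fetch_shards() {
    curl -s "${ES_URL}/_cat/shards?h=index,shard,prirep,state"
}

# Count primaries and replicas, reading the _cat/shards output on stdin.
count_shards() {
    awk '
        $3 == "p" { p++ }
        $3 == "r" { r++ }
        END { printf "primaries=%d replicas=%d total=%d\n", p, r, p + r }
    '
}
```

Usage: fetch_shards | count_shards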

I just read the blog post, but I'll probably need to re-read it, as it is a lot of information to digest.

Christian_Dahlqvist · May 20, 2019, 9:12am

If that is the total number of shards and indices and not what is created daily it does not sound excessive.

EldrosKandar · May 20, 2019, 9:14am

Yes it is the total number at the moment, containing indices which were created in the last 30 days, as well as the .security, .kibana*, .elastalert_status, labels and ingest indices.

After reading the blog, I'm considering reducing the number of shards in the daily indices though, so that each shard is a bit bigger, maybe using the Shrink API.
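
A rough sketch of what shrinking one daily index could look like. The Shrink API requires the target shard count to be a factor of the source count, so an index with 5 primary shards can only be shrunk to 1. The index names and node name are placeholders passed as arguments, the host is assumed to be an unsecured localhost:9200, and the function names are hypothetical. Both calls are cluster state updates, so they would run into the same timeout while the master is backed up.

```shell
# Assumption: Elasticsearch listens on localhost:9200 without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Print every shard count smaller than the source count that is a valid shrink target.
# $1 = number of primary shards in the source index
valid_shrink_targets() {
    n=1
    while [ "$n" -lt "$1" ]; do
        if [ $(( $1 % n )) -eq 0 ]; then
            echo "$n"
        fi
        n=$(( n + 1 ))
    done
}

# Step 1: relocate a copy of every shard to one node and block writes.
# $1 = source index, $2 = name of the node that will hold all shards
prepare_shrink() {
    curl -s -X PUT "${ES_URL}/$1/_settings" -H 'Content-Type: application/json' -d "
    {
        \"settings\": {
            \"index.routing.allocation.require._name\": \"$2\",
            \"index.blocks.write\": true
        }
    }"
}

# Step 2: shrink the prepared index into a new one.
# $1 = source index, $2 = target index, $3 = target number of primary shards
shrink_index() {
    curl -s -X POST "${ES_URL}/$1/_shrink/$2" -H 'Content-Type: application/json' -d "
    {
        \"settings\": {
            \"index.number_of_shards\": $3
        }
    }"
}
```

Usage: valid_shrink_targets 5 lists the allowed targets for a 5-shard index before calling prepare_shrink and shrink_index.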

EldrosKandar · June 4, 2019, 3:00pm

Still having the issue, but I've been dealing with another issue about bulk failures not being handled, and my hope is that fixing it will free up some compute resources that would otherwise be taken by the ingest/bulk operations.

EldrosKandar · June 7, 2019, 8:13am

Well, I can't make any changes to pipelines either, as I get the same timeout as with policies, so I'm back to square one.

I've seen the webinar from @davemoore on sizing and scaling, and from my limited understanding it seems that the bottleneck is the compute resource. Put simply, that would mean I could throw more CPUs at the problem, but one would think that 8 CPUs per node in a 2-node cluster would be enough... I have the feeling I am oversimplifying the issue, though.

@Christian_Dahlqvist any further ideas I could follow up on, or things I should investigate to get a clearer picture of the issue?

EldrosKandar · June 7, 2019, 9:00am

So I can attack the problem from different angles:

Give more resources to the nodes:

This is something I have already done fairly recently: as I had performance issues with querying and ingesting the logs, I increased the CPU and RAM resources on both nodes. Now both nodes have 8 CPUs, and the main and second node have 64GB and 14GB RAM respectively. I also increased the heap size for the Elasticsearch daemon to the maximum of 31GB. Storage is far from being an issue, as more than half the space is still available on the volumes I mounted to store the Elasticsearch data, but maybe I have to consider the networking aspect because of its configuration.

Add node/redistribute the roles:

I have two nodes, a big one and a little one. Both have all the roles; the little one didn't use to have the data role, but adding it helped me with the previously mentioned performance issues. So now, maybe I need to add a node with a dedicated role, or remove one of the roles from an existing node?

Update to 7.x:

I am currently on 6.7.2, recently updated from 6.3, but maybe I should consider updating?

More monitoring/benchmarking:

I could try to set up a tool such as Rally, Uptime or Monitoring to get more data and identify the bottleneck.

EldrosKandar · June 26, 2019, 12:46pm

@Christian_Dahlqvist I've just read about the _nodes/hot_threads API and thought it could help find out where the resources are used and so determine the bottleneck. However, I have difficulties interpreting the results. Here is the summary of the threads for the main node:

74.5% (372.4ms out of 500ms) cpu usage by thread 'elasticsearch[es1][masterService#updateTask][T#1]'
21.7% (108.3ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#3]'
17.4% (87.1ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#4]'
15.5% (77.5ms out of 500ms) cpu usage by thread 'elasticsearch[es1][http_server_worker][T#6]'
15.5% (77.4ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#2]'
13.4% (67.2ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#5]'
10.0% (50ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#6]'
7.8% (38.8ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#1]'
5.7% (28.7ms out of 500ms) cpu usage by thread 'elasticsearch[es1][refresh][T#2]'
5.6% (28.2ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#8]'
4.8% (23.8ms out of 500ms) cpu usage by thread 'elasticsearch[es1][write][T#7]'
4.4% (21.9ms out of 500ms) cpu usage by thread 'elasticsearch[es1][http_server_worker][T#10]'
1.1% (5.2ms out of 500ms) cpu usage by thread 'elasticsearch[es1][management][T#2]'
0.0% (117.5micros out of 500ms) cpu usage by thread 'ticker-schedule-trigger-engine'
0.0% (0s out of 500ms) cpu usage by thread 'ml-cpp-log-tail-thread'
0.0% (0s out of 500ms) cpu usage by thread 'elasticsearch[keepAlive/6.7.2]'
0.0% (0s out of 500ms) cpu usage by thread 'DestroyJavaVM'
0.0% (0s out of 500ms) cpu usage by thread 'process reaper'
0.0% (0s out of 500ms) cpu usage by thread 'Connection evictor'

So it seems there are a lot of write threads. After I try to update a pipeline (which times out), a new thread appears named 'elasticsearch[es1][clusterApplierService#updateTask][T#1]'. After trying to add an ILM policy, the 'elasticsearch[es1][http_server_worker][T#6]' thread takes most of the time.

I've saved the full stack traces in case details for specific threads are needed.

Anyway, as I suspect the bulk and ingest operations of hogging the resources (without any concrete proof, relying only on my observation that data seems to be imported into Elasticsearch just fine), I plan to schedule a maintenance window and disable the data and ingest roles on the cluster for a limited time while I try a few operations which I hope will help (especially when it comes to the missing failure handling).
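
To make a summary like this easier to read, the per-thread figures can be added up per thread pool. This is only a sketch: it assumes an unsecured node on localhost:9200 and the thread naming seen in the listing (elasticsearch[node][pool][T#n]); the function names are hypothetical. Each percentage is a share of one core over the sampling interval, so a pool total can exceed 100.

```shell
# Assumption: Elasticsearch listens on localhost:9200 without authentication.
ES_URL="${ES_URL:-localhost:9200}"

# Fetch the hot threads report, asking for up to 20 threads per node.
fetch_hot_threads() {
    curl -s "${ES_URL}/_nodes/hot_threads?threads=20"
}

# Sum the CPU percentages per thread pool, reading the report on stdin.
# Threads that do not follow the elasticsearch[node][pool][T#n] pattern are skipped.
summarize_hot_threads() {
    awk '
        / cpu usage by thread / {
            # Splitting the line on "][" leaves the pool name in the second piece.
            if (split($0, parts, "\\]\\[") >= 3) {
                sum[parts[2]] += $1 + 0
            }
        }
        END { for (pool in sum) printf "%.1f %s\n", sum[pool], pool }
    ' | LC_ALL=C sort -rn
}
```

Usage: fetch_hot_threads | summarize_hot_threads. Applied to the listing above, the eight write threads add up to about 96% of one core, while the single masterService#updateTask thread alone accounts for 74.5%.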

EldrosKandar · June 27, 2019, 12:24pm

Even after I tried to disable the roles, I couldn't execute any REST API call without running into a timeout. It is slowly becoming quite critical that I'm able to make some changes, but I don't know what to do.

Can anybody help me?
