Process_cluster_event_timeout_exception error on put-mapping in test environment

Hello,

I'm using the docker.elastic.co/elasticsearch/elasticsearch:7.17.5 Docker image to run our tests on the CI server in a single-node configuration.
We're frequently getting the following error while running the tests:

Elasticsearch::Transport::Transport::Errors::ServiceUnavailable:
       [503] {"error":{"root_cause":[{"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-mapping [local_chapter_organiser_requests10/dzf0K-4nRlGVnpO7bHSvlQ]) within 30s"}],"type":"process_cluster_event_timeout_exception","reason":"failed to process cluster event (put-mapping [local_chapter_organiser_requests10/dzf0K-4nRlGVnpO7bHSvlQ]) within 30s"},"status":503}

We run on a Ruby stack, so we're using the elasticsearch Ruby gems.
We've tried setting the master_timeout and timeout parameters when creating the indices, as described in the "Update mapping API" page of the Elasticsearch Guide [8.3], but this hasn't had any effect on the issue. It's definitely possible that we did this incorrectly, as the Ruby gem's docs aren't super clear on how to pass these params, but before digging into that, I wanted to confirm whether setting either of these to a longer timeout would fix the timeout issues.
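
A minimal sketch of passing these parameters with the 7.x Ruby client, assuming the elasticsearch gem's indices.put_mapping method, which takes master_timeout and timeout as top-level keyword arguments next to index and body rather than inside the body. The helper name put_mapping_args, the title field, and the 60s value are hypothetical and only there for illustration.

```ruby
# Builds the keyword arguments for client.indices.put_mapping.
# Hypothetical helper; timeouts are Elasticsearch time strings such as "60s".
def put_mapping_args(index, properties, timeout: "60s")
  {
    index: index,
    body: { properties: properties },
    master_timeout: timeout, # period to wait for a connection to the master node
    timeout: timeout         # period to wait for a response
  }
end

args = put_mapping_args(
  "local_chapter_organiser_requests10",
  { title: { type: "text" } } # illustrative field, not from our real mapping
)
puts args.inspect

# With a running node and the elasticsearch gem installed (assumption):
#   require "elasticsearch"
#   client = Elasticsearch::Client.new(url: "http://localhost:9200")
#   client.indices.put_mapping(**args)
```

The point of the sketch is only the placement of the two timeouts: they travel as request parameters, so putting them inside body would not change how long the request waits.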

Thanks!
Diego.

That should never take that long, so something is likely wrong, and I therefore don't think increasing the timeout will help. How much memory and heap does the node have available? Is there anything else running on the node? How much CPU is allocated? Does the cluster hold a lot of data or a large number of indices and shards? What type of storage are you using?

Hi Christian, thanks for replying.

We're limiting the heap size to 512 MB by passing the ES_JAVA_OPTS=-Xms512m -Xmx512m environment variable to the container. The node shouldn't have much data; we delete the indices after each test runs to ensure they're independent. We do run our tests in parallel processes against a single Elasticsearch node, so we add suffixes to the index names to ensure there's no clashing between processes.
If you think it's a resource constraint issue, I can try allocating more resources to the ES node.

If you do not have a large cluster state due to a large number of shards, I would look for evidence of long GC in the logs or high iowait at the storage level. CPU allocation could possibly also play a part if it is very low.
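
As a rough sketch of what looking for long GC in the logs could mean in practice, the script below scans log lines for the "[gc][N] overhead, spent [X] collecting in the last [Y]" messages that the JVM GC monitor writes. The sample lines are made up for illustration, the helper name gc_overhead_events is hypothetical, and the 0.5 threshold is an arbitrary choice.

```ruby
# Matches the GC overhead message, e.g.
#   [gc][1234] overhead, spent [1.2s] collecting in the last [1.5s]
GC_OVERHEAD = /\[gc\]\[\d+\] overhead, spent \[([\d.]+)(ms|s|m)\] collecting in the last \[([\d.]+)(ms|s|m)\]/

UNIT_SECONDS = { "ms" => 0.001, "s" => 1.0, "m" => 60.0 }.freeze

# Returns one hash per log line where GC took at least min_ratio of the window.
def gc_overhead_events(lines, min_ratio: 0.5)
  lines.map do |line|
    m = GC_OVERHEAD.match(line)
    next unless m

    spent  = m[1].to_f * UNIT_SECONDS.fetch(m[2])
    window = m[3].to_f * UNIT_SECONDS.fetch(m[4])
    ratio  = window.zero? ? 0.0 : spent / window
    next if ratio < min_ratio

    { spent_s: spent, window_s: window, ratio: ratio.round(2) }
  end.compact
end

# Made-up sample lines; real input would come from the container's log output.
lines = [
  "[2022-07-20T10:15:30,123][WARN ][o.e.m.j.JvmGcMonitorService] [ci-node] [gc][1234] overhead, spent [1.2s] collecting in the last [1.5s]",
  "[2022-07-20T10:15:40,456][INFO ][o.e.m.j.JvmGcMonitorService] [ci-node] [gc][1244] overhead, spent [300ms] collecting in the last [1s]",
  "[2022-07-20T10:15:41,000][INFO ][o.e.c.m.MetadataMappingService] [ci-node] some unrelated message"
]

events = gc_overhead_events(lines)
events.each { |e| puts "GC used #{(e[:ratio] * 100).round}% of a #{e[:window_s]}s window" }
```

Because the pattern only targets the message text, it should match whether the node writes plain-text or JSON log lines, as long as the message itself is unchanged.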

We spent some more time on this. After instrumenting our servers to get more fine-grained metrics, and also decreasing the number of parallel tests we run, we're still seeing the issue while CPU and I/O both look fine.

One thing I couldn't find guidance on anywhere is how to run integration tests with Elasticsearch. We currently create indices for all types of documents before each test run, and delete those indices after the test finishes. We've also tried deleting all indexed documents after each run instead, without much success.
Do you have any recommendations on how to run ES for a short time and for only a few documents at a time? Is there a way to run in-memory to avoid I/O waits?

If you are making a lot of changes in parallel that affect the cluster state in a short amount of time, I guess these could be queued up and take a while to process, as the cluster state AFAIK is updated and propagated in a single thread for consistency.
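
One way to check whether updates really are queuing is the cluster pending tasks API (GET _cluster/pending_tasks), which lists cluster-state changes that have not been executed yet. The sketch below summarises such a response. The sample JSON is made up and only follows the shape of the API response, the second and third index names are placeholders, and the helper name summarize_pending_tasks is hypothetical.

```ruby
require "json"

# Summarises a _cluster/pending_tasks response: how many tasks are queued,
# how long the oldest has waited, and how many there are of each kind.
def summarize_pending_tasks(json)
  tasks = JSON.parse(json).fetch("tasks", [])
  by_kind = tasks
            .group_by { |t| t["source"].to_s.split(/[ \[]/).first }
            .transform_values(&:count)
  {
    count: tasks.length,
    max_wait_ms: tasks.map { |t| t["time_in_queue_millis"].to_i }.max || 0,
    by_kind: by_kind
  }
end

# Made-up sample response for illustration.
sample = <<~JSON
  {"tasks": [
    {"insert_order": 101, "priority": "HIGH",
     "source": "put-mapping [local_chapter_organiser_requests10/dzf0K-4nRlGVnpO7bHSvlQ]",
     "time_in_queue_millis": 28500, "time_in_queue": "28.5s"},
    {"insert_order": 102, "priority": "URGENT",
     "source": "create-index [example_index_11], cause [api]",
     "time_in_queue_millis": 12000, "time_in_queue": "12s"},
    {"insert_order": 103, "priority": "HIGH",
     "source": "put-mapping [example_index_12/placeholder-uuid]",
     "time_in_queue_millis": 9000, "time_in_queue": "9s"}
  ]}
JSON

summary = summarize_pending_tasks(sample)
puts summary.inspect

# With a running node, the same JSON could be fetched with, for example:
#   require "net/http"
#   json = Net::HTTP.get(URI("http://localhost:9200/_cluster/pending_tasks"))
```

If the queue is long and the wait times approach 30s while the tests run, that would point at queued cluster-state updates rather than at GC or storage.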