Constant merges in new index

When creating a new index, we often see a significant and sustained increase in the number of internal refreshes. Merges are constantly triggered, which drives up CPU usage and slows down searches. The only workaround we have found so far is to recreate the index with exactly the same parameters (preferably after restarting the cluster), but even that is not always effective.

What could be causing this issue?

Some details:
Elasticsearch 7.17.5
refresh_interval - 5 minutes (though this seems unrelated, as the problematic index consistently shows a much higher number of internal refreshes compared to external ones)
We don't use the Reindex API. Our service creates a new index with the same templates and then indexes the same documents there.
We rarely recreate indexes, so it's unclear what could be triggering this problem now. The only recent changes were the addition of new fields and a rank_feature field (a sketch of the mapping change is below).
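
For context, the rank_feature addition is a mapping change along these lines (field and index names here are placeholders, not our real template):

PUT /our-index/_mapping
{
  "properties": {
    "Popularity": {
      "type": "rank_feature"
    }
  }
}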

I'm attaching a graph of the elasticsearch_indices_merges_total_time_seconds_total metric. You can see how merge time increased after the new index was created.

Welcome!

The more data you add to the index, the more time it will take to merge the segments.
But that's just a guess...

Does it affect the cluster behavior?

Yes, but how would adding more data explain why we constantly have internal refreshes? And why would the issue be solved by creating a new index with exactly the same parameters and data?

What do you mean by affecting the cluster behavior?

I have no idea. I don't really understand what the problem is to be honest.
What are the index settings?

What do you mean by "internal" refreshes? Do you mean not caused by an explicit refresh API call or bulk parameter?

Is that happening on the whole cluster or only for this index?

What is the output of:

GET /_cat/indices?v

What do you mean by affecting the cluster behavior?

Does it slow anything down, or make your cluster crash? What's the problem?

What are the index settings?

   "settings": {
      "index": {
        "refresh_interval": "300s",
        "translog": {
          "flush_threshold_size": "1gb",
          "sync_interval": "300m",
          "retention": {
            "size": "1gb"
          },
          "durability": "async"
        },
        "auto_expand_replicas": "0-all",
        "soft_deletes": {
          "enabled": "false"
        },
        "creation_date": "1716732591241",
        "sort": {
          "field": "SourceId",
          "order": "asc"
        },
        "number_of_replicas": "37",
        "version": {
          "created": "7170599"
        },
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "number_of_shards": "1",
        "merge": {
          "scheduler": {
            "auto_throttle": "false"
          },
          "policy": {
            "max_merge_at_once": "2",
            "max_merged_segment": "100gb",
            "segments_per_tier": "2",
            "floor_segment": "700mb"
          }
        }
      }
    }
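
For completeness, segment-level details for the index can also be checked with something like this (index name is a placeholder here):

GET /_cat/segments/our-index?v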

What do you mean by "internal" refreshes?

In the Index stats, the value of "external_total" is much less than "total". Like:

"refresh": {
          "total": 1264,
          "external_total": 318 },

Usually, these two numbers are very close, e.g. 264 / 218.
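
These numbers come from the refresh section of the index stats, i.e. something like (index name is a placeholder):

GET /our-index/_stats/refresh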

Is that happening on the whole cluster or only for this index?

We have only one actively updated index in the cluster, but I believe the issue is tied to this specific index rather than to the cluster, because the behavior changes when we create a new index in the same cluster.

What is the output of...

We don't currently have an active index with this issue, but the output didn't differ except for the number of deleted documents, which was smaller, presumably because of the constant merges.

health: green
status: open
pri: 1
rep: 37
docs.count: 5987810
docs.deleted: 17987
store.size: 104gb
pri.store.size: 3gb

Does it slow down anything, or make your cluster crashes? What's the problem?

It doesn't cause crashes, but CPU usage is higher and our search queries are slower.

After we recently recreated the index, the same issue recurred in several clusters.

Currently, these clusters are undergoing constant merges and refreshes, which is leading to higher CPU usage and affecting our search latency.

Restarting the cluster and recreating the index occasionally resolves the issue, indicating that it is not related to the number of documents or fields.

If anyone would like to assist in investigating this issue further, the live index with this problem is now available for examination.

What is the specification of the cluster? What type of hardware are you using? Are you using local SSDs?

Here's our DBA's answer:
Most instances have 12 cores and 64 GB of RAM. We store the data in memory, so no data disks.

Elasticsearch requires a configured data path to store its data in, so just keeping it in memory is not an option. I have heard of people trying to use a ramdisk for this, but am not aware of anyone being successful.

How exactly are you doing this?

If your data set is small enough to fit in the operating system page cache, it will effectively be held in memory for querying (not sure if that is what is being referred to), but any changes to the data will still need to be persisted to disk, which will cause merging.
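
If you want to confirm that the CPU usage really comes from merging, the hot threads API usually shows it, e.g.:

GET /_nodes/hot_threads

If the busy threads on the affected nodes are Lucene merge threads, that at least confirms the merges, rather than the searches themselves, are what is consuming the CPU.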