What about a min_docs parameter for the rollover action in Index Lifecycle Management?

Hi all,

I've been experimenting with Index Lifecycle Management (ILM) for indices that are not time series but are used more like a classical datastore (i.e., the data is reindexed from scratch on a regular basis, with few changes each time, and the previous index is dropped).

So I've defined the following policy:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "20gb",
            "max_age": "30m"
          }
        }
      },
      "warm": {
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}

It is applied to indices with these settings:

{
    "number_of_replicas": 0,
    "number_of_shards": 8,
    "refresh_interval": "30s"
}

I'm trying to bulk index about 60 million documents, for a total size of about 70 GB.

Since this data is not time series, the ILM policy creates a new index every 30 minutes (because of the max_age parameter), even after bulk indexing has finished, and moves the previous one to the warm phase (shrink + forcemerge). As a result, the number of empty indices grows indefinitely. I also cannot define a delete action, because I don't want my data to be thrown away without an explicit action on my part.

The max_age parameter is in the policy so that a few documents do not sit for too long in a hot-phase index with lots of small shards. The goal is to limit the overhead on the cluster state (and avoid the gazillion shards problem), since this policy is applied to several other indices.

To balance that, I was thinking of a min_docs parameter, which does not exist yet, in the rollover configuration. It would be taken into account only together with max_age (to avoid ambiguity with max_size), so that the age-based rollover triggers only once a minimum number of documents has been indexed into the new hot index. For example:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "20gb",
            "max_age": "30m",
            "min_docs": "1"
          }
        }
      },
      "warm": {
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}

With such a config, the rollover action would be executed either when the index size hits the 20gb threshold, or when the index is older than 30 minutes AND holds at least 1 document.
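To make the proposed condition concrete, here is a minimal sketch of the decision logic in Python. It is only an illustration of the semantics described above: min_docs is not an existing ILM parameter, and the names IndexStats and should_rollover, as well as the use of >= for the thresholds, are assumptions made for this example.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class IndexStats:
    size_bytes: int      # size of the current hot index
    age_seconds: float   # time since the index was created
    doc_count: int       # number of documents in the index


def should_rollover(stats, max_size_bytes=20 * GB,
                    max_age_seconds=30 * 60, min_docs=1):
    """Hypothetical semantics: max_size alone, OR (max_age AND min_docs)."""
    size_hit = stats.size_bytes >= max_size_bytes
    # min_docs only gates the age condition, never the size condition.
    age_hit = (stats.age_seconds >= max_age_seconds
               and stats.doc_count >= min_docs)
    return size_hit or age_hit
```

With this rule, an empty index that is one hour old would not roll over, while the same index holding a single document would.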

Has this been discussed in the past?

Another way to go, IMO, is to handle the shrink, then the forcemerge, then the deletion of the 8-shard index myself once the indexing process has finished, but I wonder whether ILM could handle such a use case.

Does anyone have a better idea?

Some information about my cluster:
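For reference, the manual route could look roughly like the following sketch, which only builds the ordered list of REST calls and sends nothing. The index names and the node name are hypothetical, and a real script would also have to wait for shard relocation and for the shrunken index to become green between steps.

```python
def manual_compaction_plan(source, target, node_name):
    """Return the ordered REST calls as (method, path, body) tuples."""
    return [
        # 1. Block writes and gather a copy of every shard on one node,
        #    which the shrink API requires.
        ("PUT", f"/{source}/_settings", {
            "index.blocks.write": True,
            "index.routing.allocation.require._name": node_name,
        }),
        # 2. Shrink into a single-shard index and clear the temporary
        #    settings on the target.
        ("POST", f"/{source}/_shrink/{target}", {
            "settings": {
                "index.number_of_shards": 1,
                "index.routing.allocation.require._name": None,
                "index.blocks.write": None,
            }
        }),
        # 3. Merge the shrunken index down to one segment.
        ("POST", f"/{target}/_forcemerge?max_num_segments=1", None),
        # 4. Drop the original 8-shard index.
        ("DELETE", f"/{source}", None),
    ]
```

Calling manual_compaction_plan("data-000001", "data-shrunk", "es-dev-1") yields four requests in the order settings update, shrink, forcemerge, delete.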

GET _cluster/health
{
  "cluster_name" : "es-dev",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 237,
  "active_shards" : 279,
  "relocating_shards" : 3,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Nodes:

GET _cat/nodes?v&h=node.role,master,name&s=name
node.role master name
di        -      es-dev-1
di        -      es-dev-2
di        -      es-dev-3
di        -      es-dev-4
i         -      es-dev-coordinator
m         *      es-dev-master

This hasn't been discussed before.

It sounds a little like you want something like a min_age predicate rather than the max_age so the index doesn't roll over until it is at least 30 minutes old, is that correct?

Out of curiosity, why do you have the 30 minute rollover check currently? You said:

The max_age parameter is in the policy so that a few documents do not sit for too long in a hot-phase index with lots of small shards. The goal is to limit the overhead on the cluster state (and avoid the gazillion shards problem), since this policy is applied to several other indices.

But in your case, if you removed the 30-minute rollover check and kept only the 20gb check, it would be completely fine to have only a few documents in the hot-phase index, since (assuming you are indexing into the alias) there is only going to be one "hot" index at a time.

Hmm, I'm not sure that a min_age would fit the use case because, if I understand correctly, it would not prevent the rollover from happening indefinitely, even after the indexing process has ended.

In fact, I first tried without any age-related predicate, only max_size. It worked well except for the last index automatically created by the rollover of the alias, which can contain very few documents. I know that can be fine, because it is the only index with 8 shards for that ILM-managed data, but I'd like to avoid keeping indices with several shards: the same policy could be applied to numerous other indices, so I'd like to limit the number of shards in the cluster.

I know I could handle that externally, but I was wondering whether ILM could help.

Another idea that comes to mind now would be to add a parameter to the delete action so that it drops an index only if it is empty.

ILM looks like a very good fit for time series indices, but is it also recommended for other datastore indices?
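Until such a parameter exists, the same effect could be approximated externally. The sketch below is an assumption-laden illustration: it takes rows shaped like the output of GET _cat/indices?format=json (where docs.count is returned as a string), and the function name and sample index names are made up for the example. It only selects candidates and deletes nothing itself.

```python
def empty_indices_to_delete(cat_rows, write_index):
    """Pick empty indices, never the index currently written to."""
    candidates = []
    for row in cat_rows:
        count = row.get("docs.count")
        # Skip rows without a document count (e.g. closed indices).
        if count is None:
            continue
        if int(count) == 0 and row["index"] != write_index:
            candidates.append(row["index"])
    return sorted(candidates)


rows = [
    {"index": "data-000001", "docs.count": "21000000"},
    {"index": "shrink-data-000004", "docs.count": "0"},
    {"index": "shrink-data-000005", "docs.count": "0"},
    {"index": "data-000006", "docs.count": "0"},  # current write index
]
to_delete = empty_indices_to_delete(rows, write_index="data-000006")
```

Here to_delete contains only the two empty rolled-over indices. The current write index is kept even though it is empty.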

There isn't currently a parameter for that, and this is an unusual use case, so I'm not sure it makes sense to add one.

I can say that having a single index with 8 shards and only a few documents as the "hot" index is a better state than having multiple empty indices.

ILM looks like a very good fit for time series indices, but is it also recommended for other datastore indices?

Sure, ILM can be used for any sort of index. The only requirement is that you have a time-based sequence of actions you want to take on an index. It doesn't have to be time-series data.

Understood regarding the delete action.

I fully agree, and it is indeed safer when we have a hot index. But what about indices that we know will never be updated again? I wasn't able to find a way, with ILM alone, to get all of their data merged into a cold index without using the max_age parameter, but as mentioned, that parameter continuously creates empty indices.

I think a workaround would be to keep this max_age parameter and, once the indexing process has finished, update the policy to remove it in order to stop the continuous index creation.

But that would make the indexing process aware of the ILM policy, which doesn't seem consistent with the purpose of ILM, does it?

Do you think I should create a GitHub issue to discuss this further, or is this topic sufficient for my rather rare use case?

I have created an issue to make the feature request explicit: https://github.com/elastic/elasticsearch/issues/45900
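As a rough sketch of that workaround, the indexing job could derive the updated policy body from the existing one and then send it with PUT _ilm/policy/<policy name> once bulk indexing is done. The helper name without_max_age is hypothetical, and the sketch only builds the request body without calling the cluster.

```python
import copy


def without_max_age(policy):
    """Return a copy of the policy whose hot rollover has no max_age."""
    updated = copy.deepcopy(policy)  # leave the caller's policy untouched
    rollover = updated["policy"]["phases"]["hot"]["actions"]["rollover"]
    rollover.pop("max_age", None)
    return updated


policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "20gb", "max_age": "30m"}
                }
            }
        }
    }
}
after_indexing = without_max_age(policy)
```

The remaining max_size condition still applies, so the index would keep rolling over on size if indexing resumed later.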
