Hi all,
I've been experimenting with Index Lifecycle Management (ILM) for indices that are not time series but are used more like a classical datastore (i.e., the data needs to be reindexed from scratch on a regular basis, with few changes each time, and the previous index is dropped).
So I've defined the following policy:
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "20gb",
            "max_age": "30m"
          }
        }
      },
      "warm": {
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}
For indices with these settings:
{
  "number_of_replicas": 0,
  "number_of_shards": 8,
  "refresh_interval": "30s"
}
I'm trying to bulk index about 60 million documents, for a total size of about 70GB.
Since this data is not time series, the ILM policy creates a new index every 30 minutes (because of the max_age param), even after bulk indexing has finished, and moves the previous one into the warm phase (shrink + forcemerge), so the number of empty indices keeps growing indefinitely. Also, I cannot define a delete action because I don't want my data to be thrown away without an explicit action on my part.
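To give a rough idea of how fast this grows, here is a minimal Python sketch. It assumes a simplified model in which exactly one age-based rollover happens every max_age once indexing has stopped; it ignores the ILM poll interval and the time spent in shrink/forcemerge, so real numbers will differ somewhat. The function name is made up for illustration.

```python
def empty_indices_after_idle(idle_minutes: int, max_age_minutes: int = 30) -> int:
    """Estimate how many new (empty) indices age-based rollover creates while idle.

    Simplified model (an assumption, not ILM's exact scheduling): one rollover
    each time the current write index reaches max_age.
    """
    if max_age_minutes <= 0:
        raise ValueError("max_age_minutes must be positive")
    return idle_minutes // max_age_minutes


if __name__ == "__main__":
    # Idle periods of one hour, one day and one week after the bulk load.
    for hours in (1, 24, 24 * 7):
        count = empty_indices_after_idle(hours * 60)
        print(f"{hours} h idle -> about {count} empty indices")
```

Under this model, each of those indices is also created with 8 shards before being shrunk, which is the overhead the policy was meant to avoid.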
The max_age param in the policy is there so that a few documents don't sit for too long in a hot-phase index with lots of small shards, in order to limit the overhead on the cluster state (the idea being to avoid the "gazillion shards" problem); the same policy is applied to several other indices.
To balance that, I was thinking about a min_docs parameter, which does not exist yet, on the rollover configuration (to be taken into account only together with the max_age param, in order to avoid ambiguity with the max_size param), so that an age-based rollover is triggered only once a minimum number of documents has been indexed in the new hot index, i.e.:
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "20gb",
            "max_age": "30m",
            "min_docs": "1"
          }
        }
      },
      "warm": {
        "actions": {
          "forcemerge": {
            "max_num_segments": 1
          },
          "shrink": {
            "number_of_shards": 1
          }
        }
      }
    }
  }
}
With such a config, the rollover action would be executed either when the index size hits the 20gb threshold, or when the index is older than 30 minutes AND holds at least 1 document.
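To make the intended semantics explicit, here is a minimal Python sketch of the decision rule. This is only an illustration of the proposal, not how ILM is implemented; the IndexStats type and the should_rollover function are hypothetical names, and min_docs is the proposed parameter.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class IndexStats:
    """Hypothetical snapshot of the current write index."""
    size_bytes: int
    age_minutes: float
    doc_count: int


def should_rollover(
    stats: IndexStats,
    max_size_bytes: int = 20 * GB,
    max_age_minutes: float = 30,
    min_docs: int = 1,
) -> bool:
    """Proposed rule: size alone triggers, age triggers only with enough docs."""
    size_hit = stats.size_bytes >= max_size_bytes
    # min_docs gates the age condition only, never the size condition.
    age_hit = stats.age_minutes >= max_age_minutes and stats.doc_count >= min_docs
    return size_hit or age_hit


if __name__ == "__main__":
    empty_old = IndexStats(size_bytes=0, age_minutes=45, doc_count=0)
    print(should_rollover(empty_old))
```

With this rule, an empty index that is older than max_age simply stays the write index until the next bulk load puts documents into it.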
Has this been discussed in the past?
Another way to go, IMO, is to handle it myself: shrink, then forcemerge, then delete the 8-shard index once the indexing process has finished. But I wonder whether ILM could handle such a use case?
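For reference, the manual route could look roughly like the Python sketch below, which only builds the ordered list of REST calls (update index settings, shrink, forcemerge, delete index) without sending anything. The index names and the manual_compaction_plan helper are hypothetical; es-dev-1 is just one of my data nodes. It assumes the usual shrink prerequisites: a copy of every shard on one node and writes blocked on the source index.

```python
import json
from typing import List, Optional, Tuple

Call = Tuple[str, str, Optional[dict]]


def manual_compaction_plan(source: str, target: str, node_name: str) -> List[Call]:
    """Return the REST calls, in order, as (method, path, body) tuples."""
    return [
        # 1. Gather all shards on one node and block writes (shrink prerequisites).
        ("PUT", f"/{source}/_settings", {
            "settings": {
                "index.routing.allocation.require._name": node_name,
                "index.blocks.write": True,
            }
        }),
        # 2. Shrink to a single shard, clearing the temporary settings on the target.
        ("POST", f"/{source}/_shrink/{target}", {
            "settings": {
                "index.number_of_shards": 1,
                "index.number_of_replicas": 0,
                "index.routing.allocation.require._name": None,
                "index.blocks.write": None,
            }
        }),
        # 3. Merge the shrunken index down to one segment.
        ("POST", f"/{target}/_forcemerge?max_num_segments=1", None),
        # 4. Drop the original 8-shard index.
        ("DELETE", f"/{source}", None),
    ]


if __name__ == "__main__":
    for method, path, body in manual_compaction_plan(
        "catalog-v1", "catalog-v1-shrunk", "es-dev-1"
    ):
        print(method, path)
        if body is not None:
            print(json.dumps(body, indent=2))
```

In practice the script would have to wait for shard relocation to finish between steps 1 and 2, and for the target index to be green before step 4.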
Does anyone have a better idea?
Some information about my cluster:
GET _cluster/health
{
  "cluster_name" : "es-dev",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 237,
  "active_shards" : 279,
  "relocating_shards" : 3,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
Nodes:
GET _cat/nodes?v&h=node.role,master,name&s=name
node.role master name
di        -      es-dev-1
di        -      es-dev-2
di        -      es-dev-3
di        -      es-dev-4
i         -      es-dev-coordinator
m         *      es-dev-master