Why is rollover not automatic and field based?


I was just going through a closed issue that is of interest to me.

Reference link: https://github.com/elastic/elasticsearch/issues/22392

We are actually planning to write some data to Elasticsearch in a time-based manner, which rollover already does. However, as mentioned in the issue, and after reading it, I realized that rollover has to be called explicitly and does not happen automatically. This looks no different from calling the create index API every day at 12 AM instead. Are there any added advantages to calling the rollover API?

I was curious whether there are any plans to incorporate similar features for rolling over indices automatically or based on fields. Field-based rollover gives me the advantage of ensuring I am not keeping any additional data in my current indices. Delays in data pipelines or backfilling of data are pretty normal use cases for pipelines, and in those cases ES falls short, since I have to handle it manually in code.

The rollover API is not necessarily aiming to build time-based indices. Its purpose is rather to help build the perfect shard size. Let me give you an example:

You have 5 nodes and you want to maximize throughput. So what you could do is allocate an index with 5 shards and index at full speed. Now when you do that with a time constraint, i.e. 1 index per day, you end up with 365 * 5 = 1825 shards after 1 year. Most of these shards might not have the perfect size, i.e. containing the maximum number of documents while still meeting your SLAs. With rollover you can specify, for instance, how big your index should be (number of docs or size in bytes), how old it should be, etc. Once you have rolled over, you can use the _shrink API to shrink the index down to a single shard, which optimizes your cluster for capacity planning.
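To make the numbers above concrete, here is a minimal Python sketch. It only does the shard arithmetic and builds a rollover request; it does not send anything. The alias name `logs_write` and the thresholds are hypothetical, and `max_size` is assumed to be available in your Elasticsearch version (older releases only accept `max_age` and `max_docs`).

```python
import json


def daily_shard_count(days, shards_per_index):
    """Shards accumulated when one index is created per day."""
    return days * shards_per_index


def rollover_request(alias, max_docs=None, max_age=None, max_size=None):
    """Build (method, path, body) for a rollover call on a write alias.

    Only the conditions that are actually set are included in the body.
    """
    conditions = {}
    if max_docs is not None:
        conditions["max_docs"] = max_docs
    if max_age is not None:
        conditions["max_age"] = max_age
    if max_size is not None:
        conditions["max_size"] = max_size  # assumption: supported by your version
    return "POST", f"/{alias}/_rollover", {"conditions": conditions}


if __name__ == "__main__":
    # 5 shards per daily index, kept for a year
    print(daily_shard_count(365, 5))
    # Hypothetical alias and thresholds, for illustration only
    method, path, body = rollover_request("logs_write", max_docs=50_000_000, max_age="7d")
    print(method, path, json.dumps(body))
```

Rolling over on document count or size rather than on the calendar is what lets each index approach the target shard size before a new one is created.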

You can also just use it to create time-based indices, and the advantage you have is that it will update and maintain your write alias such that you don't need to maintain the alias yourself. Note: it won't guarantee any atomicity, i.e. docs in flight might still go to the old index. We don't have any tools that can guarantee anything like this.

Rollover is a manual step that should be triggered by a cron job, something like every minute or so. The reason why we don't have that built in is error reporting. What do you do if we can't roll over, and how do you communicate this? It's difficult to give a good enough answer. From the user's perspective, there might be a cron-like service in their infra they can leverage that sends mails if things go wrong.
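As a rough illustration of the cron approach, the sketch below is the kind of script a scheduler could run every minute. It assumes a cluster reachable over plain HTTP at a base URL you supply, and it reads the `rolled_over` flag from the rollover response. The `notify` hook is a placeholder for whatever alerting your infra offers (mail, chat, pager); the function names are illustrative and not part of any Elasticsearch client.

```python
import json
import urllib.request


def http_post(url, body):
    """POST a JSON body and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def try_rollover(base_url, alias, conditions, post=http_post, notify=print):
    """Attempt one rollover; report failures through notify.

    Returns "rolled_over", "not_needed" (conditions not met) or "failed".
    """
    url = f"{base_url}/{alias}/_rollover"
    try:
        result = post(url, {"conditions": conditions})
    except (OSError, ValueError) as exc:
        # Network errors, HTTP errors and malformed JSON all land here.
        notify(f"rollover of {alias} failed: {exc}")
        return "failed"
    return "rolled_over" if result.get("rolled_over") else "not_needed"
```

A crontab entry would then call `try_rollover` with your own URL, alias and conditions, and the error path is handled by tooling you already operate rather than by the cluster.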

I just have a suggestion regarding this feature. I feel the documents-in-flight issue could be solved by deciding on the client, instead of the server, which index a document has to go to. Also, since index creation is managed by the master nodes, receiving multiple requests to create the same index should not cause issues.

You can do this already. This is not the purpose of this API.

I was wondering if this could be picked up as a separate feature in ES, not as a part of this API but separately.

What exactly do you mean? Can you elaborate?

If we could define an alias as a time-based alias, we could have a feature built into the Elasticsearch client which decides which index to write to. This way, we could be strict about the index to which the data is being written.

As an example, say we have an alias 'foo' which is defined as a time-based alias with the index format foo-yyyy.mm.dd.

We get a document on Jan 10, 2018 to write to the index. The ES client infers that the index to write to is foo-2018.01.10 and writes the data to that index, creating it if required.
We get another document on Jan 11, 2018 to write. The inferred index will be foo-2018.01.11, and the document is written there.

This can also be done on the basis of an attribute in the document itself which can be defined as a configuration to the alias.
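The inference described above fits in a few lines on the client side. This is only a sketch: it reads the `foo-yyyy.mm.dd` format as year.month.day, and the document field name `timestamp` (holding an ISO 8601 string) is a hypothetical choice for the attribute-based variant.

```python
from datetime import datetime


def index_for(alias, when):
    """Derive the daily index name, e.g. foo-2018.01.10, from a datetime."""
    return f"{alias}-{when.strftime('%Y.%m.%d')}"


def index_for_document(alias, doc, field="timestamp"):
    """Derive the index name from a date attribute inside the document.

    Assumes doc[field] is an ISO 8601 string such as "2018-01-11T08:30:00".
    """
    when = datetime.fromisoformat(doc[field])
    return index_for(alias, when)


if __name__ == "__main__":
    print(index_for("foo", datetime(2018, 1, 10)))
    print(index_for_document("foo", {"timestamp": "2018-01-11T08:30:00"}))
```

Because the index name comes from the document's own date rather than the arrival time, delayed or backfilled documents still land in the index for the day they belong to.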

I hope this clarifies it. Let me know if anything is unclear.

OK, but why don't you do this on the client and send the doc to the right index from the beginning? Why do you need support for this on the server end?

Yes, that is exactly what we are currently doing. But I guess I did find a flaw in my idea itself. I was under the impression that we could somehow build this as a part of the language-specific clients that ES provides. But this would defeat the purpose of Elasticsearch supporting HTTP API calls, which keeps it language independent.

That clears up my doubt, thanks.

You are very much welcome. Happy you asked.
