Hi,
I have an Elasticsearch 8.18.1 cluster used for bulk data logging and searching. Initially I was using the default settings, but once my index grew past 20 GB I switched to the rollover feature to divide it into manageable chunks, as recommended.
However, this introduced an issue that did not exist before. Inserting new records works fine, but I also need to be able to modify old ones, whether to add a new field or to change the value of an existing one. Once a rollover occurs, the old index is locked, and the write alias cannot be used to modify data in previously rolled-over indices. I understand this is intended behavior, which is why I would ideally rather not change these settings.
As a real-world example, consider a reputation system for an online game that uses third-party profiles to support cross-platform play. The ID field is unique, so a profile can be queried unambiguously. A user who was first seen and rated many years ago, in a now rolled-over index, changes their username and logs into the game server. The game server wants to send this new name to Elasticsearch so the user can be looked up by other players without fetching their internal ID.
Issue: we need Elasticsearch to update the record with this new username, but the record is stored in a rolled-over index.
What alternatives exist for keeping a large index at a manageable size while still being able to add to and modify it without interruption? I would also prefer not to delete and re-create a record, since in rare circumstances a record may need to be updated rapidly.
How large are we talking about? You mentioned that you are rolling over at 20 GB, but that is less than half of the recommended rollover size, which is 50 GB per shard.
If you can estimate the size of your index, you can simply create it with a higher number of primary shards so that each shard stays within the recommended size range. If it later exceeds this size, you can reindex it into a new index with more shards. A minimal sketch of the first step follows.
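Something like this with the Python client, assuming hypothetical names (a `game-profiles-v1` index behind a `game-profiles` alias) and roughly 200 GB of expected data, so four primaries land near the 50 GB guideline:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical sizing: ~200 GB of data / ~50 GB per shard -> 4 primary shards.
# The index name and alias are placeholders for illustration.
es.indices.create(
    index="game-profiles-v1",
    settings={"index.number_of_shards": 4, "index.number_of_replicas": 1},
    aliases={"game-profiles": {}},  # read and write through the alias
)
```

Reading and writing through the alias from day one means your application code does not have to change if you later swap in a new backing index.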
Needing to update documents is a use case that conflicts with rollover, as rollovers are expected to happen on append-only data.
One alternative would be to target the update at the rolled-over index itself rather than the write alias, for example with an update_by_query, as sketched below.
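A sketch of that with the Python client; the backing index name, field names, and values are all placeholders for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Update the username inside a specific rolled-over backing index,
# bypassing the write alias.
es.update_by_query(
    index="profiles-000003",  # the concrete rolled-over index
    query={"term": {"profile_id": "steam:76561198000000000"}},
    script={
        "source": "ctx._source.username = params.name",
        "lang": "painless",
        "params": {"name": "NewUsername"},
    },
    refresh=True,
)
```

If you don't know which backing index holds the document, a search through the alias returns the `_index` of each hit, which you can then target directly.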
Not sure what the issue is here; every update in Elasticsearch creates a new record and marks the old one as deleted.
Forgive my ambiguity - by "delete and re-create" I had meant having to do so programmatically.
This appears to be an error and should indeed be set to 50 GB, thank you for pointing it out. Rollover being targeted at read-only data makes a lot of sense; it looks like I have the wrong use case for it.
Your suggestion tells me the correct solution here would be to create a static index with a higher number of primary shards (sized at up to 50 GB per shard), and if the index grows larger than that, to use the reindex API with a higher shard count, something like the sketch below. I already know the existing size of the index, so this greatly simplifies everything.
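For reference, a rough sketch of that reindex step (index names and shard counts are placeholders; the target has to be created with the higher shard count before reindexing):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the target with more primary shards, then copy everything over.
es.indices.create(
    index="game-profiles-v2",
    settings={"index.number_of_shards": 8},
)
es.reindex(
    source={"index": "game-profiles-v1"},
    dest={"index": "game-profiles-v2"},
    wait_for_completion=True,
)
```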
If you need to update data, deletes get more expensive: it is a lot cheaper to delete old indices wholesale than to delete documents from within an index (each deleted document basically requires an update, as a tombstone record is created). If you still need a single index, I would recommend hiding it behind an alias so you can point the alias at a new backing index when required. If you require consistency and want to keep downtime to a minimum, I would recommend looking at the split index API instead of reindex. It creates a copy of an index with a greater number of primary shards and should be faster and require less downtime than a potentially lengthy reindexing operation. A rough sketch follows.
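A sketch of the split-plus-alias swap, reusing the placeholder names from above; note that the source must be write-blocked before splitting, and the target's primary shard count must be a multiple of the source's:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1. Block writes on the source; the split API requires a read-only source.
es.indices.add_block(index="game-profiles-v1", block="write")

# 2. Split into a copy with more primaries (a multiple of the source's
#    primary shard count, e.g. 4 -> 8).
es.indices.split(
    index="game-profiles-v1",
    target="game-profiles-v2",
    settings={"index.number_of_shards": 8},
)

# 3. Swap the alias to the new backing index in a single atomic step.
es.indices.update_aliases(
    actions=[
        {"remove": {"index": "game-profiles-v1", "alias": "game-profiles"}},
        {"add": {"index": "game-profiles-v2", "alias": "game-profiles"}},
    ]
)
```

Because the alias swap is atomic, searches through `game-profiles` never see an intermediate state.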