Hi Team,
We are on Elasticsearch 8.x, and for data isolation, security, and similar reasons we keep one index per tenant. Since we don't specify the number of shards at index creation, each index gets the default of 1 primary shard.
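For context, a rough sketch of our index creation (client setup and tenant names are illustrative); because we pass no settings, the shard count falls back to the default, whereas a large tenant would have needed an explicit count up front:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# What we do today: no explicit settings, so the index
# gets the 8.x default of 1 primary shard.
es.indices.create(index="tenant-acme")

# What a very large tenant would have needed at creation time,
# since the primary shard count cannot be changed on a live index.
es.indices.create(
    index="tenant-bigco",
    settings={"index": {"number_of_shards": 4, "number_of_replicas": 1}},
)
```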
Most of our indexes stay within the commonly recommended 30–50 GB per shard, but a few very large tenants have 450+ GB of data.
The data in Elastic exists to support full-text search and simple filters; our source of truth is MongoDB (for ACID guarantees), and we ingest changes into Elastic via CDC. We have no data retention policy for these indexes, mirroring how the data is kept in Mongo.
This is not streaming data; it is mutable. Documents go through a create→update→delete cycle, and we index those changes into Elastic via CDC. Write throughput is not that high. We initially considered handling the few very large indexes via rollover, but rollover is not recommended for mutable data like this: an update to a document that lives in an already rolled-over index would create a duplicate in the new write index instead of updating the existing document. Addressing this at the code or query layer (e.g., deduplication logic or aggregations) would add significant complexity and overhead, and still would not guarantee data correctness.
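To make the duplicate problem concrete, here is a simplified sketch of our CDC indexing (index name and event handling are illustrative, and we assume the change stream is configured to include the full document). We index by the Mongo _id, so creates and updates are idempotent upserts against a single index; against a rollover alias, the same write would always land in the newest backing index and leave a stale copy behind:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def apply_cdc_event(event: dict) -> None:
    """Apply one MongoDB change-stream event to the tenant index.

    Indexing by the Mongo _id overwrites the document in place.
    Against a rollover alias this breaks: the write goes to the
    current write index, duplicating documents that live in older
    backing indexes.
    """
    doc_id = str(event["documentKey"]["_id"])
    if event["operationType"] in ("insert", "update", "replace"):
        es.index(index="tenant-acme", id=doc_id, document=event["fullDocument"])
    elif event["operationType"] == "delete":
        # Ignore 404s in case the delete raced ahead of the create.
        es.options(ignore_status=404).delete(index="tenant-acme", id=doc_id)
```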
Next I looked at the Split API, since based on the documentation there is a subtle difference between rollover and split: split produces a single new index with more primary shards, so it does NOT suffer from the duplicate problem that rollover has.
However, split does not work for us either, because of this requirement from the docs: "In order to split an index, the index must be marked as read-only, and have health green." Our CDC pipeline writes continuously, so we cannot keep these indexes read-only.
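For reference, the split flow would look roughly like this (target name and shard count are illustrative); the write block in the first step is exactly what clashes with our continuous CDC writes:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Split requires the source index to be read-only first,
# which would stall our CDC writes for the duration.
es.indices.add_block(index="tenant-bigco", block="write")

# Split into a new index whose primary shard count is a
# multiple of the source's (1 -> 4 here).
es.indices.split(
    index="tenant-bigco",
    target="tenant-bigco-split",
    settings={"index": {"number_of_shards": 4}},
)
```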
What I'm hoping to get from this thread is a recommendation or suggestion on the best option for supporting such large, mutable indexes in Elastic.
Is reindexing our only option? Can you please guide us?
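In case it helps the discussion, this is the reindex approach we would fall back to (new index name and alias are hypothetical): reindex into a new index with more primary shards, then swap an alias so searches cut over atomically:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# New index with an adequate primary shard count.
es.indices.create(
    index="tenant-bigco-v2",
    settings={"index": {"number_of_shards": 4}},
)

# Copy everything over as a background task. CDC updates that
# arrive mid-copy would still need a catch-up pass or a brief
# write pause before the cutover.
task = es.reindex(
    source={"index": "tenant-bigco"},
    dest={"index": "tenant-bigco-v2"},
    wait_for_completion=False,
)["task"]

# Poll until the copy finishes (simplified; no error handling).
while not es.tasks.get(task_id=task)["completed"]:
    time.sleep(10)

# Atomically point the alias the application uses at the new index.
es.indices.update_aliases(actions=[
    {"remove": {"index": "tenant-bigco", "alias": "tenant-bigco-search"}},
    {"add": {"index": "tenant-bigco-v2", "alias": "tenant-bigco-search"}},
])
```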
Thanks,
Moni