How to best handle large Elastic indexes

Hi Team,

We are on Elasticsearch 8.x, and for data isolation and security reasons we keep one index per tenant. Since we don't specify the number of shards at index creation, each index gets the default of 1 primary shard.
Most of our indexes stay within the recommended shard size of 30–50 GB, but a few very large tenants have 450+ GB of data.

The data we store in Elasticsearch is there to support full-text search and simple filters; for ACID compliance our source of truth is MongoDB, and we ingest changes into Elasticsearch via CDC. We have no data retention policy for this index, since that's how the data is structured in Mongo as well.

This data is not a stream and it is mutable. Documents go through a create→update→delete cycle, and we index these changes into Elasticsearch via CDC. Write throughput is not high. We initially considered handling these few very large indexes via rollover, but rollover is not recommended for mutable data like this: document updates would create duplicates in the new (rolled-over) index instead of updating the existing documents in the original index. Addressing this at the code or query layer (e.g., through deduplication logic or aggregation) would introduce significant complexity and overhead, and still could not guarantee data correctness.
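To make the duplicate problem concrete, here is a minimal sketch (index and alias names are hypothetical). Because `_id` uniqueness is only enforced per index, an update routed through a rollover write alias lands in the newest backing index while the old copy survives in the previous one:

```
# Initial write lands in the first backing index
PUT tenant-000001/_doc/42
{ "status": "created" }

# ... rollover happens, tenant-000002 becomes the write index ...

# The CDC "update" for the same _id is routed to the new backing index
POST tenant-write-alias/_doc/42
{ "status": "updated" }

# A search across all backing indexes now returns TWO documents with _id 42,
# one per backing index
GET tenant-*/_search?q=_id:42
```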

Next I was thinking of using the split API, since, based on the documentation, there is a subtle difference between rollover and split: split copies every existing document into the new index, so it does NOT suffer from the duplicate problem that rollover has.

But split does not work for us either, because of this requirement from the split API documentation:

> In order to split an index, the index must be marked as read-only, and have health green.
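For reference, the split workflow looks like this (index names and target shard count are illustrative; the target shard count must be a multiple of the source's):

```
# 1. Block writes; the index remains available for reads
PUT tenant-big/_settings
{ "index.blocks.write": true }

# 2. Split 1 primary shard into 4 (the target index must not exist yet)
POST tenant-big/_split/tenant-big-split
{ "settings": { "index.number_of_shards": 4 } }

# 3. Once the target is green, re-enable writes on it
PUT tenant-big-split/_settings
{ "index.blocks.write": null }
```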

What I want from this thread is a recommendation or suggestion on the best option for supporting such large mutable indexes well in Elasticsearch.

Is Reindexing our only option? Can you please guide?

Thanks,

Moni

You will not be able to do this without preventing writes and deletes to the index for a period of time. However, the write block that the split API requires only lasts for the duration of the operation.

Reindexing does not solve this either, as a reindex only reflects the state of the index when the operation starts, so it would lose any writes or deletes that occur while it runs. The split API should also be a lot faster than a reindex, which makes it easier to use.

I believe we would need some kind of dual ingestion that writes to both the old and new index until reindexing completes?
What do you suggest for tenants with large mutable data, and how can we support that well in Elasticsearch? What options do we have with zero to minimal downtime and no data loss?

Approaches based on dual writing might allow you to do it without downtime, but in my experience this is difficult to pull off in practice without data loss or inconsistency.

I would recommend testing how long the split API takes to execute for your large indices, and using this method: it is likely the best way to increase the number of primary shards with minimal write downtime.
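When you test this, you can watch the split's progress and gate the cutover on cluster health (index name is again hypothetical):

```
# Monitor ongoing shard recoveries for the split target
GET _cat/recovery/tenant-big-split?v&active_only=true

# Block until the target index reaches green (or the timeout expires)
GET _cluster/health/tenant-big-split?wait_for_status=green&timeout=60s
```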