We have data that we ingest 24x7. The data is all timestamped (by CreatedDate), but we have it all going into a single index. Our shards get too big (90 GB each), and we reindex (which we occasionally do anyway due to changes in our product) into a new index with a higher shard count. I would like to leverage rolling time-based indexes. This works fine if you are dealing with static log files, but not so much with our data: the CreatedDate specifies when the source doc we ingested was created, but documents can continue to be updated indefinitely. Our pipeline to Elasticsearch does not know whether an incoming document is new or an upsert of an existing one.
My current plan is to have one index per month, with logic on both the indexing and querying paths that determines the correct index (or indexes, if we are working with a date range) to target. If we don't know the CreatedDate when querying, we will target an alias that hits all indexes within our retention policy.
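To make that concrete, here is a rough sketch of the routing logic I have in mind, using the official Python client. The index pattern (docs-YYYY.MM), the catch-all alias (docs-all), and the field and function names are placeholders for illustration, not our real setup:

```python
from datetime import date, datetime
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_for(created: datetime) -> str:
    """Derive the monthly index name from a document's CreatedDate."""
    return f"docs-{created:%Y.%m}"

def indices_for_range(start: date, end: date) -> list[str]:
    """List every monthly index touched by [start, end], inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"docs-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names

def upsert(doc_id: str, doc: dict) -> None:
    """Index (create or overwrite) the doc into the month its CreatedDate
    falls in. Because the index name is derived deterministically from
    CreatedDate, the same id always lands in the same index, so the
    pipeline never needs to know whether the doc is new or an update."""
    created = datetime.fromisoformat(doc["CreatedDate"])
    es.index(index=index_for(created), id=doc_id, document=doc)

def search(query: dict, start: date | None = None, end: date | None = None):
    """Target specific monthly indexes when the date range is known;
    otherwise fall back to the alias over all retained indexes."""
    if start and end:
        target = ",".join(indices_for_range(start, end))
    else:
        target = "docs-all"  # alias spanning our retention window
    return es.search(index=target, query=query)
```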
Does this sound like a good plan of attack?
Also, our current index has 15 shards. If I were to go to rolling indexes, I plan to reduce that to 2 shards per index. Is there any guidance on where the sweet spot is between having many shards so queries can be parallelized vs. fewer shards so that all the data you are querying is located in close proximity? I know... it depends on your data and your hardware.
Thanks,
~john