I am working on a system to store network flow records in Elasticsearch.
There are too many documents to keep for long periods (~35 million per hour), so we wrote a tool that aggregates them into fewer documents (~8 million per hour). Originally we put all of the aggregate documents into a single index that was rolled over and aged out by ILM.
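For context, this is roughly how the original single index was managed (a minimal sketch using the Python client; the policy name, index name, and thresholds are illustrative, not our real ones):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# ILM policy: roll the write index over daily (or at 50 GB per
# primary shard), and delete old indices after 30 days.
es.ilm.put_lifecycle(
    name="flow-aggregates-policy",
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "1d",
                        "max_primary_shard_size": "50gb",
                    }
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    },
)
```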
We ran into a problem: if anything interrupts the aggregation process (an Elasticsearch connection error, or the program being stopped midway for some reason), we need to delete any documents that were already indexed for that aggregation run before retrying.
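Our cleanup was essentially a delete-by-query keyed on a run identifier (a sketch only; `aggregation_id` is a stand-in for the field we actually stamp on each document):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Remove everything the failed run managed to index before retrying.
es.delete_by_query(
    index="flow-aggregates",
    query={"term": {"aggregation_id": "2024-06-01T13:00"}},
    conflicts="proceed",
    wait_for_completion=True,
)
```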
Deleting that many documents from a large index took a long time (8 million out of maybe 200 million), so we switched to a single index per aggregation period. Now if something goes wrong, we can simply delete that period's index and start over, which is very fast.
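With one index per period, recovery is just this (again, the naming scheme is illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Dropping the whole per-period index is near-instant compared to a
# delete-by-query; the aggregation then reruns from scratch.
es.indices.delete(index="flow-aggregates-2024-06-01-13")
```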
I'm trying to figure out whether the number of indices is, or will become, a problem. Hourly aggregations would create 8760 indices per year. We don't plan to keep them that long, but even a month's worth of hourly aggregations would be about 730. Alternatively, daily aggregations kept for a year would still be 365.
I've read that every index consists of at least one shard, each of which is a full Lucene index, and that I need to be careful how many shards I put on a cluster, but the numbers are a bit fuzzy to me. Is 500 too many? 1000? 10,000?
If using this many indices is a big problem, does anyone have advice on how I could reduce the count? Is there a faster way of deleting a few million documents out of a large index? Maybe re-indexing the 24 hourly aggregations into a single daily index (sketched below), or just doing daily aggregations in the first place?
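The re-indexing idea would look roughly like this (a sketch under my assumed naming scheme, not tested):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

day = "2024-06-01"  # hypothetical date / naming scheme
hourly = [f"flow-aggregates-{day}-{h:02d}" for h in range(24)]

# Merge the 24 hourly indices into a single daily index...
es.reindex(
    source={"index": hourly},
    dest={"index": f"flow-aggregates-daily-{day}"},
    wait_for_completion=True,
)

# ...then drop the hourly indices, going from 24 indices per day to 1.
for name in hourly:
    es.indices.delete(index=name)
```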