Hi everyone, sorry for the somewhat generic title, hopefully I can elaborate effectively.
We're using Elasticsearch to store logs from various applications, operating systems, and network devices ("multiple sources"). We currently create an index per data source so that like logs are stored together (firewall, network, windows, unix, etc.). Each data source has its own index with 4 shards and 1 replica, and on top of that we use daily indexes. The amount of data varies greatly between sources (100GB on the very high side, 10GB on the low end), and the number of fields varies by source as well, with some having more than others. Most fields are integers or not_analyzed, but most documents have one field that uses the standard analyzer. We typically search per data source, but we'll often search for terms across all indexes too (via an alias for that day).
This generally works well, but the shard count gets rather high after a month of data (our retention requirement). After going through a cluster restart today and waiting for all those shards to initialize, I started thinking about our architecture a bit.
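To make the shard growth concrete, here's the arithmetic (the number of data sources below is an assumption for illustration; the real count isn't stated in the post):

```python
# Rough shard-count arithmetic for the per-source daily-index setup.
# data_sources = 8 is an assumed figure for illustration only.
data_sources = 8
primary_shards = 4        # per index, as described above
replicas = 1              # each replica doubles the shard count
retention_days = 30       # retention requirement

shards_per_daily_index = primary_shards * (1 + replicas)   # 8
total_shards = data_sources * shards_per_daily_index * retention_days
print(total_shards)  # 1920 shards held open across the cluster
```

Every one of those shards has to be recovered on a full cluster restart, which is exactly the pain described above.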
Would there be any drawback to having a single daily index for all data instead of having these per data source indexes?
The only benefit I can think of with the multiple indexes would be that searches related to "firewalls" would be constrained to that index. That's nice, but we're not search/score heavy (in fact we don't use scoring at all).
What about IO? Would using one big index split across the 4 nodes, versus multiple smaller indexes, make any difference there? We're not on SSDs, so IO hurts sometimes.
OK, this question is getting a little long-winded, but if anyone has advice or guidance I'd certainly appreciate it. If you want any more details from me, just let me know. Thanks!
Edit: The average total for a day's worth of logs is about 256GB.
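For the single-daily-index idea: taking the ~256GB/day figure above and the commonly cited Elasticsearch rule of thumb of keeping individual shards roughly in the tens of GB (it's a guideline, not a hard limit), a few candidate primary counts work out like this:

```python
# Shard-size sanity check for a single combined daily index.
# daily_total_gb comes from the post; shard-size guideline is a rule of thumb.
daily_total_gb = 256

for primaries in (4, 8, 16):
    shard_gb = daily_total_gb / primaries
    shards_per_day = primaries * 2  # primaries plus 1 replica of each
    print(f"{primaries} primaries -> ~{shard_gb:.0f}GB/shard, "
          f"{shards_per_day} shards/day")
```

So a single daily index with 8 primaries would land at ~32GB per shard and only 16 shards per day cluster-wide, versus 8 shards per day for every data source under the current scheme.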