Hi everyone, sorry for the somewhat generic title, hopefully I can elaborate effectively.
We're using Elasticsearch to store logs from various applications, operating systems, and network devices ("multiple sources"). We currently create an index per data source so that like logs are stored together (firewall, network, windows, unix, etc.). Each data source has its own index with 4 shards and 1 replica, and on top of that we use daily indexes. The amount of data varies greatly between data sources (100GB on the very high end, 10GB on the low end). The number of fields also varies by data source, with some having more than others. Most fields are integers or not_analyzed, but most documents have one field that uses the standard analyzer. We typically search per data source, but will often search for terms across all indexes as well (via an alias for that day).
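Roughly, each per-source template looks something like this sketch (index, type, and field names are simplified examples, not our real mappings; shown with the Python client):

```python
# Sketch of one per-source daily template: 4 primary shards, 1 replica.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

es.indices.put_template(
    name="firewall-template",
    body={
        "template": "firewall-*",  # matches the daily firewall indexes, e.g. firewall-2015.10.28
        "settings": {
            "number_of_shards": 4,
            "number_of_replicas": 1,
        },
        "mappings": {
            "logs": {
                "properties": {
                    "message": {"type": "string"},                           # standard analyzer
                    "src_port": {"type": "integer"},
                    "action": {"type": "string", "index": "not_analyzed"},
                },
            },
        },
    },
)
```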
This generally works well, but the shard count gets rather high after a month of data (our retention requirement). After going through a cluster restart today and waiting for all those shards to initialize, I started thinking about our architecture a bit.
Would there be any drawback to having a single daily index for all data instead of having these per data source indexes?
Elasticsearch: 1.4.4
JVM: 1.8
Nodes: 4
The only benefit I can think of with the multiple indexes would be that searches related to "firewalls" would be constrained to that index. That's nice, but we're not search/score heavy (in fact we don't use scoring at all).
What about IO? Would one big index split across the 4 nodes, versus multiple smaller indexes, make any difference in that regard? We're not using SSDs, so IO hurts sometimes.
OK, this question is getting a little long-winded, but if anyone has advice or guidance I'd certainly appreciate it. If you want any more details from me, just let me know. Thanks!
Edit: The average total for a day's worth of logs is about 256GB.
@otisg Thanks for the tip. I've created new templates that do just this, but before I pull the trigger and test I was wondering if there were any performance-specific pros or cons with doing so. Guidelines or best practices are more what I'm looking for. I'm certainly not looking to drag someone into a long forum exchange to get precise numbers based on my architecture.
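To be concrete, the new templates look roughly like this sketch: one daily index pattern with a type per data source (pattern, type, and field names are illustrative, not our production names):

```python
# Sketch of a single daily index with one type per data source.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

es.indices.put_template(
    name="logs-daily",
    body={
        "template": "logs-*",  # one daily index, e.g. logs-2015.10.28
        "settings": {"number_of_shards": 4, "number_of_replicas": 1},
        "mappings": {
            "firewall": {
                "properties": {
                    "message": {"type": "string"},   # standard analyzer
                    "src_port": {"type": "integer"},
                },
            },
            "windows": {
                "properties": {
                    "message": {"type": "string"},
                    "event_id": {"type": "integer"},
                },
            },
        },
    },
)
```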
Like @otisg says, types might be the right thing here. It's what they're for. Each index has a fixed cost in heap and in cluster state maintenance time, so consolidating indexes and using types can help here. The drawbacks come around scoring (IDF information is shared), aggregations (types can hide sparseness and make doc values less efficient), and mapping conflicts (if two devices send a field with the same name but a different type, you'll have a bad time). If you can live with these things then I think you should look at types.
Sparseness is the biggest problem I can think of. I don't know the doc values data structures super well, but think of it like this: they pick the number of bits used to represent integers for an entire segment. If you have two devices, one that only stores values in the 0-255 range and another that stores values in the 0-64k range, and you mix them into one segment, then the whole segment will need two bytes per integer rather than one. But if you only use these values for searching (not sorting, scripts, or aggregations) then this shouldn't come up.
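To make that concrete, here's a rough back-of-the-envelope illustration of the per-segment width idea (a simplification of the real doc values encoding, not a description of it):

```python
# Rough illustration only: assume a segment picks one fixed width based on the
# largest value it contains. Mixing a narrow-range device with a wide-range one
# makes every value pay for the wider encoding.
def bits_needed(max_value):
    return max(1, max_value.bit_length())

device_a = [0, 17, 255]       # fits in 8 bits on its own
device_b = [0, 1024, 65000]   # needs 16 bits

print(bits_needed(max(device_a)))              # 8
print(bits_needed(max(device_b)))              # 16
print(bits_needed(max(device_a + device_b)))   # 16 -- the mixed segment pays the wider cost
```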
You might have some success with the total_shards_per_node setting. It's a bit touchy because it's a hard limit, but it could be useful to make sure that the shards you are actively writing to are more evenly spread out.
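Something like this sketch, assuming a daily index with 4 primaries and 1 replica spread over your 4 nodes (the index name is made up):

```python
# Hard cap of 2 shards of this index per node: with 8 shard copies (4 primaries +
# 4 replicas) on 4 nodes, this forces an even spread. Shards that can't be placed
# under the limit stay unassigned, so treat it carefully.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

es.indices.put_settings(
    index="logs-2015.10.28",  # hypothetical daily index
    body={"index.routing.allocation.total_shards_per_node": 2},
)
```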
In general you'll get more IO load on indexing with one large index than with four smaller indexes, but your mileage may vary. It's complex.
Thank you very much for the feedback. It certainly gave me some insight into things I wasn't even thinking about (aggregations and sparseness). I've got a test case now with one index and types, so we'll see how it goes.
What are your retention policies for each kind of data source? That should ultimately determine whether to use separate indices or one. Generally speaking, more than one _type in an index should be avoided. When the data is in separate indices you can snapshot/restore at that granular level and also delete older data according to the desired retention policy.
Regardless of which way you go, you can tune the index templates to adjust the number of shards for each index according to the size of the data and your desired indexing rate (see the sketch below). This will help you keep the number of shards under control for the size of your cluster and the amount of resources it has.
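As a sketch of that kind of tuning (source names and shard counts are just examples, not recommendations):

```python
# Give high-volume sources more primaries and low-volume sources fewer, so daily
# shard sizes stay comparable and the total shard count stays manageable.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

shard_counts = {"firewall": 4, "windows": 2, "unix": 1}  # illustrative numbers

for source, shards in shard_counts.items():
    es.indices.put_template(
        name="%s-template" % source,
        body={
            "template": "%s-*" % source,
            "settings": {"number_of_shards": shards, "number_of_replicas": 1},
        },
    )
```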
Also, regardless of your choice, please consider upgrading to the latest 1.x version (1.7.3), as you will see massive improvements in restart performance thanks to the synced flush improvements that came in 1.6.0+.
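For planned restarts after the upgrade, the idea is to issue a synced flush first so replicas can recover from their sync_id markers instead of copying segments over the network. A minimal sketch (the client method name may differ between client versions; the REST endpoint is POST /_flush/synced):

```python
# Synced flush (ES 1.6+) before a planned full-cluster restart.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

es.indices.flush_synced(index="_all")  # REST equivalent: POST /_flush/synced
```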
Retention policy is the same across all data sources: 6 weeks, then we close the index but keep it around for 6 months before deleting. However, some data sources are bigger than others, so perhaps tuning shard count per index as you suggest would be beneficial; I'll investigate today.
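For reference, the retention steps themselves are just close and delete calls against the old daily indexes (index names below are hypothetical; in practice a tool like Curator drives the date math):

```python
# Retention flow: close indexes older than 6 weeks, delete indexes older than 6 months.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

es.indices.close(index="firewall-2015.09.15")   # past 6 weeks: closed but kept on disk
es.indices.delete(index="firewall-2015.04.15")  # past 6 months: removed for good
```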
I didn't know that more than one type in an index should be avoided. Thanks for the insight there. Is this just common knowledge, or is there some Elasticsearch documentation I missed? Definitely good to know, thanks.
We're making the upgrade to 1.7.3. I've recently read up on the synced flush and now have a maintenance window in 2 weeks to update. Thanks for the info on that as well. Much appreciated.
Ah yes, I've read up on that. I think we'd be good to go on that front since the fields between the types we'd be using wouldn't be ambiguous or have different mappings. Though I could see that turning into a potential issue as new data sources are added. I've read that 2.0 should help solve that problem as well (though 2.0 is still a bit out for us).
I've had a similar issue; however, we also had restrictions on the query time for certain types.
From what I've seen, if you put a mix of many different types and documents into a single shard, querying a single type can carry some overhead compared to having that type in an index of its own.
Also, I think that if your different types share certain field names, it can impact query performance, as I believe it creates a bit mask over all documents that have that field.
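To illustrate the comparison (index, type, and field names are made up): searching one type inside a shared index is effectively the same search plus a filter on the type, versus hitting a dedicated index directly.

```python
# Same logical search two ways: against one type in a shared daily index (ES filters
# on the internal _type field under the hood) vs. against a dedicated per-source index.
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])  # hypothetical cluster address

query = {"query": {"match": {"message": "deny"}}}

es.search(index="logs-2015.10.28", doc_type="firewall", body=query)  # shared index + type
es.search(index="firewall-2015.10.28", body=query)                   # dedicated index
```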