I have since updated to 7.15 and have reduced the number of shards per index to 2 (from 5) with 1 replica. The cluster runs on 3 nodes with 16 GB of memory each.
However, since I have many indices (1800+), I still end up with many shards. From the docs, the suggestion seems to be 20 shards per GB of heap. This is far lower than the number of indices I have.
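To put rough numbers on that gap, here is a minimal sketch of the arithmetic. The index, shard, replica and node counts come from the post. The 8 GB heap per node is an assumption (half of the 16 GB of memory), since the actual heap size is not stated.

```python
# Figures taken from the post.
indices = 1800
primaries_per_index = 2
replicas = 1
nodes = 3

# Assumption (not stated in the post): each node gives half of its
# 16 GB of memory to the JVM heap.
heap_gb_per_node = 8
shards_per_gb_heap = 20  # rule of thumb quoted from the docs

# Every primary shard has `replicas` copies, and each copy is a shard too.
total_shards = indices * primaries_per_index * (1 + replicas)
shards_per_node = total_shards / nodes
guideline_per_node = heap_gb_per_node * shards_per_gb_heap

print(f"total shards:        {total_shards}")
print(f"shards per node:     {shards_per_node:.0f}")
print(f"guideline per node:  {guideline_per_node}")
```

Under that heap assumption, the cluster holds roughly 15 times as many shards per node as the guideline suggests.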
The template is very simple, with only 5 fields per index. The indices contain time series data from many different devices. Currently I have 1 index per device... but as you can imagine this will keep growing. In total the DB is using 70 GB per node.
I am wondering if I should restructure my data to make things more efficient and allow the server to cope better. But I am not clear on whether this would be the right way to go.
I am considering moving all the data into 1 index and adding the deviceID as a field in the template. I would then use ILM to create new indices and manage the size of each index/shard, limiting it to 10 GB or so.
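A minimal sketch of what that setup could look like, written as the request bodies for `PUT _ilm/policy/<name>` and `PUT _index_template/<name>`. The names `devices-policy`, `devices-*` and `devices`, the `@timestamp` field, and the `keyword` mapping for `deviceID` are hypothetical choices for illustration. The 10 GB limit is the figure mentioned above, applied here per primary shard.

```python
import json

# Body for: PUT _ilm/policy/devices-policy   (policy name is hypothetical)
# Roll over to a new index once any primary shard reaches about 10 GB.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "10gb"}
                }
            }
        }
    }
}

# Body for: PUT _index_template/devices      (template name is hypothetical)
index_template = {
    "index_patterns": ["devices-*"],
    "template": {
        "settings": {
            "number_of_shards": 2,
            "number_of_replicas": 1,
            "index.lifecycle.name": "devices-policy",
            "index.lifecycle.rollover_alias": "devices",
        },
        "mappings": {
            "properties": {
                # Assumed mapping: exact-match device ID plus a timestamp.
                "deviceID": {"type": "keyword"},
                "@timestamp": {"type": "date"},
            }
        },
    },
}

print(json.dumps(ilm_policy, indent=2))
print(json.dumps(index_template, indent=2))
```

With a rollover alias, the first index (for example `devices-000001`) still has to be created by hand with `devices` as its write alias before ILM can take over.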
How will this impact my queries, reading and writing, and the resource requirements on the server? Would searching one index (and having to filter by deviceID as well) be significantly slower, and would there be a change in the resources required to manage this change in schema?
An index per device sounds very inefficient. Over 1800 indices for only 70GB of data sounds wasteful.
This is what I would recommend.
Querying could potentially be a bit slower if you usually query only a single device ID, but it is a solution that will scale much better and allow you to handle larger volumes of data. This will all depend on your queries and latency requirements.
If query performance turns out to be a problem, one way to improve it while still combining device data into fewer, larger shards is to use routing. If you give each time-based index, say, 10 primary shards and route by deviceID when indexing and querying, a query that filters on a single device ID only needs to hit 1 of the 10 shards. You can still run queries across all shards if you are querying multiple device IDs.
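As a rough sketch of what routing by device ID looks like in practice, the helpers below build the HTTP requests rather than send them. The index alias `devices`, the field names and the helper functions are hypothetical. The `routing` query parameter on the index and search APIs is the real mechanism.

```python
from urllib.parse import quote

def build_index_request(index: str, device_id: str, doc: dict):
    """Return (method, path, body) for indexing a document routed by device ID."""
    path = f"/{index}/_doc?routing={quote(device_id, safe='')}"
    # Keep deviceID in the document too: several devices share a shard,
    # so queries still need to filter on it.
    body = dict(doc, deviceID=device_id)
    return "POST", path, body

def build_search_request(index: str, device_id: str):
    """Return (method, path, body) for a search limited to one device's shard."""
    path = f"/{index}/_search?routing={quote(device_id, safe='')}"
    body = {"query": {"bool": {"filter": [{"term": {"deviceID": device_id}}]}}}
    return "GET", path, body

method, path, body = build_index_request(
    "devices", "device-42", {"@timestamp": "2021-10-01T12:00:00Z", "value": 3.7}
)
print(method, path, body)

method, path, body = build_search_request("devices", "device-42")
print(method, path, body)
```

The same routing value has to be used for writes and reads. Leaving `routing` off the search falls back to querying every shard, which is what you want for multi-device queries.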
Hi @Christian_Dahlqvist, thanks for the quick reply... and glad to hear I am heading in the right direction.
For the above config, what would you suggest is a good starting point for the number of shards for the index, considering ILM as well?
I've looked into routing ("Customizing Your Document Routing" on the Elastic Blog), but my devices will generate different amounts of data. At what ratio would this start being a problem in terms of shard sizes?
E.g. 1 device has 6M docs while others have 10k documents. Routing does not balance the shards, I believe. How do I decide if this is a problem in my case? Is there a common-sense limit where custom routing starts causing a problem in terms of the balance of shard sizes or "hotspots"?
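One rough way to gauge the skew before committing to routing is to simulate where each device's documents would land. This is only a sketch: Elasticsearch hashes the routing value with Murmur3, while CRC32 is used here as a stand-in, so the exact shard assignments will differ. The device names are made up, the 1800 devices and the 6M and 10k document counts come from the thread, and 10 shards is the figure suggested above.

```python
import zlib

def shard_for(routing_value: str, num_shards: int) -> int:
    # Stand-in hash (CRC32), NOT the Murmur3 hash Elasticsearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards

def shard_doc_counts(docs_per_device: dict, num_shards: int) -> list:
    """Sum the documents of every device that maps to each shard."""
    counts = [0] * num_shards
    for device_id, doc_count in docs_per_device.items():
        counts[shard_for(device_id, num_shards)] += doc_count
    return counts

# Hypothetical fleet: 1799 small devices and one very large one.
docs_per_device = {f"device-{i}": 10_000 for i in range(1, 1800)}
docs_per_device["device-0"] = 6_000_000

counts = shard_doc_counts(docs_per_device, num_shards=10)
mean = sum(counts) / len(counts)
skew = max(counts) / mean  # 1.0 means perfectly balanced

for shard, count in enumerate(counts):
    print(f"shard {shard}: {count:>10,} docs")
print(f"largest shard is {skew:.2f}x the average")
```

Running this against real per-device document counts gives a concrete ratio to reason about, instead of guessing at which point a single heavy device turns its shard into a hotspot.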