tl;dr fewer data nodes with more disk space each vs more data nodes with smaller disk space each?
Our current setup consists of 3 master nodes and 4 data/ingest nodes, each with a large amount of disk. Our daily indices are roughly 400GB, so we use 8 primary shards to get a shard size of ~50GB, plus 1 replica per shard.
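In index-template terms, that setup corresponds to something like the following (the template name and index pattern are just placeholders, not our actual ones):

```
PUT _index_template/daily-logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 8,
      "number_of_replicas": 1
    }
  }
}
```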
How is performance affected by the data node count? Our 4 data/ingest nodes have large disks, and we are thinking about doubling the number of data nodes while halving their disk size. Is this beneficial?
Can you maybe explain what makes you want to do this switch? There are many things to factor in, so this is not a black/white answer. Fewer nodes probably means fewer moving parts in a distributed system. More nodes probably means more throughput, as you can write more data in parallel (unless you decide to have more but less powerful nodes). More shards means you probably need slightly more disk space. You could also use index lifecycle management to reduce the number of shards when using a time-based data flow... many things to take into account.
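For example, with a time-based flow, shrinking older indices in the warm phase could look roughly like this (policy name, timing, and target shard count are just placeholders to illustrate the idea):

```
PUT _ilm/policy/shrink-daily-logs
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "1d",
        "actions": {
          "shrink": { "number_of_shards": 1 }
        }
      }
    }
  }
}
```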
So, what is the primary driver of adding/removing nodes? And what is your definition of performance? Faster indexing? Faster querying?
Re: fewer nodes = fewer moving parts: in our case this is not such a big deal, as we are running ES in Kubernetes, which manages that complexity for us and lets us scale data nodes up and down quite easily.
Re: more nodes = more throughput, that sounds like a good thing, although we don't have any significant delays in logs reaching ES atm.
Re: more shards needing more disk space: I did not know that. But it's often said that shards should not be much larger than 50GB, although I don't have a deep understanding as to why; that guideline, together with the size of a single-shard daily index in our case (~400GB), is what led us to go with 8 shards.
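For reference, actual per-shard sizes can be checked with the _cat shards API (the index pattern below is a placeholder):

```
GET _cat/shards/logs-*?v&h=index,shard,prirep,store&s=store:desc
```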
Furthermore, after increasing the shard count, we noticed that resource usage is much more equally spread out amongst our 4 nodes.
Re: ILM, we are using it to change the shard count but that's it.
Faster free-text searches is our main goal atm. We also welcome increased throughput, although that's not a main pain point today. I've also noticed that at any given time, 2 of our 4 nodes are using much more disk space than the other 2; it would be nice to even out disk usage across all nodes, if possible.
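As a starting point for that, per-node disk usage and shard counts can be checked with the _cat allocation API:

```
GET _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent
```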
Don't be too worried about the slightly bigger disk footprint. But now there is at least a goal: faster free-text searches.
What does such a search look like? Does it always need to cover the full data set? What do your filters look like? Are you searching for documents, or just aggregating data and thus able to use the request cache? How much main memory do you have available for the file system cache?
As you can see, there are a couple of different angles to tackle this problem, and maybe there is a smarter approach than just throwing hardware at it (or maybe not)
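To make the filter question concrete, a free-text search that still narrows the data set with cacheable filters might look something like this (index pattern and field names are made up):

```
GET logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } },
        { "term": { "kubernetes.namespace": "payments" } }
      ],
      "must": [
        { "match": { "message": "connection refused" } }
      ]
    }
  }
}
```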
Our use of ES, through Kibana, is to investigate logs. We filter on fields wherever possible, but when we don't have those handy we resort to searching the entire dataset. We are looking to improve the latter, but also just want general guidelines on how to determine a proper data/ingest node count.
So far you've mentioned that more data nodes will help increase write throughput; does that affect read throughput as well?