ES is designed for horizontal scaling. Does that mean that as more nodes are added to the cluster, search performance gets faster?
I ran the same query on a single node and on a cluster containing 3 master nodes, 4 data nodes, and 1 client node. However, judging by the "took" value in the response of each query, the search seems slower on the cluster.
This will depend on what limits the performance of your query and whether it is able to benefit from the additional system resources available in the cluster. When you talk about performance, are you referring to query latency or query throughput?
What is your setup? How much data? How many shards? What type of queries?
I have 43 GB of address data, with a 2.2 GHz Xeon E5 CPU, 16 GB of RAM, and a regular hard drive. I use the data for three kinds of queries: geocoding (a bool query that first filters on postal code and house number as keywords, then matches the street name as a string), K-nearest-neighbor search from a given latitude/longitude, and aggregation by geohash.
I created 4 shards with no replicas. In the cluster with four data nodes, each node holds one shard. I tracked the "took" value for each query and found that the average response time over 1 million queries is slower on the cluster than on a single node (which acts as master, data, and client node at once).
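For reference, the geocoding query described above could look roughly like the sketch below, written as a Python dict in Elasticsearch query DSL. The field names (postal_code, house_number, street_name), the sample values, and the helper build_geocode_query are hypothetical, since the actual mapping was not posted.

```python
import json


def build_geocode_query(postal_code, house_number, street_name):
    """Build a bool query body: exact-match filters plus a full-text match.

    The field names are assumptions for illustration, not the real mapping.
    """
    return {
        "query": {
            "bool": {
                # Filter clauses do not contribute to scoring and can be cached.
                "filter": [
                    {"term": {"postal_code": postal_code}},
                    {"term": {"house_number": house_number}},
                ],
                # The street name is analyzed text, so it uses a match query.
                "must": [
                    {"match": {"street_name": street_name}},
                ],
            }
        }
    }


body = build_geocode_query("10115", "12", "Invalidenstrasse")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of a search request with whichever client is in use.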
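A minimal sketch of how the "took" values could be summarized is shown below. It assumes the search responses have already been collected as parsed JSON dicts; the helper name summarize_took and the sample numbers are made up for illustration. Note that "took" is reported in milliseconds and covers only the time spent inside Elasticsearch, not the network round trip to the client.

```python
from statistics import mean, median


def summarize_took(responses):
    """Summarize the "took" field (milliseconds) of parsed search responses."""
    took = [r["took"] for r in responses]
    return {
        "count": len(took),
        "mean_ms": mean(took),
        "median_ms": median(took),
        "max_ms": max(took),
    }


# Made-up sample responses, reduced to the one field that matters here.
sample = [{"took": 12}, {"took": 8}, {"took": 40}, {"took": 10}]
print(summarize_took(sample))
```

Looking at the median and maximum alongside the mean helps show whether a few slow outliers are dominating the average.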
With an increasing number of shards and nodes, more data needs to be moved between the nodes, and this may increase latency. I would, however, expect you to be able to handle a higher concurrent query throughput with the larger cluster.
Benchmark with as realistic a load as you can (both type and volume). You can tune the cluster differently depending on whether you are looking for low latency with a few concurrent queries or for the ability to serve a very high number of concurrent queries.
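To make the latency versus throughput distinction concrete, here is a rough benchmarking sketch. The run_query function is a stand-in that only sleeps for 10 ms; it is an assumption for illustration and would need to be replaced with a real search request against the node or cluster under test. The names timed_call and benchmark are likewise hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(_):
    """Stand-in for a real search call; replace with a request to your cluster."""
    time.sleep(0.01)  # simulated 10 ms query


def timed_call(i):
    """Run one query and return its latency in seconds."""
    start = time.perf_counter()
    run_query(i)
    return time.perf_counter() - start


def benchmark(num_queries, concurrency):
    """Issue num_queries using `concurrency` worker threads."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(num_queries)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "avg_latency_s": sum(latencies) / len(latencies),  # per-query latency
        "throughput_qps": num_queries / wall,              # queries per second
    }


if __name__ == "__main__":
    for c in (1, 4, 16):
        print(benchmark(64, c))
```

Running the same script against the single node and against the cluster at several concurrency levels should show whether the cluster sustains higher throughput even when the latency of an individual query is no better.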
Actually, my data is fairly static; once I have indexed it, I will not add more to it.
What if I index more data into both the single node and the cluster? Would the single node then be slower than the cluster?
Can I say that the cluster cannot improve the time of a single query, but that it does increase the capacity to handle a high number of concurrent queries? By the way, do I need to change the thread pool settings on the cluster in order to improve its concurrency capacity?
I really appreciate your time and efforts regarding this.