We have a setup with an ES cluster with three nodes. On the cluster we have an index with our product catalog.
When creating this index, we do not explicitly set the number of shards or replicas.
The index is around 1.2 GB in size, containing around 150,000 documents.
What we are now experiencing is that some of our search results are not looking as we would like, and I've determined that is probably because the documents are fragmented in too many shards, causing the scoring to not be consistent across shards.
What I am thinking now is to make the index have just one shard. But having three nodes in the cluster, what should I then set the replica number to? Should I just leave the default (1), or would that render one of the nodes obsolete? Or will the index still be replicated to all three nodes, if there is only a single shard?
Or should I specify 2 replicas, because I know the exact size of my cluster?
You haven't specified the number of shards in your setup, but the default is generally 5, so I assume that applies in your case. I also assume this is not a kagillion-shards case (Kagillion Shards | Elasticsearch: The Definitive Guide [2.x] | Elastic). 150K documents is relatively small, and a single shard should be able to handle that quite easily, provided you have enough CPU and memory available.
What we are now experiencing is that some of our search results are not looking as we would like, and I've determined that is probably because the documents are fragmented in too many shards, causing the scoring to not be consistent across shards.
Can you elaborate a bit on the inconsistency that you are observing?
What I am thinking now is to make the index have just one shard. But having three nodes in the cluster, what should I then set the replica number to? Should I just leave the default (1), or would that render one of the nodes obsolete? Or will the index still be replicated to all three nodes, if there is only a single shard?
Increasing the number of replicas primarily adds fault tolerance in case of a node failure: with N replicas, the index can tolerate up to N node failures. If the index has a single shard and the replica count is 1, only two of the three nodes will hold a copy of it, so the third node will hold no data for that index. When an index has multiple shards, the shards are allocated across the nodes to balance the load.
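To make the replica arithmetic concrete, here is a minimal Python sketch. It assumes a single-shard index on a three-node cluster and relies on the fact that Elasticsearch never places two copies of the same shard on one node. The helper names are hypothetical, and the settings body only shows the shape of the index settings; it is not sent anywhere.

```python
import json


def index_settings(primaries: int, replicas: int) -> dict:
    """Build an index-creation body with explicit shard and replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


def nodes_holding_single_shard(node_count: int, replicas: int) -> int:
    """For a one-shard index: how many nodes end up holding a copy.

    There are 1 + replicas copies in total (primary plus replicas), and
    each copy must live on a different node.
    """
    return min(node_count, 1 + replicas)


if __name__ == "__main__":
    for replicas in (1, 2):
        held = nodes_holding_single_shard(node_count=3, replicas=replicas)
        print(f"replicas={replicas}: {held} of 3 nodes hold the shard")
    print(json.dumps(index_settings(primaries=1, replicas=2), indent=2))
```

With one replica, two nodes hold the shard and the third holds nothing for this index; with two replicas, every node holds a full copy.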
You are correct, the current shard count is just the default of 5.
What I am seeing in my results is something like this: If I search for "horse" then I would get results similar to:
"horse" (shard 1, score 9.9)
"horse feed" (shard 1, score 9.7)
"horse" (shard 2, score 9.6)
"horse feed" (shard 2, score 9.5)
So "horse" will score better than "horse feed" within the same shard, but "horse" in shard 2 might score lower than "horse feed" from shard 1.
With regards to the cluster, we basically have three web servers, and each server has its own ES node locally. So in my mind, it makes sense for performance reasons to make sure that the product index is replicated to all 3 nodes, so each web server has the least amount of network latency when querying. Is that a sensible conclusion?
We do have other (smaller) indexes on the same cluster, but I would do the same thing for all the indexes, so each ES node has all shards at any time.
It may not be a good solution in the long run. Instead of running nodes with the data role on the web servers, you could run coordinating-only nodes there (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-node.html). Running a data node on every web server may have cost implications, especially as your data grows and as the web servers are scaled out. The web server itself also consumes memory, so down the line you may run into situations where not enough memory is available to the ES node. You can find some documentation here (https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html).
I cannot share the documents in their current form, but I might try to reproduce the issue in a simplified index.
I think it is the "Inverse Document Frequency" that causes the score to vary across shards, because not all shards contain the same amount of documents containing "horse".
I just reindexed all my products on a single shard, and now the results are ordered the way I would expect them to be, and documents containing identical terms get identical scores. So at least that confirms my theory that the sharding gave me different scores.
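The inverse document frequency effect described above can be illustrated with a small Python sketch. It assumes the BM25 similarity and its usual Lucene-style IDF formula; the per-shard document counts and document frequencies below are made-up numbers for illustration only, not figures from the actual index.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).

    doc_count (N) is the number of documents in scope,
    doc_freq (n) is how many of them contain the term.
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the term "horse": (documents, documents with term).
shards = {
    "shard 1": (30_000, 40),
    "shard 2": (30_000, 55),
}

if __name__ == "__main__":
    # Each shard scores with its own local statistics by default.
    for name, (docs, freq) in shards.items():
        print(f"{name}: idf = {bm25_idf(docs, freq):.3f}")

    # With a single shard, every document is scored with the same statistics.
    total_docs = sum(docs for docs, _ in shards.values())
    total_freq = sum(freq for _, freq in shards.values())
    print(f"single shard: idf = {bm25_idf(total_docs, total_freq):.3f}")
```

Because the shard with fewer matching documents yields a higher IDF, an identical document can score higher there than on another shard, which matches the ordering seen in the "horse" example.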
With regards to memory and scaling, I will make sure that we monitor the resource consumption, but the current setup has been running for quite some time with no issues. But thank you for the info, in case we do need to separate the cluster from the web servers.
You can also explore the dfs_query_then_fetch search type, which collects term statistics from all shards before scoring, so that scores are consistent across shards. Since your index is small, this is a practical way to get more accurate scores. There will be some overhead involved, though, so you might want to compare the response times.
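As a rough sketch of what such a request could look like, the Python snippet below builds the URL and body for a search that uses search_type=dfs_query_then_fetch. The host, the index name "products", and the field name "name" are assumptions for illustration; the request is only constructed, not sent.

```python
import json
from urllib.parse import urlencode


def build_search_request(host: str, index: str, field: str, text: str,
                         dfs: bool = True) -> tuple[str, str]:
    """Return (url, json_body) for a match query against one index.

    When dfs is True, the search_type URL parameter asks Elasticsearch to
    gather global term statistics before scoring.
    """
    params = {"search_type": "dfs_query_then_fetch"} if dfs else {}
    query_string = urlencode(params)
    url = f"{host}/{index}/_search"
    if query_string:
        url = f"{url}?{query_string}"
    body = json.dumps({"query": {"match": {field: text}}})
    return url, body


if __name__ == "__main__":
    # Hypothetical host, index, and field names.
    url, body = build_search_request(
        "http://localhost:9200", "products", "name", "horse"
    )
    print(url)
    print(body)
```

Running the same query with dfs=True and dfs=False against a multi-shard index is one way to compare both the resulting order and the response times.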