We're running Elasticsearch (currently 0.90.6) in what I'd call a
"replicated" architecture: our indexes are quite small (tens of thousands
of documents) and fit easily on a single machine, so we allocate a single
shard per index. However, we make sure that they are replicated to each
node of our cluster. This approach ensures that each application
server has its own "local" ES with all the data of an index and can keep
working autonomously if the others fail. This works alright so far.

Now, we're seeing small but visible score discrepancies between ES nodes,
specifically between the primary shard and the replicas. Using explain, we
found out that the difference is in the maxDocs value. As is known and
documented, deleted documents may still contribute to the maxDocs value
(and thus affect TF-IDF scores). That's not a problem per se.

The problem is rather that maxDocs differs between the primary and the
replica shards (until we restart ES or force a merge using the optimize
call). Depending on whether the exact same query hits the primary or a
replica, we get different scores, because the maxDocs values differ by
exactly the number of documents that have been deleted previously.
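
To illustrate the effect, here is a minimal sketch, assuming the classic
Lucene TF-IDF formula idf = 1 + ln(maxDocs / (docFreq + 1)). All numbers
are made up, and docFreq is held constant for simplicity (in practice it
can also be affected by deleted documents):

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene TF-IDF idf: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Hypothetical values, for illustration only.
doc_freq = 120        # documents containing the query term
live_docs = 20000     # documents that are not deleted
deleted_docs = 350    # deleted documents not yet merged away

# Shard copy whose segments no longer contain the deleted documents.
idf_merged = idf(doc_freq, live_docs)

# Shard copy that still counts the deleted documents in maxDocs.
idf_unmerged = idf(doc_freq, live_docs + deleted_docs)

print("idf with deletes merged away: ", idf_merged)
print("idf with deletes still counted:", idf_unmerged)
print("difference:", idf_unmerged - idf_merged)
```

The difference is small, but since idf enters the score of every matching
term, it is enough to produce slightly different scores for the same query.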

Is there any way to ensure that maxDocs is the same on primary and replica
shards, short of forcing a costly merge?

(Using DFS queries or not makes no difference, as I would expect from my
understanding of them - the index isn't really distributed, it's