Fluctuating Lucene scores are thrashing our pagination

We have noticed that we sometimes get duplicate search results after moving to the next page. After investigating, we found that the Lucene scores change frequently, and that causes the problem: a given result sometimes appears on the previous page, sometimes on the next page, and sometimes on both. Could you please:

  • Confirm whether this is expected
  • If this is expected behavior, can you advise on some workarounds to avoid the problem?
  • If this is not expected, what are the next steps to report and fix the problem?

Customer: MindTouch Inc.
Cluster id: b6353d

Thanks in advance

Cheers,
Manuel.

@Manuel_Sugawara1 I've moved the post to the #elasticsearch topic as this is more specific to the workings of Elasticsearch.

Christian

This is (generally) expected behavior. Lucene scores are influenced by things such as the term and document frequencies. If you are actively indexing data, these frequencies change, and the resulting score shifts can show up as artifacts like the duplicates you describe.

Also, even if the term/doc freqs don't change noticeably, the presence of new documents can affect ranking. E.g. a user asks for results 1-10, a new document is indexed which would have ranked #10, and the user then asks for results 11-20. At this point, the "old" #10 has been bumped to #11 and shows up again.
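To make the shifting-rank case concrete, here is a small simulation in plain Python. It does not talk to Elasticsearch at all; the ranked list and the document names are made up for illustration.

```python
def page(ranked, start, size):
    """Return one page from a ranked list (start is a zero-based offset)."""
    return ranked[start:start + size]

# Hypothetical ranking: doc1 scores best, doc30 worst.
ranked = [f"doc{i}" for i in range(1, 31)]
first_page = page(ranked, 0, 10)  # doc1 .. doc10

# Between the two requests, a new document is indexed that ranks #10.
ranked_after = ranked[:9] + ["new_doc"] + ranked[9:]
second_page = page(ranked_after, 10, 10)

# The old #10 is now #11, so it appears on both pages.
duplicates = [doc for doc in second_page if doc in first_page]
print(duplicates)  # ['doc10']
```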

The best/easiest way to deal with this is to ask for bigger pages and buffer the results internally. For example, a user generally only looks at a few pages before refining their query, so perhaps ask for the first 50 results up front and keep them queued in your app.
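A minimal sketch of that buffering idea, assuming your app already has some function that runs the search with an offset and a size and returns document ids. The `BufferedResults` class and the `fetch` callable are hypothetical names, not part of any Elasticsearch client.

```python
class BufferedResults:
    """Serve small pages out of one larger, locally cached result set."""

    def __init__(self, fetch, buffer_size=50):
        self._fetch = fetch            # hypothetical: fetch(start, size) -> list of ids
        self._buffer_size = buffer_size
        self._buffer = []
        self._seen = set()
        self._next_from = 0
        self._exhausted = False

    def _refill(self):
        hits = self._fetch(self._next_from, self._buffer_size)
        self._next_from += len(hits)
        if len(hits) < self._buffer_size:
            self._exhausted = True
        for doc_id in hits:
            # Skip ids already buffered, in case ranks shifted between fetches.
            if doc_id not in self._seen:
                self._seen.add(doc_id)
                self._buffer.append(doc_id)

    def page(self, number, size=10):
        """Return page `number` (zero-based), fetching more only when needed."""
        end = (number + 1) * size
        while len(self._buffer) < end and not self._exhausted:
            self._refill()
        return self._buffer[number * size:end]


# Stand-in for the real search: a list that changes between page requests.
index = [f"doc{i}" for i in range(1, 31)]
calls = []

def fake_fetch(start, size):
    calls.append((start, size))
    return index[start:start + size]

results = BufferedResults(fake_fetch, buffer_size=50)
page0 = results.page(0)
index.insert(9, "new_doc")   # a new document lands at rank #10
page1 = results.page(1)      # served from the buffer, so no duplicate
```

The first few pages come from a single snapshot, so they stay consistent with each other. Once the user pages past the buffer, a new fetch is needed, and the de-duplication only hides repeats; it cannot recover a document that slipped across the boundary unseen.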

Alternatively, you could use scroll windows to maintain "consistent pagination contexts", but this can be expensive if you have many users executing simultaneous queries. I would avoid this option tbh.
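For completeness, a rough sketch of what the scroll approach could look like. It assumes a `client` object exposing `search`, `scroll`, and `clear_scroll` methods shaped like those of the official Python client; the exact keyword arguments vary between client versions, so treat the call signatures here as assumptions. `scroll_pages` is a hypothetical helper name.

```python
def scroll_pages(client, index, query, page_size=10, keep_alive="1m"):
    """Yield pages of hits from one scroll context, then release it."""
    response = client.search(
        index=index,
        scroll=keep_alive,  # how long the server keeps the context alive
        body={"size": page_size, "query": query},
    )
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield response["hits"]["hits"]
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response["_scroll_id"]
    finally:
        # Each open context holds resources on the cluster, so free it explicitly.
        client.clear_scroll(scroll_id=scroll_id)
```

Every active user would hold a context like this open for the keep-alive period, which is where the cost comes from.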

Edit: You could also save the time the "search session" started, then apply filters to only allow documents that were indexed before the session started to prevent new docs from affecting the search. Note that this would require docs to store their creation time, and isn't robust against updates or deletions.
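A sketch of the request body for that session-start filter, built as a plain Python dict. It assumes documents carry a creation timestamp in a field called `created_at` (a made-up name) and an Elasticsearch version whose `bool` query supports a `filter` clause (2.0 or later, as far as I know). `session_query` is a hypothetical helper.

```python
from datetime import datetime, timezone

def session_query(user_query, session_start, start, size, time_field="created_at"):
    """Wrap a query so only documents created before the session began can match."""
    return {
        "from": start,
        "size": size,
        "query": {
            "bool": {
                "must": [user_query],
                # Filter clauses restrict the hits without contributing to the score.
                "filter": [
                    {"range": {time_field: {"lte": session_start.isoformat()}}}
                ],
            }
        },
    }

# Record the start time once per search session and reuse it for every page.
session_start = datetime(2016, 1, 1, tzinfo=timezone.utc)
body = session_query({"match": {"title": "pagination"}}, session_start, start=10, size=10)
```

Keep in mind that the filter only keeps newer documents out of the hits. As far as I know, the term statistics used for scoring are still computed over the whole index, so small score shifts remain possible.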

Thanks, totally makes sense.

Cheers,
Manuel.