Document versioning hurting search results?


#1

Hi,

We have an index with client data and search on a field containing companyName. When a client document is updated without any changes to companyName, the search results change.

The Explain API shows that the document count and term frequencies change. It seems that the old version of the document is still included in the relevance scoring. This significantly influences the search results (terms in company names become less unique).
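The effect can be sketched with Lucene's classic TF/IDF inverse document frequency formula, idf = 1 + ln(numDocs / (docFreq + 1)). The index size and term below are made up for illustration; the point is only the direction of the change:

```python
import math

def idf(num_docs, doc_freq):
    # Lucene's classic TF/IDF inverse document frequency:
    # idf = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Before the update: "Acme" occurs in 1 of 100,000 docs (made-up numbers).
before = idf(100_000, 1)

# After an in-place update: the old version is deleted but still counted,
# so both the doc count and the term's doc frequency go up.
after = idf(100_001, 2)

assert after < before  # the term now looks less unique, so it scores lower
```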

Is there any way I can exclude old document versions from the search?

Regards,

Winston


(Mark Walkom) #2

When you change the version of a document in ES you are overwriting the old one entirely, so the old ones won't exist.

Or are you doing your own versioning?


(Mark Harwood) #3

As Mark says, an update is a delete followed by an insert. However, the deleted document is only soft-deleted: it is not fully removed until background merge operations reorganise the segment files and purge deleted docs.
A consequence of this is that the frequencies reported for words in scoring include these deleted docs. It is a performance optimisation that we don't continually recompute these numbers.
This may have a big impact on a tiny index where you might be testing behaviors, but in a larger index, where new additions constantly trigger merge operations, it tends to be less of an issue.
If you really want to force a merge operation to purge deletes, look at the optimize API, but be advised that this can be hugely expensive to run, so read the documentation first!
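A toy sketch of the mechanism described above. Nothing here is real Elasticsearch internals; the segment is modelled as a plain list, and the doc contents are invented:

```python
# Toy model of a Lucene segment: live docs plus soft-deleted docs that
# still count toward term statistics until a merge rewrites the segment.
docs = [
    {"id": 1, "company": "Acme Corp", "deleted": False},
    {"id": 1, "company": "Acme Corp", "deleted": True},   # old version, soft-deleted
    {"id": 2, "company": "Beta Ltd",  "deleted": False},
]

def doc_freq(segment, term):
    # Pre-merge statistics are read straight off the segment, deletes included.
    return sum(term in d["company"] for d in segment)

def force_merge(segment):
    # A merge writes a new segment containing only the live docs.
    return [d for d in segment if not d["deleted"]]

print(doc_freq(docs, "Acme"))               # 2: the soft-deleted doc still counts
print(doc_freq(force_merge(docs), "Acme"))  # 1: purged after the merge
```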


#4

We do not use any custom versioning scheme. Everything is pretty standard.

We have a fairly small index of around 300k documents with company info. The users search for a company, select it and modify it. When they save their modifications, the Elasticsearch index is updated with the new info. After the update the search results (ordering) have changed, which is confusing for the users.

I tried an explicit optimize, but this did not reduce the document count at all, even after setting index.merge.policy.expunge_deletes_allowed to zero.


#5

Adding:

.setMaxNumSegments(1)

seems to work in combination with:

index.merge.policy.expunge_deletes_allowed=0

Merge timings for a single record fluctuate between 10 ms and a few seconds. I am not sure what causes the huge differences.


(Mark Walkom) #6

Scoring is based on whatever other docs exist in the same shard; that could be why.
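As an illustration of how per-shard statistics alone can move scores, using the same classic IDF formula as above with invented per-shard counts:

```python
import math

def idf(num_docs, doc_freq):
    # Lucene's classic TF/IDF inverse document frequency
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for the same term on two shards: by default each
# shard scores with its own local counts, not index-wide ones.
shard_a = {"num_docs": 150_000, "doc_freq": 3}
shard_b = {"num_docs": 150_000, "doc_freq": 9}

print(idf(**shard_a))  # same term, different score depending on the shard
print(idf(**shard_b))
```

In Elasticsearch, `search_type=dfs_query_then_fetch` gathers index-wide term statistics before scoring, which evens this out at some extra cost per query.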


(Mark Harwood) #7

In a system undergoing constant change you can't expect stability in the rankings, because the frequencies of words and the number of documents are constantly changing.

A full optimize (setting maxNumSegments to 1) after every change is not a scalable approach.

Lucene never updates a file it has written; it only creates new ones. Over time you would obviously end up with a lot of files, so the background merge logic continually glues older files together into new files (minus the deleted records).
An optimize call with maxNumSegments=1 will effectively rewrite your entire index into new "optimized" files, minus the deletes. That's quite a cost to bear for every update. It is for these reasons that Lucene now calls its "optimize" function "forceMerge", as that sounds a lot less tempting for users to run. Typically we would only advocate using this when no more additions are expected on an index, e.g. at the end of the day in a system with an index-per-day strategy.

