Document versioning hurting search results?


#1

Hi,

We have an index with client data and search on a field containing companyName. When a client document is updated without any changes to companyName, the search results change.

The Explain API shows that the document count and term frequencies change. It seems that the old version of the document is still included in the relevance scoring. This significantly influences the search results (terms in company names become less unique).
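The effect can be sketched with Lucene's classic TF/IDF inverse document frequency formula, idf = 1 + ln(numDocs / (docFreq + 1)). The index size and term below are made up for illustration; the point is only the direction of the change:

```python
import math

def idf(num_docs, doc_freq):
    # Lucene's classic TF/IDF inverse document frequency:
    # idf = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Before the update: "Acme" occurs in 1 of 100,000 docs (made-up numbers).
before = idf(100_000, 1)

# After an in-place update: the old version is deleted but still counted,
# so both the doc count and the term's doc frequency go up.
after = idf(100_001, 2)

assert after < before  # the term now looks less unique, so it scores lower
```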

Is there any way I can exclude old document versions from the search?

Regards,

Winston


(Mark Walkom) #2

When you change the version of a document in ES you are overwriting the old one entirely, so the old ones won't exist.

Or are you doing your own versioning?


(Mark Harwood) #3

As Mark says, an update is a delete followed by an insert. However, the deleted document is only soft-deleted: it is not fully removed until background merge operations reorganise the segment files and purge deleted docs.
A consequence of this is that the frequencies reported for words in scoring include these deleted docs. It is a performance optimisation that we don't continually recompute these numbers.
This may have a big impact on a tiny index where you might be testing behaviors, but in a larger index, where new additions constantly trigger merge operations, it tends to be less of an issue.
If you really want to force a merge operation to purge deletes, look at the optimize API, but be advised that this can be hugely expensive to run, so read the documentation first!
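A toy sketch of the mechanism described above. Nothing here is real Elasticsearch internals; the segment is modelled as a plain list, and the doc contents are invented:

```python
# Toy model of a Lucene segment: live docs plus soft-deleted docs that
# still count toward term statistics until a merge rewrites the segment.
docs = [
    {"id": 1, "company": "Acme Corp", "deleted": False},
    {"id": 1, "company": "Acme Corp", "deleted": True},   # old version, soft-deleted
    {"id": 2, "company": "Beta Ltd",  "deleted": False},
]

def doc_freq(segment, term):
    # Pre-merge statistics are read straight off the segment, deletes included.
    return sum(term in d["company"] for d in segment)

def force_merge(segment):
    # A merge writes a new segment containing only the live docs.
    return [d for d in segment if not d["deleted"]]

print(doc_freq(docs, "Acme"))               # 2: the soft-deleted doc still counts
print(doc_freq(force_merge(docs), "Acme"))  # 1: purged after the merge
```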


#4

We do not use any custom versioning scheme. Everything is pretty standard.

We have a fairly small index of around 300k documents with company info. The users search for a company, select it and modify it. When they save their modifications, the Elasticsearch index is updated with the new info. After the update the search results (ordering) have changed, which is confusing for the users.

I tried an explicit optimize, but this did not reduce the document count at all, even after setting index.merge.policy.expunge_deletes_allowed to zero.


#5

Adding:

.setMaxNumSegments(1)

seems to work in combination with:

index.merge.policy.expunge_deletes_allowed=0

Merge timings for a single record fluctuate between 10 ms and a few seconds. I am not sure what causes the huge differences.


(Mark Walkom) #6

Scoring is based on whatever other docs exist in the same shard; that could be why.
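As an illustration of how per-shard statistics alone can move scores, using the same classic IDF formula as above with invented per-shard counts:

```python
import math

def idf(num_docs, doc_freq):
    # Lucene's classic TF/IDF inverse document frequency
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical statistics for the same term on two shards: by default each
# shard scores with its own local counts, not index-wide ones.
shard_a = {"num_docs": 150_000, "doc_freq": 3}
shard_b = {"num_docs": 150_000, "doc_freq": 9}

print(idf(**shard_a))  # same term, different score depending on the shard
print(idf(**shard_b))
```

In Elasticsearch, `search_type=dfs_query_then_fetch` gathers index-wide term statistics before scoring, which evens this out at some extra cost per query.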


(Mark Harwood) #7

In a system undergoing constant change you can't expect stability in the rankings, because the frequencies of words and the number of documents are constantly changing.

A full optimize (setting maxNumSegments to 1) after every change is not a scalable approach.

Lucene never updates a file it has written; it only creates new ones. Over time you would obviously end up with a lot of files, so the background merge logic continually glues older files together into new files (minus the deleted records).
An optimize call with maxNumSegments=1 will effectively rewrite your entire index into new "optimized" files, minus the deletes. That's quite a cost to bear for every update. It is for these reasons that Lucene now calls its "optimize" function "forceMerge", as that sounds a lot less tempting for users to run. Typically we would only advocate using this when no more additions are expected on an index, e.g. at the end of the day in a system with an index-per-day strategy.

