Can repeatedly adding and deleting documents make an index inconsistent?

(Ali Syed) #1

For my use case, I'll need to add and remove many documents in an Elasticsearch index. My understanding is that tf-idf or BM25 scores depend on frequencies calculated from the postings lists (?). If I add and remove many documents in a day, will that affect the document/word statistics?

I've already gone through a lot of the APIs, but my untrained eyes could not tell whether this is the case, or whether there's a way to force Elasticsearch to update/recompute the index every day or so...

Any help would be appreciated


(Igor Motov) #2

The IDF portion of the score can be affected by deletions and modifications, because we don't take deletes into account when calculating IDF: deleted documents still count toward the term statistics until they are merged away.
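To get a feel for how large the skew can be, here is a rough sketch in Python. It assumes the BM25 IDF formula as Lucene computes it, ln(1 + (N - n + 0.5) / (n + 0.5)), and uses made-up document counts; the variable names are only for illustration.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25 IDF, assuming Lucene's form: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical numbers, chosen only to make the effect visible.
live_docs = 1000          # documents a search can actually return
live_with_term = 10       # live documents containing the term
deleted_docs = 500        # deleted documents not yet merged away
deleted_with_term = 200   # deleted documents that contained the term

# IDF a user would expect, based on live documents only.
idf_live_only = bm25_idf(live_docs, live_with_term)

# IDF when deleted-but-unmerged documents still count in the statistics.
idf_with_deletes = bm25_idf(
    live_docs + deleted_docs,
    live_with_term + deleted_with_term,
)

print(f"IDF over live docs only:     {idf_live_only:.3f}")
print(f"IDF including deleted docs:  {idf_with_deletes:.3f}")
```

In this invented case the term looks much more common than it really is, so its IDF drops. With few deletes relative to the index size, the two values would be close.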

(Ali Syed) #3

Hi Igor,

Thanks a lot for your answer. Is there any way to remedy this? For example, I see there is a _refresh API. Will refreshing help with this?

(Igor Motov) #4

As you add more and more data to Elasticsearch, it will automatically start merging segments and dropping deleted documents. Because this happens automatically, a large index should have a relatively small number of deletes and a pretty close approximation of IDF. So if you tested the impact on a small index and it looked huge, try testing it on a real index to see whether the impact is still noticeable. You can force the removal of deleted documents by running a force merge with the only_expunge_deletes flag, but that is typically not recommended, since it causes additional CPU and disk load, excessive cache invalidation and other issues. I would only do that on a small index where the impact of skewed IDF is high and the extra load from the force merge is small.
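As a minimal sketch of what that could look like from Python, using only the standard library: the functions below build a request for the index stats API (to see how many deleted documents are still around) and one for a force merge with only_expunge_deletes. The host `http://localhost:9200` and the index name `my-index` are assumptions, and the helper names are hypothetical. Nothing is sent until you pass a request to `urllib.request.urlopen`.

```python
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured cluster
INDEX = "my-index"                # hypothetical index name


def docs_stats_request(base_url, index):
    """Build GET /<index>/_stats/docs, which reports docs.count and docs.deleted."""
    url = f"{base_url}/{urllib.parse.quote(index)}/_stats/docs"
    return urllib.request.Request(url, method="GET")


def expunge_deletes_request(base_url, index):
    """Build POST /<index>/_forcemerge?only_expunge_deletes=true."""
    query = urllib.parse.urlencode({"only_expunge_deletes": "true"})
    url = f"{base_url}/{urllib.parse.quote(index)}/_forcemerge?{query}"
    return urllib.request.Request(url, method="POST")


def deleted_ratio(stats):
    """Fraction of documents in primary shards that are marked deleted.

    Expects the parsed JSON body of the _stats/docs response.
    """
    docs = stats["_all"]["primaries"]["docs"]
    total = docs["count"] + docs["deleted"]
    return docs["deleted"] / total if total else 0.0


stats_req = docs_stats_request(ES_URL, INDEX)
merge_req = expunge_deletes_request(ES_URL, INDEX)
print(stats_req.get_method(), stats_req.full_url)
print(merge_req.get_method(), merge_req.full_url)
```

To actually run it, something like `json.load(urllib.request.urlopen(stats_req))` would give the stats to feed into `deleted_ratio`. Checking the ratio first makes it easier to judge whether the extra load of a force merge is worth it.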

By the way, if you have multiple shards in your index and you really care about correct IDF, you should run your queries with the DFS phase (search_type=dfs_query_then_fetch). Otherwise, IDF will be calculated on each shard separately instead of globally.
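For illustration, a search with the DFS phase enabled might be built like this in Python. The host, index name, field name and query text are all placeholders, and `dfs_search_request` is a hypothetical helper. The only part that matters is the `search_type=dfs_query_then_fetch` URL parameter.

```python
import json
import urllib.parse
import urllib.request


def dfs_search_request(base_url, index, field, text):
    """Build POST /<index>/_search?search_type=dfs_query_then_fetch with a match query."""
    query = urllib.parse.urlencode({"search_type": "dfs_query_then_fetch"})
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{urllib.parse.quote(index)}/_search?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumptions: local cluster, hypothetical index "my-index" and field "title".
search_req = dfs_search_request("http://localhost:9200", "my-index", "title", "quick fox")
print(search_req.get_method(), search_req.full_url)
print(search_req.data.decode("utf-8"))
```

Passing `search_req` to `urllib.request.urlopen` would send it. The extra DFS round trip adds some latency per query, so it is probably only worth it when per-shard statistics differ enough to change the ranking.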

(Ali Syed) #5

Thanks! :slight_smile:
