Hi,
I am trying to use the Bulk method with the C# NEST client. For example:
Let's say I have indexed 1000 documents the first time.
The second time I have 800 documents, but I don't know whether these are new or old. I have to replace the existing 1000 with these 800.
When I use the Bulk method, it creates new documents where there are no matching ids and updates any documents whose ids match.
My problem is how to delete the 200 unmatched documents.
There is a Delete API, but as I mentioned earlier, I don't have the ids of the documents that should be deleted.
I could query Elasticsearch for the ids other than these 800 ids and then delete them, but I am not sure that's the best way to do this. Or:
Do you think using the Index Aliases feature would solve this problem? If I use index aliases, I have to delete the old index after pointing the alias to the new index, but does this affect search scoring?
Could you please help me solve this problem?
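For reference, the query-then-delete idea could be expressed with Elasticsearch's delete-by-query API, removing every document whose id is not in the new set (the index name `documents` and the ids shown are placeholder examples):

```
POST /documents/_delete_by_query
{
  "query": {
    "bool": {
      "must_not": {
        "ids": { "values": ["id-1", "id-2", "..."] }
      }
    }
  }
}
```

For large id lists this request body gets unwieldy, which is a practical drawback of this approach.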
Use an alias and versioned indices, e.g. an alias documents pointing to a single index documents-v1:
index the 1000 documents into documents-v1 (using either the index name or the alias)
index the 800 documents into a new index documents-v2
swap the alias documents from documents-v1 to documents-v2
delete index documents-v1
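The last two steps map onto the aliases API; a minimal sketch, assuming the alias is named documents:

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "documents-v1", "alias": "documents" } },
    { "add":    { "index": "documents-v2", "alias": "documents" } }
  ]
}

DELETE /documents-v1
```

The actions in a single `_aliases` request are applied atomically, so searches through the alias never see a moment where no index is attached.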
For 800 to 1000 documents, a single primary shard can be used (with replicas for redundancy), so scoring will be based on the entire document corpus in each case.
Thank you very much for replying.
Just to be sure, an example:
Each of the 1000 documents has one property called PropertyA.
After indexing and searching for some time, the PropertyA score is 0.9.
So after indexing the second time (800 documents) using this approach, will the PropertyA score still be 0.9? Is that right?
And if in future I have 25000-plus documents, will this still be the case?
That is not correct; scores are calculated relative to the document corpus, by default using BM25. A component of document scoring is the inverse of the frequency of a term within the entire document corpus, so it's highly probable that scores calculated for the 1000 documents will differ from those calculated for the 800 documents.
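To make the corpus dependence concrete, here is a small Python sketch of the IDF term in Lucene's BM25 (the similarity Elasticsearch uses by default); the document frequency of 100 is an assumed example value:

```python
import math

def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Lucene's BM25 IDF term: ln(1 + (N - n + 0.5) / (n + 0.5)),
    where N is the number of docs in the shard and n is how many contain the term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Same term, same document frequency, different corpus sizes:
idf_1000 = bm25_idf(1000, 100)  # term appears in 100 of 1000 docs
idf_800 = bm25_idf(800, 100)    # term appears in 100 of 800 docs

print(round(idf_1000, 4))  # ≈ 2.2986
print(round(idf_800, 4))   # ≈ 2.0757
```

Shrinking the corpus from 1000 to 800 documents lowers the IDF of this term, so the same match contributes a smaller score even though nothing about the matching document changed.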
Yes, of course the score won't be the same. My thinking was whether this approach itself has any effect on the original score, but as you mentioned, a component of document scoring is the inverse of the frequency of a term within the entire document corpus.
I will read those documents.
Thank you very much for pointing me in the right direction.