We store texts in Elasticsearch, where each text has an ID attached. Every
day we run a batch job to add new documents. Sometimes a new document
consists of a text that we already have in the database, but it has a
different ID. In such a case we need to read the document that's already
indexed and add the new ID to this existing document.
Now consider the following scenario:
DocumentA with text "Hello" and ID #1 is indexed.
We now add documentB with text "Hello" and ID #2
To do this we find documentA which has the same text, read it, add ID #2
and save it again.
Then we want to add documentC with text "Hello" and ID #3
To do this we find documentA which has the same text, read it, add ID #3
and save it again.
What do we get as a result? It's a bit unpredictable but quite often:
DocumentA with text "Hello" and IDs #1 and #3. This means ID #2 is missing.
It seems like the first write (with ID #2) has not completed when the
second read is done.
I guess we are not the first to encounter these issues. What are common
strategies to deal with this?
Thanks for the reply. That seems like it could work, at least at first
glance.
But on closer inspection:
How do I know the correct version?
If I add documentC, ES will return documentA, in the state before the ID
from documentB was added, so I guess the version returned from ES is also
the wrong one.
Does this mean that I need to keep track of versions outside of ES? That's
going to be difficult.
Note: In our setup, the amount of data is a concern, not processing time
while adding new documents. Can't I just tell Elasticsearch to wait after
an insert until it is fully processed?
You don't need to add your own external versions; just use ES's internal
versions (the version starts at 1 when you create the doc and increments
each time it's updated).
You know the correct version because you retrieved the current doc first
from ES, which returns its current version. Then you make your change and
submit it back for re-indexing with that version. The re-indexing fails if
another thread updated the doc in the meantime, and in that case you retry.
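
The read, modify, retry loop described above can be sketched roughly like
this. It is only an illustration, not Elasticsearch client code:
VersionedStore is a hypothetical in-memory stand-in that mimics the internal
version counter (1 on create, +1 on every write) and rejects a write whose
expected version is stale. The "ids" field name and add_id helper are
assumptions as well.

```python
import copy


class VersionConflict(Exception):
    """Raised when the stored version differs from the expected one."""


class VersionedStore:
    """Hypothetical in-memory stand-in for an index with internal versions."""

    def __init__(self):
        self._docs = {}  # key -> (version, doc)

    def create(self, key, doc):
        self._docs[key] = (1, copy.deepcopy(doc))

    def get(self, key):
        # Returns the current version together with a copy of the doc.
        version, doc = self._docs[key]
        return version, copy.deepcopy(doc)

    def index(self, key, doc, expected_version):
        # The write succeeds only if nobody else wrote since our read.
        current_version, _ = self._docs[key]
        if current_version != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self._docs[key] = (current_version + 1, copy.deepcopy(doc))
        return current_version + 1


def add_id(store, key, new_id, max_retries=5):
    """Read the doc, append new_id, write back; re-read and retry on conflict."""
    for _ in range(max_retries):
        version, doc = store.get(key)
        if new_id in doc["ids"]:
            return version  # nothing to do
        doc["ids"].append(new_id)
        try:
            return store.index(key, doc, expected_version=version)
        except VersionConflict:
            continue  # someone else updated the doc; start over from the read
    raise RuntimeError("gave up after repeated version conflicts")


store = VersionedStore()
store.create("A", {"text": "Hello", "ids": [1]})
add_id(store, "A", 2)
add_id(store, "A", 3)
print(store.get("A"))
```

The important part is that a conflict sends the writer back to the read
step, so ID #2 cannot be silently overwritten by a writer that only saw ID #1.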
Alternatively, you could maybe use the update API that takes a script, if
you can express "add my new ID" as a script? See
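
As a rough sketch of what "add my new ID" could look like as a script: the
snippet below only builds a request body for the update API. It assumes a
recent Elasticsearch version where Painless is the scripting language, and
it assumes the IDs live in an array field called "ids"; both the field name
and the build_add_id_update helper are made up for illustration.

```python
import json


def build_add_id_update(new_id, field="ids"):
    """Build an update-API body that appends new_id unless it is already there.

    The field name is an assumption about the mapping, not taken from the thread.
    """
    source = (
        f"if (!ctx._source.{field}.contains(params.new_id)) "
        f"{{ ctx._source.{field}.add(params.new_id) }}"
    )
    return {
        "script": {
            "lang": "painless",
            "source": source,
            "params": {"new_id": new_id},
        }
    }


# Body to send to the update endpoint of the existing document.
print(json.dumps(build_add_id_update(2), indent=2))
```

Because the script runs against the current document on the ES side, the
application does not have to read the doc and write it back itself.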
Sounds like you're searching and using the results to reindex the doc. Search is not real time, but get is. So you could get the document by ID after the search. Also make sure your application isn't stepping on itself: either run the updates serially or use some form of locking.
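
To make the search vs. get difference concrete, here is a toy model. It is
not how Elasticsearch is implemented; NearRealTimeIndex and
find_current_docs are hypothetical names, and the model only captures one
behaviour: a write is visible to a get by ID right away, but to search only
after the next refresh.

```python
import copy


class NearRealTimeIndex:
    """Toy model: get() sees every write, search sees the last refresh."""

    def __init__(self):
        self._live = {}        # what a get by ID sees
        self._searchable = {}  # snapshot that search sees

    def index(self, doc_id, doc):
        self._live[doc_id] = copy.deepcopy(doc)

    def refresh(self):
        # Make everything written so far visible to search.
        self._searchable = copy.deepcopy(self._live)

    def get(self, doc_id):
        return copy.deepcopy(self._live[doc_id])

    def search_by_text(self, text):
        # Returns (doc_id, possibly stale doc) pairs from the snapshot.
        return [
            (doc_id, copy.deepcopy(doc))
            for doc_id, doc in self._searchable.items()
            if doc["text"] == text
        ]


def find_current_docs(index, text):
    """Use search only to find the doc IDs, then get each doc by ID."""
    return [index.get(doc_id) for doc_id, _ in index.search_by_text(text)]


index = NearRealTimeIndex()
index.index("A", {"text": "Hello", "ids": [1]})
index.refresh()
index.index("A", {"text": "Hello", "ids": [1, 2]})  # not refreshed yet

print(index.search_by_text("Hello"))      # stale: ids without #2
print(find_current_docs(index, "Hello"))  # current: ids with #2
```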