Reading and writing the same document too fast --> data loss


(Peter Webber) #1

Hello,

We store texts in Elasticsearch, where each text has an ID attached. Every
day we run a batch job to add new documents. Sometimes a new document
consists of a text that we already have in the database, but it has a
different ID. In such a case we need to read the document that's already
indexed and add the new ID to this existing document.

Now consider the following scenario:

DocumentA with text "Hello" and ID #1 is indexed.

We now add documentB with text "Hello" and ID #2
To do this we find documentA which has the same text, read it, add ID #2
and save it again.

Then we want to add documentC with text "Hello" and ID #3
To do this we find documentA which has the same text, read it, add ID #3
and save it again.

What do we get as a result? It's a bit unpredictable but quite often:
DocumentA with text "Hello" and IDs #1 and #3. This means ID #2 is missing.

It seems the first write (with ID #2) has not completed by the time the
second read is done.

I guess we are not the first to encounter these issues. What are common
strategies to deal with this?

Regards
Peter

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3da9cec9-e48d-4c55-b7ed-330235322c4f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Michael McCandless) #2

Maybe you need to use versioning, to ensure the 3rd write doesn't undo
(overwrite) the changes of the 2nd write?

See
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/optimistic-concurrency-control.html

Mike McCandless

http://blog.mikemccandless.com

(In reply to Peter Webber's message of Fri, Jul 11, 2014 at 6:24 AM.)


(Peter Webber) #3

Thanks for the reply. At first glance that seems like it could work.

But on second look: how do I know the correct version?

If I add documentC, ES will return documentA in the state before the ID
from documentB was added, so I guess the version returned from ES will be
out of date as well.

Does this mean that I need to keep track of versions outside of ES? That's
going to be difficult.

Note: In our setup, the amount of data is a concern, not processing time
while adding new documents. Can't I just tell Elasticsearch to wait after
an insert until it is fully processed?

Thanks!
Michael



(Michael McCandless) #4

You don't need to add your own external versions; just use ES's internal
versions (starts at 1 when you create the doc, and increments each time
it's updated).

You know the correct version because you retrieved the current doc first
from ES, which returns its current version. Then you make your change and
submit it back for re-indexing together with the version you read. That
re-indexing fails with a version conflict if another thread updated the doc
in the meantime; in that case you re-read the doc and retry.
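
To make the read / modify / write-with-version / retry loop concrete, here
is a minimal sketch. It does not talk to Elasticsearch: VersionedStore,
VersionConflict and add_id are hypothetical names, and the store is an
in-memory stand-in that only mimics the internal version check described
above (version starts at 1 and increments on every write).

```python
import threading


class VersionConflict(Exception):
    """Raised when the expected version no longer matches the stored one."""


class VersionedStore:
    """Hypothetical in-memory stand-in for an index with internal versions."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)
        self._lock = threading.Lock()

    def get(self, doc_id):
        # Return the current version plus a private copy of the source.
        with self._lock:
            version, source = self._docs[doc_id]
            return version, dict(source, ids=list(source["ids"]))

    def index(self, doc_id, source, expected_version=None):
        # Reject the write if someone else has written since we read.
        with self._lock:
            current = self._docs.get(doc_id, (0, None))[0]
            if expected_version is not None and expected_version != current:
                raise VersionConflict(f"expected {expected_version}, found {current}")
            self._docs[doc_id] = (current + 1, source)
            return current + 1


def add_id(store, doc_id, new_id, max_retries=10):
    """Read, modify, write with the version we read; on conflict, start over."""
    for _ in range(max_retries):
        version, source = store.get(doc_id)
        if new_id not in source["ids"]:
            source["ids"].append(new_id)
        try:
            return store.index(doc_id, source, expected_version=version)
        except VersionConflict:
            continue  # another writer got in first: re-read and retry
    raise RuntimeError("gave up after repeated version conflicts")


store = VersionedStore()
store.index("documentA", {"text": "Hello", "ids": [1]})

# Two writers race to add ID #2 and ID #3 to the same document.
threads = [threading.Thread(target=add_id, args=(store, "documentA", i)) for i in (2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_version, final_doc = store.get("documentA")
```

Whichever writer loses the race gets a conflict, re-reads the doc (which now
contains the other writer's ID) and writes again, so neither ID is lost.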

Alternatively, you could maybe use the update API that takes a script, if
you can express "add my new ID" as a script? See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-update.html
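
The idea behind the scripted update can be sketched as follows: instead of
the client reading and writing back the whole doc, it sends only the change
("add my new ID") and the store applies it in one step. This is an
in-memory illustration, not the real update API; ScriptedUpdateStore and
append_id are hypothetical names, and a Python function stands in for the
script.

```python
import copy
import threading


class ScriptedUpdateStore:
    """Hypothetical in-memory stand-in for a store with server-side updates."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def put(self, doc_id, source):
        with self._lock:
            self._docs[doc_id] = source

    def update(self, doc_id, script, **params):
        # Read, modify and write happen under one lock, so no other
        # writer can slip in between the read and the write.
        with self._lock:
            script(self._docs[doc_id], **params)

    def get(self, doc_id):
        with self._lock:
            return copy.deepcopy(self._docs[doc_id])


def append_id(source, new_id):
    """The 'script': add new_id unless it is already present."""
    if new_id not in source["ids"]:
        source["ids"].append(new_id)


store = ScriptedUpdateStore()
store.put("documentA", {"text": "Hello", "ids": [1]})

# Two writers each send only their change, never a full copy of the doc.
writers = [
    threading.Thread(target=store.update, args=("documentA", append_id), kwargs={"new_id": i})
    for i in (2, 3)
]
for w in writers:
    w.start()
for w in writers:
    w.join()

result = store.get("documentA")
```

Because no client ever holds a stale copy of the doc, there is nothing to
overwrite by accident.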

Mike McCandless

http://blog.mikemccandless.com

(In reply to Peter Webber's message of Sat, Jul 12, 2014 at 5:51 AM.)


(smonasco-2) #5

Sounds like you're searching and using the search results to reindex the doc. Search is not real time, but get is. So after the search you could get the document by its ID and base your change on that copy instead of the search hit. Also make sure your application isn't stepping on itself: either run the writes serially or use some form of locking.
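
A toy model of the difference, as a sketch only: NearRealTimeIndex is a
hypothetical in-memory class in which search sees the state as of the last
refresh while get always sees the latest write. When and how a real cluster
refreshes depends on its configuration, which is not modelled here.

```python
import copy


class NearRealTimeIndex:
    """Hypothetical model: get is real time, search lags until refresh()."""

    def __init__(self):
        self._live = {}        # what a get by ID sees
        self._searchable = {}  # what search sees (state at last refresh)

    def index(self, doc_id, source):
        self._live[doc_id] = copy.deepcopy(source)

    def refresh(self):
        self._searchable = copy.deepcopy(self._live)

    def get(self, doc_id):
        return copy.deepcopy(self._live[doc_id])

    def search_by_text(self, text):
        return [
            (doc_id, copy.deepcopy(src))
            for doc_id, src in self._searchable.items()
            if src["text"] == text
        ]


idx = NearRealTimeIndex()
idx.index("documentA", {"text": "Hello", "ids": [1]})
idx.refresh()

# Add ID #2: search only to find the doc ID, then get the current doc.
doc_id, _ = idx.search_by_text("Hello")[0]
doc = idx.get(doc_id)
doc["ids"].append(2)
idx.index(doc_id, doc)  # no refresh has happened yet

# A second search still returns the old state; a get does not.
stale = idx.search_by_text("Hello")[0][1]
fresh = idx.get(doc_id)

# Add ID #3 based on the fresh copy, so ID #2 survives.
fresh_for_update = idx.get(doc_id)
fresh_for_update["ids"].append(3)
idx.index(doc_id, fresh_for_update)
final = idx.get(doc_id)
```

Basing the second update on `stale` instead would reproduce the lost ID #2
from the original question.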


