Eventual consistency of SEARCH on something that I've just indexed

Hi

I've run into a problem. A single process accesses ES to search for some documents, does some local calculations and updates them in bulk. A few milliseconds later it is unlikely, but possible, that I need to repeat the process: do another search and possibly re-update some of the same documents.

The problem is that, due to eventual consistency, it's perfectly possible that I retrieve an old version of a document and my calculations come out wrong (I'm not even going into the possible problems of multiple concurrent updates; let's focus on a single process).

I know this is bad design, but what are my options to work around this problem? How can I make sure that I always get the latest version of the documents?

These are the options that come to mind:

  1. Do some local document version tracking and retry the _search until I get the latest version (which sounds very inefficient).

  2. Reduce the refresh interval of ES (or whatever this setting is called) so it's less likely for this to happen (at the cost of performance).

  3. Move all my calculations server side (do massive _update_by_query requests) where server-side scripting does all the magic, AND ensure that retry_on_conflict is set so the final result is consistent (see the sketch below).
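
For illustration, a rough, hypothetical sketch of option 3, assuming ES 2.3+ where _update_by_query is available. Note that _update_by_query takes a conflicts=proceed parameter rather than retry_on_conflict, which belongs to the single-document _update API; the index, field and script below are made up:

# Recompute "amount" server side for all docs matching k=1.
# conflicts=proceed tells ES to keep going (and report) on version conflicts.
curl -XPOST "http://localhost:9200/test/_update_by_query?conflicts=proceed" -d'
{
  "query"  : { "term"   : { "k" : 1 } },
  "script" : { "inline" : "ctx._source.amount += 1" }
}'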

I'm tempted to follow my gut and go straight to option 3 (move my code server side and rely on conflict handling), but perhaps there are other options that I'm missing?

Appreciate any help! Thanks!

When you run a GET, it will be consistent and you will get the right version.
Not the same story if you use SEARCH though.

Why do you think it won't be consistent with GET?

I'm updating a set of documents and the conditions are variable, so GET is not an option (I actually do need to do a SEARCH), but it's perfectly possible that these searches retrieve an older version of the documents.

Let me rephrase the problem: in the first operation I retrieve a set of documents from multiple shards; in the second operation I retrieve another set of documents which might intersect with the first set that I've just updated.

I need to update those documents in bulk, and my calculations will be wrong if I can't rely on the latest versions of those documents. (From what you suggested, it looks like GET reads from the translog of the primary shard, so the latest version is available, but SEARCH does not, as it needs to distribute the search; am I right?)

Thank you for your response

If your index has been refreshed, you can use preference to search in primaries only.

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_search_options.html#_preference

Might help.
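
For instance, a minimal sketch (assuming a local node and an index named test):

curl -XGET "http://localhost:9200/test/_search?preference=_primary" -d'{ "query" : { "match_all" : {} } }'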


Thank you @dadoonet really appreciate your help on this.

By the way, can you point me to where I can find information on how retry_on_conflict detects what actually counts as a conflict? Was my idea even feasible for the given scenario?

Elasticsearch keeps an internal version number which is incremented each time you change a document.

Let's say you index a doc:

"foo": "bar1"

It gets version 1

You update it:

"foo": "bar2"

It gets version 2.
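
As a concrete sketch with curl (assuming a local node; the index and type names are made up):

curl -XPUT "http://localhost:9200/test/AA/1" -d'{ "foo" : "bar1" }'
# response includes "_version":1
curl -XPUT "http://localhost:9200/test/AA/1" -d'{ "foo" : "bar2" }'
# response includes "_version":2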

Let's say that two concurrent users are updating the same doc.

Request of user1 comes first:

"foo": "bar2"
"user": "1"

Version is 3

Request of user2 comes after:

"foo": "bar3"

Version is 4

What will happen in that case?

You will end up with a document which looks like:

"foo": "bar3"

You lose the changes made by user1.

To avoid that, you can specify on which version you are working.

Basically, user1 will send its request saying that the expected version for the document is 2.
User2 will do the same.

What will happen?

This will work:

"foo": "bar2"
"user": "1"

Version will be 3

This will fail:

"foo": "bar3"

Because Elasticsearch is only allowed to modify the document if its version is still 2.
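
With the HTTP API this looks roughly like the following (a sketch, reusing the hypothetical doc from above):

# user1 sends its change, expecting version 2: succeeds, doc moves to version 3
curl -XPUT "http://localhost:9200/test/AA/1?version=2" -d'{ "foo" : "bar2", "user" : "1" }'
# user2 also expects version 2: rejected with HTTP 409 (version conflict)
curl -XPUT "http://localhost:9200/test/AA/1?version=2" -d'{ "foo" : "bar3" }'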

When you are using the _update API, Elasticsearch can try to deal with that and, if possible, merge the changes.
This is where the retry_on_conflict option plays a role: Update API | Elasticsearch Guide [2.3] | Elastic

In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.
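
For example (a sketch; the partial-doc body is made up):

# Retry the update up to 3 times if another process changes the doc
# between the internal get and index steps.
curl -XPOST "http://localhost:9200/test/AA/1/_update?retry_on_conflict=3" -d'{ "doc" : { "foo" : "bar3" } }'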

I hope it makes sense.


You can also read this: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/optimistic-concurrency-control.html


@dadoonet

I'm afraid preference=_primary is not helping.

The script below should reproduce the case:

GET by id works OK, but /_search with preference=_primary does not retrieve the latest document:

#!/bin/bash
# Reproduce the issue against a local Elasticsearch node.
URL="http://localhost:9200"
#PREFERENCE=
PREFERENCE="?preference=_primary"

echo "Removing test index"
curl -XDELETE $URL/test
echo ""

echo "Creating 1st version (amount=1.0)"
curl -XPUT   $URL/test/AA/1         -d'{ "k" : 1, "amount" : 1.0 }'
echo ""

echo "Retrieving 1st version"
curl -XGET   $URL/test/AA/1
echo ""

echo "Searching 1st version by key"
curl -XGET   $URL/test/AA/_search${PREFERENCE}   -d'{ "filter" : { "term" : { "k" : 1 }}}'
echo ""

echo "Updating amount"
curl -XPOST  $URL/test/AA/1/_update -d'{ "doc" : { "amount" : 2.0 } }'
echo ""

echo "Retrieving 2nd version"
curl -XGET   $URL/test/AA/1
echo ""

echo "Searching 2nd version by key"
curl -XGET   $URL/test/AA/_search${PREFERENCE}   -d'{ "filter" : { "term" : { "k" : 1 }}}'
echo ""

#echo "Removing existing documents"
#curl -XDELETE $URL/test/AA/1
#echo ""

Running the script produces:

Removing test index
{"acknowledged":true}
Creating 1st version (amount=1.0)
{"_index":"test","_type":"AA","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}
Retrieving 1st version
{"_index":"test","_type":"AA","_id":"1","_version":1,"found":true,"_source":{ "k" : 1, "amount" : 1.0 }}
Searching 1st version by key
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Updating amount
{"_index":"test","_type":"AA","_id":"1","_version":2,"_shards":{"total":2,"successful":1,"failed":0}}
Retrieving 2nd version
{"_index":"test","_type":"AA","_id":"1","_version":2,"found":true,"_source":{"k":1,"amount":2.0}}
Searching 2nd version by key
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

Indexed and updated documents are not immediately available for search in Elasticsearch. Each index has a refresh interval, which defaults to 1s, that periodically makes indexed or updated documents available for search. The reason for this is that a refresh is an expensive operation, best performed periodically on a set of documents rather than individually for each document. As you seem to be running your commands in quick sequence, the refresh may not have had time to happen by the time you search for the document, which is why you do not see any results.

You can explicitly force a refresh on an index, but beware that this will affect performance negatively if performed often.
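
For reference, a rough sketch of both knobs, against the test index from the script above:

# Force an immediate refresh of the test index.
curl -XPOST "http://localhost:9200/test/_refresh"
# Or lower the index's refresh interval (trading performance for freshness).
curl -XPUT "http://localhost:9200/test/_settings" -d'{ "index" : { "refresh_interval" : "200ms" } }'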


Thank you @Christian_Dahlqvist

That's what I suspected, but since GET does read the record from the translog, I thought that preference=_primary would honor that and fetch the record from the primary's translog if it had not been refreshed yet; apparently it does not.

By the way, that would be a killer feature for my use case, as I count on always having the latest version.

We are currently migrating an application that uses a RDBMS as the backend, and in SQL-land, doing an INSERT followed by a SELECT within the same session will always provide consistent results.

I've fixed this by using refresh=true in my update request to force a refresh of the affected index right after updating it, so the document will be available for the subsequent search.
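
Concretely, reusing the update from my script above (a sketch):

# The update is made searchable before the call returns.
curl -XPOST "http://localhost:9200/test/AA/1/_update?refresh=true" -d'{ "doc" : { "amount" : 2.0 } }'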

I think this is the cheapest option that I have (issuing a pre-emptive refresh across a few dozen possible indexes would be much worse).

Thank you all for your help