Eventual consistency of SEARCH on something that I've just indexed

Hi

I've run into a problem. A single process accesses ES to search for some documents, does some local calculations and updates them in bulk. A few milliseconds later it is unlikely, but possible, that I need to repeat the process: do another search and possibly re-update some of the same documents.

The problem is that, due to eventual consistency, it's perfectly possible that I retrieve an old version of a document and my calculations come out wrong (I'm not even going into the possible problems of multiple concurrent updates; let's focus on a single process).

I know this is bad design, but what are my options to work around this problem? How can I make sure that I always get the latest version of the documents?

These are the options that come to mind:

  1. Do some local document version tracking and retry the _search until I get the latest version (which sounds very inefficient).

  2. Reduce the refresh interval of ES (or whatever this setting is called) so it's less likely for this to happen (at the cost of performance).

  3. Move all my calculations server side (do massive _update_by_query requests) where server-side scripting does all the magic, AND ensure that retry_on_conflict is set so the final result is consistent (see the sketch below).
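
For illustration, a rough, hypothetical sketch of option 3, assuming ES 2.3+ where _update_by_query is available. Note that _update_by_query takes a conflicts=proceed parameter rather than retry_on_conflict, which belongs to the single-document _update API; the index, field and script below are made up:

# Recompute "amount" server side for all docs matching k=1.
# conflicts=proceed tells ES to keep going (and report) on version conflicts.
curl -XPOST "http://localhost:9200/test/_update_by_query?conflicts=proceed" -d'
{
  "query"  : { "term"   : { "k" : 1 } },
  "script" : { "inline" : "ctx._source.amount += 1" }
}'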

I'm tempted to follow my gut and go straight to option 3 (move my code server side and rely on conflict handling), but perhaps there are other options that I'm missing?

Appreciate any help! Thanks!

When you run a GET, it will be consistent and you will get the right version.
Not the same story if you use SEARCH though.

Why do you think it won't be consistent with GET?

I'm updating a set of documents and the conditions are variable, so GET is not an option (I actually do need to do a SEARCH), but it's perfectly possible that these searches retrieve an older version of the documents.

Let me rephrase the problem: in the first operation I retrieve a set of documents from multiple shards; in the second operation I retrieve another set of documents which might intersect with the first set that I've just updated.

I need to update those documents in bulk, and my calculations will be wrong if I can't rely on the latest versions of those documents. (From what you suggested, it looks like GET reads from the translog of the primary shard, so the latest version is available, but SEARCH does not, as it needs to distribute the search; am I right?)

Thank you for your response

If your index has been refreshed, you can use preference to search in primaries only.

https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_search_options.html#_preference

Might help.
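
For instance, a minimal sketch (assuming a local node and an index named test):

curl -XGET "http://localhost:9200/test/_search?preference=_primary" -d'{ "query" : { "match_all" : {} } }'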


Thank you @dadoonet really appreciate your help on this.

By the way, can you point me to where I can find information on how retry_on_conflict detects what actually counts as a conflict? Was my idea even feasible for the given scenario?

Elasticsearch keeps an internal version number which is incremented each time you change a document.

Let's say you index a doc:

"foo": "bar1"

It gets version 1

You update it:

"foo": "bar2"

It gets version 2.
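
As a concrete sketch with curl (assuming a local node; the index and type names are made up):

curl -XPUT "http://localhost:9200/test/AA/1" -d'{ "foo" : "bar1" }'
# response includes "_version":1
curl -XPUT "http://localhost:9200/test/AA/1" -d'{ "foo" : "bar2" }'
# response includes "_version":2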

Let's say that two concurrent users are updating the same doc.

Request of user1 comes first:

"foo": "bar2"
"user": "1"

Version is 3

Request of user2 comes after:

"foo": "bar3"

Version is 4

What will happen in that case?

You will end up with a document which looks like:

"foo": "bar3"

You lose the changes made by user1.

To avoid that, you can specify on which version you are working.

Basically, user1 will send its request saying that the expected version for the document is 2.
User2 will do the same.

What will happen?

This will work:

"foo": "bar2"
"user": "1"

Version will be 3

This will fail:

"foo": "bar3"

Because Elasticsearch is only allowed to modify the document if its version is still 2.
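
With the HTTP API this looks roughly like the following (a sketch, reusing the hypothetical doc from above):

# user1 sends its change, expecting version 2: succeeds, doc moves to version 3
curl -XPUT "http://localhost:9200/test/AA/1?version=2" -d'{ "foo" : "bar2", "user" : "1" }'
# user2 also expects version 2: rejected with HTTP 409 (version conflict)
curl -XPUT "http://localhost:9200/test/AA/1?version=2" -d'{ "foo" : "bar3" }'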

When you are using the _update API, Elasticsearch can try to deal with that and, if possible, merge the changes.
This is where the retry_on_conflict option plays a role: Update API | Elasticsearch Guide [2.3] | Elastic

In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.
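
For example (a sketch; the partial-doc body is made up):

# Retry the update up to 3 times if another process changes the doc
# between the internal get and index steps.
curl -XPOST "http://localhost:9200/test/AA/1/_update?retry_on_conflict=3" -d'{ "doc" : { "foo" : "bar3" } }'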

I hope it makes sense.


You can also read this: https://www.elastic.co/guide/en/elasticsearch/guide/2.x/optimistic-concurrency-control.html


@dadoonet

I'm afraid preference=_primary is not helping.

The script below should reproduce the case:

GET by id works OK, but /_search with preference=_primary does not retrieve the latest document:

#!/bin/bash
# Reproduce the issue against a local Elasticsearch node.
URL="http://localhost:9200"
#PREFERENCE=
PREFERENCE="?preference=_primary"

echo "Removing test index"
curl -XDELETE $URL/test
echo ""

echo "Creating 1st version (amount=1.0)"
curl -XPUT   $URL/test/AA/1         -d'{ "k" : 1, "amount" : 1.0 }'
echo ""

echo "Retrieving 1st version"
curl -XGET   $URL/test/AA/1
echo ""

echo "Searching 1st version by key"
curl -XGET   $URL/test/AA/_search${PREFERENCE}   -d'{ "filter" : { "term" : { "k" : 1 }}}'
echo ""

echo "Updating amount"
curl -XPOST  $URL/test/AA/1/_update -d'{ "doc" : { "amount" : 2.0 } }'
echo ""

echo "Retrieving 2nd version"
curl -XGET   $URL/test/AA/1
echo ""

echo "Searching 2nd version by key"
curl -XGET   $URL/test/AA/_search${PREFERENCE}   -d'{ "filter" : { "term" : { "k" : 1 }}}'
echo ""

#echo "Removing existing documents"
#curl -XDELETE $URL/test/AA/1
#echo ""

Running the script produces:

Removing test index
{"acknowledged":true}
Creating 1st version (amount=1.0)
{"_index":"test","_type":"AA","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}
Retrieving 1st version
{"_index":"test","_type":"AA","_id":"1","_version":1,"found":true,"_source":{ "k" : 1, "amount" : 1.0 }}
Searching 1st version by key
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
Updating amount
{"_index":"test","_type":"AA","_id":"1","_version":2,"_shards":{"total":2,"successful":1,"failed":0}}
Retrieving 2nd version
{"_index":"test","_type":"AA","_id":"1","_version":2,"found":true,"_source":{"k":1,"amount":2.0}}
Searching 2nd version by key
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

Indexed and updated documents are not immediately available for search in Elasticsearch. Each index has a refresh interval, which defaults to 1s, that periodically makes indexed or updated documents available for search. The reason for this is that a refresh is an expensive operation, best performed periodically on a set of documents rather than individually for each document. As you seem to be running your commands in quick sequence, the refresh may not have had time to happen by the time you search for the document, which is why you do not see any results.

You can explicitly force a refresh on an index, but beware that this will affect performance negatively if performed often.
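
For reference, a rough sketch of both knobs, against the test index from the script above:

# Force an immediate refresh of the test index.
curl -XPOST "http://localhost:9200/test/_refresh"
# Or lower the index's refresh interval (trading performance for freshness).
curl -XPUT "http://localhost:9200/test/_settings" -d'{ "index" : { "refresh_interval" : "200ms" } }'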


Thank you @Christian_Dahlqvist

That's what I suspected, but since GET does read the record from the translog, I thought that preference=_primary would honor that and fetch the record from the primary's translog if it had not been refreshed yet; apparently it does not.

By the way, that would be a killer feature for my use case, as I count on always having the latest version.

We are currently migrating an application that uses a RDBMS as the backend, and in SQL-land, doing an INSERT followed by a SELECT within the same session will always provide consistent results.

I've fixed this by using refresh=true in my update request to force a refresh of the affected index right after updating it, so the document will be available for the subsequent search.
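
Concretely, reusing the update from my script above (a sketch):

# The update is made searchable before the call returns.
curl -XPOST "http://localhost:9200/test/AA/1/_update?refresh=true" -d'{ "doc" : { "amount" : 2.0 } }'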

I think this is the cheapest option that I have (issuing a pre-emptive refresh across a few dozen possible indexes would be much worse).

Thank you all for your help