Decrementing a counter concurrently in ElasticSearch using versioning fails


(Dev Jyoti Behera) #1

Hi,

I am using ES 5.5 (AWS ElasticSearch service).

I am trying to use a shared ES document to assign ids to multiple machines which have access to this index.

Basically, the document looks like this:

index/my_type/count
{
"count": 4
}

Assume there are 4 instances here.
I would like to assign to machines, unique ids ranging from 0-3 (including 3).
So, I will need to make at least 4 requests to decrement and update the count field.
In fact, I may need more requests since there may be versioning conflicts.

I could use the script based update method, with which the get and decrement will be atomic:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/docs-update.html
However, I need the actual value after updating, before any other machine has changed it.
Also, I am unable to run the script given in the link (btw, the index is hosted on AWS ElasticSearch).

So, I came up with an alternate method based on the version mechanism.
This is what I did (pseudocode, actual impl is in Java):

 while (True) {

    doc = get_document("/index/my_type/docCount")
    version = extract_version(doc)
    count = get_count(doc)
    count = count - 1
    // update the count only if the document is still at the version we read
    response = putDocument("/index/my_type/docCount?version=" + version, "{\"count\" : " + count + "}")

    if (response.status == 200) {
        refreshIndex("/index")
        assign_machine_id(count)
        break
    }
    // any other status (e.g. 409 version conflict): loop and try again
}

As I understand it, on each iteration of this code you either get a version conflict, in which case you try again, or you manage to decrement the counter, in which case you assign yourself that counter value.
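
To make the intended behaviour concrete, here is a minimal Python sketch of the same read, decrement, versioned-write loop. It runs against an in-memory stand-in rather than a real cluster: `VersionedStore`, `claim_id` and `run_machines` are hypothetical names invented for this illustration, not part of any Elasticsearch client. It assumes every write is either acknowledged or rejected with a conflict, which is the assumption the loop relies on.

```python
import threading


class VersionConflict(Exception):
    """Raised when the supplied version does not match the stored one."""


class VersionedStore:
    """In-memory stand-in for one versioned document (not an ES client)."""

    def __init__(self, count):
        self._lock = threading.Lock()
        self._count = count
        self._version = 1  # version right after creation

    def get(self):
        with self._lock:
            return self._count, self._version

    def put(self, count, expected_version):
        # Accept the write only if nobody changed the document since we read it.
        with self._lock:
            if expected_version != self._version:
                raise VersionConflict()
            self._count = count
            self._version += 1
            return self._version


def claim_id(store):
    """Retry the read/decrement/write cycle until a versioned write succeeds."""
    while True:
        count, version = store.get()
        try:
            store.put(count - 1, version)
        except VersionConflict:
            continue  # someone else won the race; read again
        return count - 1


def run_machines(n):
    """Let n threads (standing in for machines) each claim one id."""
    store = VersionedStore(n)
    ids = []
    ids_lock = threading.Lock()

    def worker():
        assigned = claim_id(store)
        with ids_lock:
            ids.append(assigned)

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(ids), store.get()
```

Under that assumption, `run_machines(4)` hands out the ids 0 to 3 exactly once each and leaves the document at version 5 (one creation plus four decrements).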

And this worked for a while, until today I noticed that for a setup where
the counter was initially 4,
there were 4 machines,
the ids assigned were 3, 2, 1 and -1.

When I checked the document that holds the counter, its version was 6 (so it was decremented 5 times after creation, when it should have been decremented 4 times).

In the application, the decrement function is called in only one place. So, every machine calls it only once.
If it were called multiple times, that might explain an extra decrement coming from elsewhere.

It seems like for the machine that got -1, the counter was decremented twice, but the first time ElasticSearch decremented it successfully, it still reported a version conflict, so the machine tried to decrement once more.

Is this possible?
Or am I doing something wrong?
Is there something else I can try?

Sorry for the long message, but, I wanted to give as much context as possible.

Thanks in advance!

  • Dev

(Boaz Leskes) #2

There are some things that can go wrong, including a rare but valid bug in ES. I think that in your case the most likely cause is a disconnect or a timeout between your code and ES while the indexing request is still in flight. At that point you simply don't know whether the request completed successfully (and the counter has already changed) or the disconnect happened before the request left the TCP buffers of your local machine (and the counter was not changed).
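
As an illustration of that failure mode, the following Python sketch simulates a write that is applied while its acknowledgement is lost. It is only a model of one possible explanation, not a confirmed diagnosis: `FlakyStore`, `LostResponse` and `claim_id_naive` are hypothetical names, and the sketch assumes the client treats a lost response like any other failure and simply retries.

```python
class VersionConflict(Exception):
    """Raised when the supplied version does not match the stored one."""


class LostResponse(Exception):
    """Simulates a timeout/disconnect after the write was already applied."""


class FlakyStore:
    """In-memory versioned document that loses the ack of one chosen write."""

    def __init__(self, count, drop_ack_on_write):
        self.count = count
        self.version = 1
        self.writes = 0
        # 1-based index of the successful write whose response never arrives
        self.drop_ack_on_write = drop_ack_on_write

    def get(self):
        return self.count, self.version

    def put(self, count, expected_version):
        if expected_version != self.version:
            raise VersionConflict()
        # The write is applied first ...
        self.count = count
        self.version += 1
        self.writes += 1
        # ... and only then does the response get lost.
        if self.writes == self.drop_ack_on_write:
            raise LostResponse()
        return self.version


def claim_id_naive(store):
    """Retry on any failure, without checking whether the write landed."""
    while True:
        count, version = store.get()
        try:
            store.put(count - 1, version)
        except (VersionConflict, LostResponse):
            continue
        return count - 1


store = FlakyStore(4, drop_ack_on_write=4)
ids = [claim_id_naive(store) for _ in range(4)]
```

With the acknowledgement of the fourth write dropped, the last caller decrements twice: `ids` ends up as `[3, 2, 1, -1]` and the document is at version 6, which matches the symptoms described in the first post.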

To work around this, a better approach is to come up with a system that, upon error, allows you to introspect the document to see whether the request came through or not, for example by keeping a set of UUIDs so that you can check whether a given UUID was removed.
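
One way to sketch that idea in Python is shown below. It is a variant of the suggestion rather than a literal implementation of it: instead of removing UUIDs from a set, each client records its own UUID next to the id it claims, so that after an error it can re-read the document and see whether its earlier write landed. `ClaimStore`, `LostResponse` and `claim_id_idempotent` are hypothetical names, and the store is an in-memory stand-in, not an Elasticsearch client.

```python
import uuid


class VersionConflict(Exception):
    """Raised when the supplied version does not match the stored one."""


class LostResponse(Exception):
    """Simulates a timeout/disconnect after the write was already applied."""


class ClaimStore:
    """In-memory versioned document mapping each claimed id to a client token."""

    def __init__(self, slots, drop_ack_on_write=None):
        self.slots = slots
        self.claims = {}  # id -> token of the client that owns it
        self.version = 1
        self.writes = 0
        self.drop_ack_on_write = drop_ack_on_write

    def get(self):
        return dict(self.claims), self.version

    def put(self, claims, expected_version):
        if expected_version != self.version:
            raise VersionConflict()
        self.claims = dict(claims)
        self.version += 1
        self.writes += 1
        if self.writes == self.drop_ack_on_write:
            raise LostResponse()  # write applied, response lost
        return self.version


def claim_id_idempotent(store, token):
    """Claim one id; on any error, re-read and check for our own token first."""
    while True:
        claims, version = store.get()
        for slot, owner in claims.items():
            if owner == token:
                return slot  # an earlier write of ours did go through
        free = [s for s in range(store.slots) if s not in claims]
        if not free:
            raise RuntimeError("no free ids left")
        slot = max(free)
        claims[slot] = token
        try:
            store.put(claims, version)
        except (VersionConflict, LostResponse):
            continue  # outcome unknown or rejected: introspect on the next pass
        return slot


store = ClaimStore(4, drop_ack_on_write=4)
ids = [claim_id_idempotent(store, str(uuid.uuid4())) for _ in range(4)]
```

Here the fourth write again loses its acknowledgement, but the retry finds the client's token already recorded and returns the same id instead of claiming a second one, so `ids` is `[3, 2, 1, 0]` and the document stops at version 5.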

You mention that you are trying to generate unique ids using a counter. I presume this is not really what you're doing, because a counter doesn't give you that (suppose the node that holds id 1 disconnects and the counter goes from 4 to 3: the next time a node asks for an id it will get 4, which is wrong). If you can describe what you're trying to do, I can try to help.

