Why is the GET API not holding to it's realtime promise?

Elasticsearch Version: 2.4

We have 3 master nodes and 2 data nodes. Our index has 8 shards and 1 replica. Our nodes are hosted on EC2. The below is from our (at the time mostly empty) staging deployment. Our production deployment is identical other than node naming:

We are experiencing this issue in both our Production deployment, and a similar staging deployment. This issue does not occur in our single-node development and testing ES deployments.

Application Flow

In our application, we use a lock-controlled read-update-write pattern to add nested objects to parent documents. A given document can have from 1-60 'tokens' (nested objects) added to it.

We have ~80 queue worker threads which process tokens out of a backlog and add them to their parent documents in the index.

The code flow to process a token works as follows:

  1. Queue worker gets a lock on the document the token belongs to.
  2. Queue worker retrieves the document using the GET API.
  3. Queue worker adds token information to the document.
  4. Queue worker writes the document back into the index
  5. Queue worker releases the lock on the parent document

Once a given queue worker has finished, another may grab the lock on the document and update it with a different token.

Problem

The problem we are having is that the GET API doesn't always return the most recently indexed copy of the document. The GET API returns an outdated copy of the document for 1-3 seconds after the new version is sent to the index.

Here is a GIST of log output demonstrating the issue: https://gist.github.com/brianV/be00761436f219efb8f057e42838e13f

Below is a small snippet:

[2016-12-29 10:43:46] app.INFO: Elastica version check: getDocument() called. Version 52 of 1:69 returned [] []
[2016-12-29 10:43:47] app.INFO: Elastica version check: Called addDocument() for 1:69 [] []
[2016-12-29 10:43:47] app.INFO: Elastica version check: getDocument() called. Version 52 of 1:69 returned [] []
[2016-12-29 10:43:47] app.INFO: Elastica version check: Called addDocument() for 1:69 [] []
[2016-12-29 10:43:47] app.INFO: Elastica version check: getDocument() called. Version 52 of 1:69 returned [] []
[2016-12-29 10:43:47] app.INFO: Elastica version check: Called addDocument() for 1:69 [] []
[2016-12-29 10:43:48] app.INFO: Elastica version check: getDocument() called. Version 55 of 1:69 returned [] []
[2016-12-29 10:43:48] app.INFO: Elastica version check: Called addDocument() for 1:69 [] []

The getDocument() call is an Elastica (http://elastica.io/) wrapper around the GET API. We output the version retuned from Elastic.
The addDocument() call is an Elastica wrapper around PUT requests into the index.

The steps in the log above are as follows:

  1. Version 52 is returned in the first getDocument() call. It's updated by a worker and submitted back into the index (which creates version 53).
  2. getDocument() is called by another worker, which returns version 52 again rather than version 53. It updates it and indexes to create version 54 (overwriting the changes that went into 53 as 54 is based on 52).
  3. The next worker does the same to create version 55 (overwriting the version 54 changes).
  4. Finally, version 55 is returned.

This pattern will continue - an outdated version is returned by the GET API for 1-3 seconds after a new version is committed. It's like the version returned by the GET API is only updated every few seconds rather than in real time.

My immediate thought was that GET requests are getting cached somewhere in between, but we've eliminated that possibility.

I can guarantee that the lock is working. I verified this by making each worker thread sleep for 5 seconds before releasing the lock. This also had the side effect of making the system worked properly - each getDocument() request returned the next version, presumably because the threads were working slower than the 1-3 second delay we are seeing. Unfortunately, this made performance miserable for the project.

I'd really appreciate any thoughts as to what's going on here!

I'm not familiar with this client so can't directly help there.
Can you replicate this with a simple curl example?

Does the Elastica client can wait for an "acknowledge" when you put the new version, before releasing the lock ?

Question: How many nodes in your cluster ?

It appears something somewhere was caching the GET requests, although we aren't sure why. We set up an internal DNS entry that routed the traffic within AWS, and the issue evaporated.

It's not clear yet where they were getting cached or why, but that's the apparent case.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.