Issue with binary field with real time GET operations


I'm trying to deal with the following issue.

I'm using the real-time GET API from Java. The JSON object I store in ES contains a binary field that represents an XML file. This field is sometimes stored (PUT operation) compressed with GZIP, and sometimes not.

Since "_source" compresses its content and I don't want my binary field to be compressed twice, I have excluded this binary field from "_source" and, at the same time, declared it with "store=yes".

With this configuration, and an index of "mmapfs" type, I perform the following operations:

See PUT and GET operations.

The document ID is a long string (complicated, I know) and, in this case, the XML is sent compressed with GZIP (it's the last field, "cache.response").

The first GET operation gets the document UNCOMPRESSED, whilst the second GET gets the document COMPRESSED (as it should).

From this point, any GET operation returns the document correctly compressed. The version number is always "1" for all operations.

But this DOES NOT HAPPEN consistently. After a PUT operation with a GZIP-compressed XML, sometimes I get the document uncompressed and sometimes compressed. After some tests, my impression is that only the GET operations I perform immediately after a PUT (within the same second) return the uncompressed document. Every GET issued 1, 2, 3 or more seconds later returns the compressed document.
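To tell the two cases apart reliably, the returned bytes can be checked for the GZIP magic number (0x1f 0x8b) instead of comparing sizes by eye. This is only a minimal sketch using the JDK: the class and method names are hypothetical, and it assumes the field value has already been decoded into a byte array on the client side.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the Elasticsearch API.
final class GzipCheck {
    private GzipCheck() {}

    /** Returns true if the payload starts with the GZIP magic bytes 0x1f 0x8b. */
    static boolean isGzip(byte[] payload) {
        return payload != null
                && payload.length >= 2
                && (payload[0] & 0xff) == 0x1f
                && (payload[1] & 0xff) == 0x8b;
    }

    /** Compresses a string with GZIP; used here only to build sample input. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Writing to an in-memory buffer should not fail.
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }
}
```

Calling `GzipCheck.isGzip(bytes)` on the value of "cache.response" right after the PUT, and again a few seconds later, would show which form each GET returned.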

I don't know what is happening; maybe (probably, since I'm new to ES) I'm doing something wrong. More information:

  1. Just one node with one index of "mmapfs" type ("disk_idx") and one index of "memory" type ("memory_idx")

  2. One local client:
    this.node = NodeBuilder.nodeBuilder().settings(s).local(true).data(true).client(false).node();
    this.client = this.node.client();

  3. OS version: Linux version 2.6.32-431.3.1.el6.x86_64 ( (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Fri Jan 3 21:39:27 UTC 2014

  4. Apache-Tomcat-7.0.57, Elasticsearch 1.7 embedded in our webapp, Java JDK 1.7.0_71.

  5. My mapping.

Any help will be really appreciated.



I'm not sure how ES handles compressed objects like that but I don't believe it decompresses them. Maybe someone else can comment here.

However, is there any reason you are using that massive field, which actually appears to be data, as the _id?


I'm using ES as a cache of XML documents, so I need the real-time GET API, and this ID is the cache ID built from values in the XML (separated by '|'). I don't think I can use anything else as the ID if I want to use the real-time API. I could transform this ID into an integer using its hashcode (for example), but then I would have to run a query on both the hashcode and the cache ID (because several cache IDs can map to the same hashcode). And since that is a query, I would lose the real-time behaviour.
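To make the collision concern concrete, here is a minimal sketch in plain Java. The class name is hypothetical. It also shows a longer digest (SHA-256) as one possible fixed-length ID; whether such a digest would be acceptable as the _id for this cache is an assumption on my part, not something I have tested against ES.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for deriving a fixed-length ID from the long cache ID.
final class CacheIds {
    private CacheIds() {}

    /** Returns the SHA-256 digest of the cache ID as 64 lowercase hex characters. */
    static String sha256Hex(String cacheId) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(cacheId.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

For example, "Aa" and "BB" have the same String.hashCode() (both 2112), which is the kind of clash that would force the extra query, whereas `CacheIds.sha256Hex("Aa")` and `CacheIds.sha256Hex("BB")` differ.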

But if someone has any idea to improve this, it would be welcome.

Regarding this strange behaviour with the first GET after the PUT, I hope someone can clarify it.