Issue storing binary fields in 6.2.2

My application stores long text values (>32K) in a binary field. The code works well in 5.2.1. However, after upgrading to 6.2.2 I am seeing strange behavior: the blob of one document gets attached to other documents.

Here are the steps to reproduce:

  1. Create index

    PUT /blob_test
    {
      "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
      "mappings": {
        "_doc": {
          "dynamic": "false",
          "_source": { "enabled": true },
          "properties": {
            "blob": { "type": "binary", "store": false, "doc_values": true }
          }
        }
      }
    }

  2. Load data

    public static void main(String[] args) {
        try (RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            StringBuilder reqBuf = new StringBuilder();
            for (int i = 0; i < 10; i++) {
                String id = "id_" + i;
                // add the id to make each blob distinct
                String blob = Base64.getEncoder().encodeToString((id + " and some text").getBytes());
                reqBuf.append("{\"index\" : {\"_id\" : \"" + id + "\"}}\n");
                reqBuf.append("{\"blob\" : \"" + blob + "\"}\n");
            }
            HttpEntity entity = new NStringEntity(reqBuf.toString(), ContentType.APPLICATION_JSON);
            restClient.performRequest("POST", "/blob_test/_doc/_bulk",
                    Collections.emptyMap(), entity);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

  3. Run query

    POST /blob_test/_search?pretty
    {
      "_source": true,
      "docvalue_fields": ["blob"]
    }

The Base64-encoded values in `_source` and `fields` differ for many documents. The value in `_source` is always correct, but the value in `fields` is wrong.

The same steps work in 5.2.1 (except that 5.2.1 does not accept `_doc` as the type name; changing it to `doc` works).

Best regards

Why use binary if it's text that you want to search?

Even though I used a small text in the example, the actual length is > 32K. A keyword field with doc_values enabled throws the exception `DocValuesField is too large, must be <= 32766`.
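For reference, this is the kind of mapping that hits that limit (index and field names here are illustrative, not from my real setup); indexing a document whose value exceeds 32766 bytes fails with the `DocValuesField is too large` error:

```
PUT /kw_test
{
  "mappings": {
    "_doc": {
      "properties": {
        "blob": { "type": "keyword", "doc_values": true }
      }
    }
  }
}
```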

Also, I don't search on this field; it's used only for display.


I see.

Why not use:

"blob": { "type" : "binary" }

And then use source filtering to read just the blob field from the `_source`?
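A source-filtered request along these lines (assuming the `blob_test` index from the reproduction above) would return only the blob from `_source`:

```
POST /blob_test/_search?pretty
{
  "_source": ["blob"]
}
```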

Thanks David.
That's a workaround I thought of too. It will work for me on one index, which is small, but not on the other, which is significantly larger (in TBs).

The large index's data never gets updated. Both indices have a lot of other fields that are indexed. I set `_source` off in the mappings and rely only on doc values. Even with best compression, we get a good amount of savings by turning off `_source` in the mappings. I kept `_source` in this example for two reasons:

  1. To demonstrate that the document reaches Elasticsearch correctly, but something fails during indexing or while reading it back.
  2. To verify the issue easily. It's not easy to spot the difference between Base64-encoded values.
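On that second point, the comparison is easier if you decode both Base64 values back to text before comparing. A minimal sketch (the class name and the two sample values are hypothetical stand-ins for what the search response returns in `_source` and `fields`):

```java
import java.util.Base64;

public class BlobCompare {
    public static void main(String[] args) {
        // hypothetical values copied from one search hit: _source.blob vs fields.blob[0]
        String sourceBlob = Base64.getEncoder().encodeToString("id_3 and some text".getBytes());
        String docValueBlob = Base64.getEncoder().encodeToString("id_7 and some text".getBytes());

        // decoding makes a mismatch obvious at a glance
        String fromSource = new String(Base64.getDecoder().decode(sourceBlob));
        String fromDocValues = new String(Base64.getDecoder().decode(docValueBlob));
        System.out.println("_source : " + fromSource);
        System.out.println("fields  : " + fromDocValues);
        System.out.println("match   : " + fromSource.equals(fromDocValues));
    }
}
```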

Since my current approach works well in 5.2.1, I want to make sure I have not missed something in the mappings or in the query specific to 6.2.2. Secondly, I want to confirm that the binary field with doc values approach is not an anti-pattern and is not being considered for deprecation.

Also, if it turns out to be a 6.2.2 defect, it would be nice to know whether it can affect anything else.


TBH I'm not sure it's a very good thing to store binaries in Elasticsearch.
Anyway, you can probably do something like:

  "data": {
    "type": "binary",
    "store": true

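With `store: true`, the value can then be read back with `stored_fields` instead of doc values. Something like this (assuming an index where `data` is mapped as in the snippet above):

```
POST /blob_test/_search?pretty
{
  "stored_fields": ["data"]
}
```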

Thanks David.

TBH I'm not sure it's a very good thing to store binaries in elasticsearch.

It's not binary like a JPG; the raw value is actually a string longer than 32K. We perform text analysis on this value. Since it's longer than 32K and I need to retrieve the complete string back, I wasn't able to store it as a keyword.

I use two fields: one of type text (used for searching) and one binary (for retrieval).
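A mapping fragment along these lines captures that two-field setup (field names are illustrative; the text field is analyzed for search, the binary field holds the Base64 value for retrieval):

```
"properties": {
  "blob_text": { "type": "text" },
  "blob":      { "type": "binary", "store": true }
}
```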

Coming back to your solution, store:true worked in a small test. I also tried storing a text string larger than 32K; that worked too. I feel store:true is a better option than enabling `_source`. I will try it with the real data.

I understand stored fields are slower to retrieve than docvalue fields (What's the difference between "store" and "doc_values" in field properties?). That should be okay.
But does it have any implications for heap utilization compared to doc_values? I don't search, sort, or aggregate on the binary field.


If you just need to retrieve the value, I think that store is better than doc_values.
doc_values is meant for sorting and aggregations.

But again, I'd store the binary content on a FS, a distributed FS like HDFS, CouchDB, or whatever, and just put a link to the datastore in Elasticsearch.

You'll pay a price in IO at some point when segment merging happens.

My 2 cents

There is a bug here. I opened

I agree that store would be a better fit than doc values for your use case. Doc values might look faster, but they might not scale as well when you have lots of data and want to request multiple fields per document. This is because stored fields keep all values for a given document in a single place, while doc values conceptually have one file per field.


A bugfix has been pushed and will be available in 6.3:

David and Adrien,

I tested with a larger dataset. Using stored fields instead of doc values solved the problem.

Thanks for the solution.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.