Issue storing binary fields in 6.2.2

My application stores long text values (>32K) in a binary field. The code works well in 5.2.1. However, after upgrading to 6.2.2 I am seeing strange behavior: the blob of one document gets attached to other documents.

Here are the steps to reproduce:

  1. Create index

    PUT /blob_test
    {
      "settings": { "number_of_shards": 1, "number_of_replicas": 0 },
      "mappings": {
        "_doc": {
          "dynamic": "false",
          "_source": { "enabled": true },
          "properties": {
            "blob": { "type": "binary", "store": false, "doc_values": true }
          }
        }
      }
    }

  2. Load data

    public static void main(String[] args) {
        try (RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            StringBuilder reqBuf = new StringBuilder();
            for (int i = 0; i < 10; i++) {
                String id = "id_" + i;
                // add the id to make each blob distinct
                String blob = Base64.getEncoder().encodeToString((id + " and some text").getBytes());
                reqBuf.append("{\"index\" : {\"_id\" : \"" + id + "\"}}\n");
                reqBuf.append("{\"blob\" : \"" + blob + "\"}\n");
            }
            HttpEntity entity = new NStringEntity(reqBuf.toString(), ContentType.APPLICATION_JSON);
            restClient.performRequest("POST", "/blob_test/_doc/_bulk",
                    Collections.emptyMap(), entity);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

  3. Run query

    POST /blob_test/_search?pretty
    {
      "_source": true,
      "docvalue_fields": ["blob"]
    }

The Base64-encoded values in `_source` and `fields` differ for many documents. The value in `_source` is always correct, but the value in `fields` is wrong.

The same steps work in 5.2.1 (except that 5.2.1 does not accept `_doc` as the type name; changing it to `doc` works).

Best regards

Why use binary if it's text that you want to search?

Even though I used a small text in the example, the actual length is > 32K. A keyword field with doc_values enabled throws the exception `DocValuesField is too large, must be <= 32766`.
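For reference, this is the kind of mapping that hits that limit (index and field names here are illustrative, not from my real setup); indexing a document whose value exceeds 32766 bytes fails with the `DocValuesField is too large` error:

```
PUT /kw_test
{
  "mappings": {
    "_doc": {
      "properties": {
        "blob": { "type": "keyword", "doc_values": true }
      }
    }
  }
}
```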

Also, I don't search on this field; it's used only for display.


I see.

Why not use:

"blob": { "type" : "binary" }

And then use source filtering to read just the blob field from the `_source`?
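A source-filtered request along these lines (assuming the `blob_test` index from the reproduction above) would return only the blob from `_source`:

```
POST /blob_test/_search?pretty
{
  "_source": ["blob"]
}
```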

Thanks David.
That's a workaround I thought of too. It will work for me on one index, which is small, but not on the other, which is significantly larger (in TBs).

The large index's data never gets updated. Both indices have a lot of other fields that are indexed. I set `_source` off in the mappings and rely only on doc values. Even with best compression, we get a good amount of savings by turning off `_source` in the mappings. I kept `_source` in this example for two reasons:

  1. To demonstrate that the document reaches Elasticsearch correctly, but something fails during indexing or while reading it back.
  2. To verify the issue easily. It's not easy to spot the difference between Base64-encoded values.
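On that second point, the comparison is easier if you decode both Base64 values back to text before comparing. A minimal sketch (the class name and the two sample values are hypothetical stand-ins for what the search response returns in `_source` and `fields`):

```java
import java.util.Base64;

public class BlobCompare {
    public static void main(String[] args) {
        // hypothetical values copied from one search hit: _source.blob vs fields.blob[0]
        String sourceBlob = Base64.getEncoder().encodeToString("id_3 and some text".getBytes());
        String docValueBlob = Base64.getEncoder().encodeToString("id_7 and some text".getBytes());

        // decoding makes a mismatch obvious at a glance
        String fromSource = new String(Base64.getDecoder().decode(sourceBlob));
        String fromDocValues = new String(Base64.getDecoder().decode(docValueBlob));
        System.out.println("_source : " + fromSource);
        System.out.println("fields  : " + fromDocValues);
        System.out.println("match   : " + fromSource.equals(fromDocValues));
    }
}
```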

Since my current approach works well in 5.2.1, I want to make sure I have not missed something in the mappings or in the query specific to 6.2.2. Secondly, I want to confirm that the binary field with doc values approach is not an anti-pattern and is not being considered for deprecation.

Also, if it turns out to be a 6.2.2 defect, it would be nice to know whether it can affect anything else.


TBH I'm not sure it's a very good thing to store binaries in Elasticsearch.
Anyway, you can probably do something like:

  "data": {
    "type": "binary",
    "store": true

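With `store: true`, the value can then be read back with `stored_fields` instead of doc values. Something like this (assuming an index where `data` is mapped as in the snippet above):

```
POST /blob_test/_search?pretty
{
  "stored_fields": ["data"]
}
```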

Thanks David.

TBH I'm not sure it's a very good thing to store binaries in elasticsearch.

It's not binary like a JPG; the raw value is actually a string longer than 32K. We perform text analysis on this value. Since it's longer than 32K and I need to retrieve the complete string back, I wasn't able to store it as a keyword.

I use two fields: one of type text (used for searching) and one binary (for retrieval).
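A mapping fragment along these lines captures that two-field setup (field names are illustrative; the text field is analyzed for search, the binary field holds the Base64 value for retrieval):

```
"properties": {
  "blob_text": { "type": "text" },
  "blob":      { "type": "binary", "store": true }
}
```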

Coming back to your solution, store:true worked in a small test. I also tried storing a text string larger than 32K; that worked too. I feel store:true is a better option than enabling `_source`. I will try it with the real data.

I understand stored fields are slower to retrieve than docvalue fields (What's the difference between "store" and "doc_values" in field properties?). That should be okay.
But does it have any implications for heap utilization compared to doc_values? I don't search, sort, or aggregate on the binary field.


If you just need to retrieve the value, I think that store is better than doc_values.
doc_values is meant for sorting and aggregations.

But again, I'd store the binary content on a FS, a distributed FS like HDFS, CouchDB, or whatever, and just put a link to the datastore in Elasticsearch.

You'll pay a price in IO at some point when segment merging happens.

My 2 cents

There is a bug here. I opened

I agree that store would be a better fit than doc values for your use case. Doc values might look faster, but they might not scale as well when you have lots of data and want to request multiple fields per document. This is because stored fields keep all values for a given document in a single place, while doc values conceptually have one file per field.


A bugfix has been pushed and will be available in 6.3:

David and Adrien,

I tested with a larger dataset. Using stored fields instead of doc values solved the problem.

Thanks for the solution.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.