My application stores long text (>32K) values in a binary field. The code works well in 5.2.1. However, when I upgraded to 6.2.2 I am seeing strange behavior: the blob of one document gets attached to other documents.
Even though I have used small text in the example, the actual length is > 32K. A keyword field with doc_values enabled throws the exception `DocValuesField is too large, must be <= 32766`.
Also, I don't search on this field. It's used only for display.
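For reference, a mapping along these lines reproduces the setup described above. This is a minimal sketch, assuming a 6.x-style mapping (with a `_doc` type); the field name `payload` is hypothetical:

```python
import json

# Hypothetical mapping: a binary field with doc_values enabled, used
# only to read the original (>32K) value back, never for searching.
mapping = {
    "mappings": {
        "_doc": {
            "_source": {"enabled": False},  # _source turned off to save space
            "properties": {
                "payload": {                # hypothetical field name
                    "type": "binary",
                    "doc_values": True,     # value read back via docvalue_fields
                }
            },
        }
    }
}

print(json.dumps(mapping, indent=2))
```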
Thanks David.
That's a workaround I thought of too. It will work for me on one index, which is small, but not on the other, which is significantly larger (TBs).
The large index's data never gets updated. Both indices have a lot of other fields that are indexed. I turn `_source` off in the mappings and rely only on doc values. Even with best_compression, we see a good amount of savings by turning off `_source`. I kept `_source` enabled in this example for two reasons:
1. To demonstrate that the document reaches Elasticsearch correctly, but something fails during indexing or while reading it back.
2. To verify the issue easily; it's not easy to spot differences in base64-encoded values.
Since my current approach works well in 5.2.1, I want to make sure I have not missed something in the mappings or in the query that is specific to 6.2.2. Secondly, I want to confirm that the binary field with doc values approach is not an anti-pattern and is not being considered for deprecation.
Also, if it turns out to be a 6.2.2 defect, it would be nice to know whether it can affect anything else.
TBH I'm not sure it's a very good idea to store binaries in Elasticsearch.
It's not binary like a JPG. The raw value is actually a string longer than 32K, and we perform text analysis on it. Since it's longer than 32K and I need to retrieve the complete string back, I wasn't able to store it as a keyword.
I use two fields: one of type text (used for searching) and one binary (for retrieving).
Coming back to your solution, `store: true` worked in a small test. I also tried storing a text string larger than 32K, and that worked too. I feel `store: true` is a better option than enabling `_source`. I will try it with the real data.
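A sketch of the two-field layout with `store: true`, plus the base64 round trip a binary field implies. Field names (`content`, `content_raw`) are illustrative, and the 6.x `_doc` mapping type is assumed:

```python
import base64
import json

# Hypothetical mapping: "content" (text, analyzed, for search) and
# "content_raw" (binary, store: true, for retrieval). A binary field
# expects a base64-encoded string.
mapping = {
    "mappings": {
        "_doc": {
            "_source": {"enabled": False},
            "properties": {
                "content": {"type": "text"},
                "content_raw": {"type": "binary", "store": True},
            },
        }
    }
}

# A value longer than 32K, base64-encoded before indexing.
raw = "x" * 40_000
encoded = base64.b64encode(raw.encode("utf-8")).decode("ascii")
doc = {"content": raw, "content_raw": encoded}

# On retrieval (stored_fields), decode back to the original string.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == raw
print(json.dumps(mapping))
```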
If you only need to retrieve the value, I think `store` is better than `doc_values`; doc values are meant for sorting and aggregations.
But again, I'd store the binary content on a filesystem, a distributed FS like HDFS, CouchDB, or whatever, and just put a link to the datastore in Elasticsearch.
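The suggested alternative amounts to indexing only a reference to the payload. A minimal sketch of such a document, with a hypothetical field name and URI:

```python
# Sketch: large payload lives in an external store; Elasticsearch
# holds the searchable text plus a pointer. The URI is hypothetical.
doc = {
    "content": "analyzed text used for search",
    "payload_url": "hdfs://cluster/payloads/doc-42.bin",  # hypothetical path
}

# At display time, the application would resolve payload_url against
# the external store instead of reading the blob from Elasticsearch.
print(doc["payload_url"])
```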
You'll pay a price in I/O at some point, when segment merging happens.
I agree that `store` would be a better fit than doc values for your use case. Doc values might look faster, but they might not scale as well when you have lots of data and want to request multiple fields per document. This is because stored fields keep all values for a given document in a single place, while doc values conceptually have one separate file per field.
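The layout difference can be illustrated with a toy model (this is not the actual Lucene on-disk format): stored fields are row-oriented, doc values are column-oriented, so fetching several fields of one document touches one place versus one structure per field.

```python
# Row-oriented, like stored fields: one entry per document,
# all of that document's field values together.
stored_fields = [
    {"title": "a", "body": "lorem", "tag": "x"},
    {"title": "b", "body": "ipsum", "tag": "y"},
]

# Column-oriented, like doc values: one column per field.
doc_values = {
    "title": ["a", "b"],
    "body": ["lorem", "ipsum"],
    "tag": ["x", "y"],
}

# Reading all fields of document 0: a single lookup with the
# row layout, but one lookup per field with the column layout.
doc0_from_stored = stored_fields[0]
doc0_from_docvalues = {field: col[0] for field, col in doc_values.items()}
assert doc0_from_stored == doc0_from_docvalues
```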