(My Elasticsearch version is 6.8.0)
I have a lot of DataSketches binary data stored in an index as a binary field. I need to deserialize the sketches and merge them to get a total estimate, so I wrote a DataSketches aggregation plugin for Elasticsearch to combine and calculate them. The results are good, but I found a strange thing about the binary data field.
I have to enable doc_values for my binary field, because I need to aggregate on it. But it seems to cost far too much disk space: for example, 10GB without [_source]. I put the same data into Druid, an open-source OLAP engine that can also do this job, and discovered that the data cost it only 50MB to store without source data. And if I disable doc_values on this field, it costs 200MB in Elasticsearch, which is acceptable.
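For reference, this is roughly the mapping I use for the cheap (no doc values) variant; the index and field names are just placeholders, and I mark the field as stored so the bytes can still be read back:

```json
PUT my_sketch_index
{
  "mappings": {
    "_doc": {
      "_source": { "enabled": false },
      "properties": {
        "sketch": {
          "type": "binary",
          "doc_values": false,
          "store": true
        }
      }
    }
  }
}
```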
I don't understand why doc values need so much space, and I don't want to give up using Elasticsearch to process DataSketches objects, because I found that Elasticsearch with my plugin performs better than Druid.
I realized that the doc values of the Elasticsearch binary data type are stored in Lucene as sorted binary doc values (see: https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/codecs/lucene70/Lucene70DocValuesFormat.html), to support sorting, aggregating and scripting.
And for Lucene's sorted doc values, "a mapping of ordinals to deduplicated terms is written as Prefix-compressed Binary, along with the per-document ordinals written using one of the numeric strategies above".
The ordinal space for my DataSketches objects would be very large, because the serialized sketches are essentially never duplicates of each other; every one is different, so deduplication saves nothing.
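To convince myself that prefix compression can't help much here either, I sketched a quick check in plain Python, with seeded random byte strings standing in for serialized sketches: when high-entropy blobs are sorted, adjacent terms share almost no common prefix, so the deduplicated-terms dictionary stays nearly as large as the raw data, and the per-document ordinals are pure overhead on top.

```python
import random

# Random 16-byte strings stand in for serialized sketches: high entropy,
# effectively unique, so sorting them yields almost no shared prefixes.
rng = random.Random(42)
blobs = sorted(rng.getrandbits(128).to_bytes(16, "big") for _ in range(10000))

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes two terms share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Bytes that prefix compression could save per term, measured against the
# previous term in sorted order (which is how the terms dictionary works).
saved = sum(common_prefix_len(a, b) for a, b in zip(blobs, blobs[1:]))
avg_saved = saved / (len(blobs) - 1)
print(f"avg shared prefix: {avg_saved:.2f} bytes out of 16")
```

With 10,000 values the average shared prefix is only a byte or two out of 16, so the bulk of each unique sketch has to be written out verbatim.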
What could I do to reduce the disk space cost of the binary data type? Or could I use another data type or approach to make this work without costing so much disk space in Elasticsearch? Please give me some advice, thank you very much.
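For context, the workaround I'm currently leaning toward: with doc_values disabled I can still read the sketch bytes back on the client side, since Elasticsearch returns binary fields as base64 strings. A minimal sketch of the decoding step (the hit dict below is a hand-made stand-in for a real search response, and the field name is mine):

```python
import base64

# Hand-made stand-in for one hit from a search response; in a real query
# this would come from the "hits" array returned by Elasticsearch.
hit = {
    "_id": "1",
    "_source": {
        # Binary fields come back base64-encoded.
        "sketch": base64.b64encode(b"\x01\x02\x03\x04serialized-sketch-bytes").decode("ascii"),
    },
}

# Decode back to the raw bytes that the DataSketches library would deserialize
# and feed into a union to produce the combined estimate.
raw = base64.b64decode(hit["_source"]["sketch"])
print(len(raw))
```

This moves the merge work off the cluster, though, which is exactly the performance advantage I'd like to keep inside my aggregation plugin.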