(My Elasticsearch version is 6.8.0)
I have a lot of DataSketches binary data stored in an index as a binary field. I need to deserialize the sketches and merge them to get a total estimate, so I wrote a DataSketches aggregation plugin for Elasticsearch to combine and calculate them. The results are good, but I found a strange thing about the binary data field.
I have to enable doc_values for my binary field, because I need to aggregate on it. But it seems to cost far too much disk space: for example, 10GB without [_source]. I put the same data into Druid, an open-source OLAP engine that can also do this job, and discovered that the data cost it only 50MB to store without source data. And if I disable doc_values on this field, it costs 200MB in Elasticsearch, which is acceptable.
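For reference, this is roughly the mapping I use for the cheap (no doc values) variant; the index and field names are just placeholders, and I mark the field as stored so the bytes can still be read back:

```json
PUT my_sketch_index
{
  "mappings": {
    "_doc": {
      "_source": { "enabled": false },
      "properties": {
        "sketch": {
          "type": "binary",
          "doc_values": false,
          "store": true
        }
      }
    }
  }
}
```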
I don't understand why doc values need so much space, and I don't want to give up using Elasticsearch to process DataSketches objects, because I found that Elasticsearch with my plugin performs better than Druid.
I realized that the doc values of the Elasticsearch binary data type are stored in Lucene as sorted binary doc values (see: https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/codecs/lucene70/Lucene70DocValuesFormat.html), to support sorting, aggregating and scripting.
And for Lucene's sorted doc values, "a mapping of ordinals to deduplicated terms is written as Prefix-compressed Binary, along with the per-document ordinals written using one of the numeric strategies above".
The ordinal space for my DataSketches objects would be very large, because the serialized sketches are essentially never duplicates of each other; every one is different, so deduplication saves nothing.
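To convince myself that prefix compression can't help much here either, I sketched a quick check in plain Python, with seeded random byte strings standing in for serialized sketches: when high-entropy blobs are sorted, adjacent terms share almost no common prefix, so the deduplicated-terms dictionary stays nearly as large as the raw data, and the per-document ordinals are pure overhead on top.

```python
import random

# Random 16-byte strings stand in for serialized sketches: high entropy,
# effectively unique, so sorting them yields almost no shared prefixes.
rng = random.Random(42)
blobs = sorted(rng.getrandbits(128).to_bytes(16, "big") for _ in range(10000))

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes two terms share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Bytes that prefix compression could save per term, measured against the
# previous term in sorted order (which is how the terms dictionary works).
saved = sum(common_prefix_len(a, b) for a, b in zip(blobs, blobs[1:]))
avg_saved = saved / (len(blobs) - 1)
print(f"avg shared prefix: {avg_saved:.2f} bytes out of 16")
```

With 10,000 values the average shared prefix is only a byte or two out of 16, so the bulk of each unique sketch has to be written out verbatim.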
What could I do to reduce the disk space cost of the binary data type? Or could I use another data type or approach to make this work without costing so much disk space in Elasticsearch? Please give me some advice, thank you very much.
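For context, the workaround I'm currently leaning toward: with doc_values disabled I can still read the sketch bytes back on the client side, since Elasticsearch returns binary fields as base64 strings. A minimal sketch of the decoding step (the hit dict below is a hand-made stand-in for a real search response, and the field name is mine):

```python
import base64

# Hand-made stand-in for one hit from a search response; in a real query
# this would come from the "hits" array returned by Elasticsearch.
hit = {
    "_id": "1",
    "_source": {
        # Binary fields come back base64-encoded.
        "sketch": base64.b64encode(b"\x01\x02\x03\x04serialized-sketch-bytes").decode("ascii"),
    },
}

# Decode back to the raw bytes that the DataSketches library would deserialize
# and feed into a union to produce the combined estimate.
raw = base64.b64decode(hit["_source"]["sketch"])
print(len(raw))
```

This moves the merge work off the cluster, though, which is exactly the performance advantage I'd like to keep inside my aggregation plugin.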