Hello,
I wrote a datasketch aggregation plugin for Elasticsearch. I have tested it in several scenarios, and its performance meets my needs. I would like to share it and contribute it to the community so that Elasticsearch can support DataSketches.
Please let me briefly introduce my plugin:
es-datasketches aggregation plugin
This plugin provides a new aggregation type, which I call "hll_sketch". As the name suggests, it reads binary doc values, deserializes them as HllSketch objects, merges them, and finally computes the estimate from the merged sketch.
What this plugin provides
- Aggregates serialized HLL sketch binary data and returns the final estimate.
- Requires no configuration; it can be used right after installation.
How my plugin works
The workflow of my plugin is standard:
- My plugin registers an `AggregationSpec` with Elasticsearch at startup.
- When the user requests an aggregation of type hll_sketch, my `HllSketchAggregationBuilder` is created.
- Through `HllSketchAggregatorFactory`, an `HllSketchAggregator` is constructed; it uses a `LeafBucketCollector` to collect doc values.
- `InternalHllSketchAggregation` is used for the reduce phase.
- `ResultReader` gets the final result.
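The collect and reduce steps above can be illustrated with a small, self-contained sketch. This is not the plugin's code and it does not use the DataSketches `HllSketch` class: it is a toy HyperLogLog written only to show the data flow, under the assumption that each document stores one serialized sketch. The class name `HllFlowSketch`, the register count, and the methods `collectShard` and `reduce` are all hypothetical stand-ins for the per-shard aggregator and the reduce phase.

```java
import java.util.List;

/** Toy HyperLogLog used only to illustrate collect -> reduce -> estimate. */
final class HllFlowSketch {
    static final int LG_K = 12;        // assumption: 2^12 = 4096 registers
    static final int K = 1 << LG_K;

    /** 64-bit mixer (MurmurHash3 finalizer) so sequential ids spread evenly. */
    static long mix(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    /** Builds the "serialized sketch" a document would store: one byte per register. */
    static byte[] sketchOf(long fromInclusive, long toExclusive) {
        byte[] regs = new byte[K];
        for (long item = fromInclusive; item < toExclusive; item++) {
            long h = mix(item);
            int idx = (int) (h >>> (64 - LG_K));   // top bits pick the register
            long rest = h << LG_K;                 // remaining bits give the rank
            int rank = rest == 0 ? 64 - LG_K + 1 : Long.numberOfLeadingZeros(rest) + 1;
            if (rank > regs[idx]) regs[idx] = (byte) rank;
        }
        return regs;
    }

    /** Union of two sketches: register-wise maximum, written into target. */
    static void mergeInto(byte[] target, byte[] other) {
        for (int i = 0; i < K; i++) {
            if (other[i] > target[i]) target[i] = other[i];
        }
    }

    /** Collect-phase stand-in: fold every document's sketch on one shard into a partial. */
    static byte[] collectShard(List<byte[]> docValues) {
        byte[] partial = new byte[K];
        for (byte[] doc : docValues) mergeInto(partial, doc);
        return partial;
    }

    /** Reduce-phase stand-in: fold the per-shard partials into one sketch. */
    static byte[] reduce(List<byte[]> shardPartials) {
        byte[] merged = new byte[K];
        for (byte[] p : shardPartials) mergeInto(merged, p);
        return merged;
    }

    /** Raw HyperLogLog estimate with the usual small-range (linear counting) correction. */
    static double estimate(byte[] regs) {
        double sum = 0;
        int zeros = 0;
        for (byte r : regs) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / K);
        double raw = alpha * K * (double) K / sum;
        if (raw <= 2.5 * K && zeros > 0) return K * Math.log((double) K / zeros);
        return raw;
    }

    public static void main(String[] args) {
        // Two shards, two documents each; the id ranges overlap on purpose.
        List<byte[]> shard0 = List.of(sketchOf(1, 30_001), sketchOf(20_001, 50_001));
        List<byte[]> shard1 = List.of(sketchOf(40_001, 80_001), sketchOf(70_001, 100_001));
        byte[] merged = reduce(List.of(collectShard(shard0), collectShard(shard1)));
        // Prints an approximation of the 100,000 distinct ids covered by the four documents.
        System.out.println("estimate = " + Math.round(estimate(merged)));
    }
}
```

Because merging is a register-wise maximum, the result does not depend on how documents are split across shards or on the order of merging, which is what makes a per-shard collect followed by a cross-shard reduce safe for this kind of sketch.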
How I tested this plugin
- I used a private data set containing about 75,000,000 serialized HLL sketches, spread across 31 days of indices.
- I ran 40k random queries against a 10-node cluster, with each node running in a separate GCP container.
- Each container has 32 CPU cores and 128 GB of memory.
- My QPS testing framework is Locust. I kept increasing concurrency until the 95th-percentile (p95) response time reached 1.5 s.
- Compared with a Druid cluster on the same hardware configuration, Elasticsearch with the plugin reached about 2.7 times the throughput at a similar p95 response time (Druid: 500 QPS; Elasticsearch with the plugin: 1,350 QPS).
What new features will be added in the future
- Theta sketch support
- Rounding type
- New data types called "hllsketch" and "thetasketch" that support sketch updates
What do I need to provide to contribute to the community
- Although this plugin is very simple to implement, I hope everyone can use this functionality without having to reinvent the wheel.
- Do you have any other suggestions or requirements?
- Where should the code live? Should I create a GitHub repository myself and hand it over to you?
- In addition to the code, I should also provide some benchmarks. Which metrics do we care about most?
Please discuss the above questions with me. Thank you.

