Can I contribute a DataSketches aggregation plugin for Elasticsearch? What should I provide?

nooneuse · June 17, 2020, 5:54am

Hello,

I wrote a DataSketches aggregation plugin for Elasticsearch. I have tested it in several scenarios, and its performance meets my needs. I would like to share it and contribute it to the community so that Elasticsearch can support DataSketches.

Please let me briefly introduce my plugin:

es-datasketches aggregation plugin

This plugin provides a new aggregation type, which I call "hll_sketch". As the name suggests, it reads binary doc values, deserializes them as HllSketch objects, merges them, and finally computes the estimate from the merged sketch.

What this plugin provides

Aggregates serialized HLL sketch binary data and returns the final estimate
No configuration required; it can be used right after installation

How my plugin works

The workflow of my plugin is standard:

My plugin registers an AggregationSpec with Elasticsearch at startup
When the user requests an aggregation of type hll_sketch, my HllSketchAggregationBuilder is created
HllSketchAggregatorFactory constructs an HllSketchAggregator, which uses a LeafBucketCollector to collect doc values
InternalHllSketchAggregation handles the reduce phase
ResultReader gets the final result
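
To make the collect and reduce steps above concrete, here is a minimal sketch in plain Java. It is not the plugin's code and it does not use the DataSketches library: ToyHll is a hypothetical, simplified HyperLogLog written only for this illustration, and the names collect, reduce and sketchOf, as well as the precision of 12, are assumptions. The point is only that each shard merges its deserialized sketches into a partial result and the reduce phase merges the partials, so the grouping does not change the outcome.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Toy HyperLogLog for illustration only; NOT the DataSketches HllSketch class. */
final class ToyHll {
    static final int P = 12;        // assumed precision: 2^12 registers
    static final int M = 1 << P;
    final byte[] registers = new byte[M];

    /** FNV-1a over UTF-8 bytes, then a 64-bit finalizer mix to spread the bits. */
    static long hash(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        return h ^ (h >>> 33);
    }

    void update(String item) {
        long h = hash(item);
        int idx = (int) (h >>> (64 - P));   // top P bits pick a register
        int rank = Math.min(Long.numberOfLeadingZeros(h << P) + 1, 64 - P + 1);
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    byte[] toBytes() { return registers.clone(); }   // stand-in for a binary doc value

    static ToyHll fromBytes(byte[] bytes) {
        ToyHll s = new ToyHll();
        System.arraycopy(bytes, 0, s.registers, 0, M);
        return s;
    }

    /** Union: keep the per-register maximum. Order and grouping do not matter. */
    void merge(ToyHll other) {
        for (int i = 0; i < M; i++) {
            if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
        }
    }

    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double raw = (0.7213 / (1 + 1.079 / M)) * M * M / sum;
        // Small-range correction (linear counting) from the original HLL paper.
        return (raw <= 2.5 * M && zeros > 0) ? M * Math.log((double) M / zeros) : raw;
    }
}

final class HllSketchWorkflowDemo {
    /** Collect phase on one shard: deserialize each doc value and merge it. */
    static ToyHll collect(List<byte[]> docValues) {
        ToyHll partial = new ToyHll();
        for (byte[] v : docValues) partial.merge(ToyHll.fromBytes(v));
        return partial;
    }

    /** Reduce phase on the coordinating node: merge the per-shard partials. */
    static ToyHll reduce(List<ToyHll> partials) {
        ToyHll result = new ToyHll();
        for (ToyHll p : partials) result.merge(p);
        return result;
    }

    /** Builds one serialized sketch covering the hypothetical ids [from, to). */
    static byte[] sketchOf(int from, int to) {
        ToyHll s = new ToyHll();
        for (int i = from; i < to; i++) s.update("user-" + i);
        return s.toBytes();
    }

    public static void main(String[] args) {
        List<byte[]> shard1 = List.of(sketchOf(0, 15000), sketchOf(10000, 25000));
        List<byte[]> shard2 = List.of(sketchOf(20000, 30000));
        ToyHll total = reduce(List.of(collect(shard1), collect(shard2)));
        System.out.printf("estimate = %.0f (true distinct = 30000)%n", total.estimate());
    }
}
```

Because the merge is a per-register maximum, merging the shard partials yields exactly the same registers as a single sketch built over all the ids, which is the property that lets the estimate be computed once, after the reduce phase.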

How I tested this plugin

I used a private data set containing about 75,000,000 serialized HLL sketches, distributed in 31-day indices.
I ran 40k random queries against a cluster of 10 nodes, with each node installed in a separate GCP container.
Each container has 32 CPU cores and 128 GB of memory.
My QPS testing framework is Locust. I kept increasing concurrency until the p95 response time reached 1.5 s.
Compared with a Druid cluster on the same hardware configuration, Elasticsearch showed a performance improvement of about 150% at a similar p95 response time (Druid: 500 QPS; Elasticsearch with the plugin: 1350 QPS).
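
The stop criterion used in this test (raise concurrency until the p95 response time reaches 1.5 s) can be sketched roughly as below. This is not Locust code and not my actual test harness: the class P95Ramp, the runStep callback, and the start/step/max parameters are hypothetical names chosen for illustration, and the percentile uses the simple nearest-rank definition, which may differ from what a given tool reports.

```java
import java.util.Arrays;
import java.util.function.IntFunction;

final class P95Ramp {
    /** Nearest-rank 95th percentile of the latency samples (milliseconds). */
    static double p95(double[] latenciesMs) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(0.95 * n), computed with integers to avoid rounding issues.
        int rank = (95 * sorted.length + 99) / 100;
        return sorted[Math.max(rank, 1) - 1];
    }

    /**
     * Steps concurrency from start to max and returns the last level whose
     * p95 stayed below limitMs, or 0 if even the first level reached the limit.
     * runStep is a hypothetical callback that runs one load step at the given
     * concurrency and returns the observed latencies in milliseconds.
     */
    static int maxConcurrency(IntFunction<double[]> runStep,
                              int start, int step, int max, double limitMs) {
        int best = 0;
        for (int c = start; c <= max; c += step) {
            if (p95(runStep.apply(c)) >= limitMs) break;   // p95 reached the limit: stop ramping
            best = c;
        }
        return best;
    }
}
```

In a real run, runStep would drive the load generator for a fixed duration at each level and collect the response times; the throughput measured at the returned level is the QPS figure to compare between systems.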

What new features will be added in the future

Theta sketch support
Rounding type
New data types called "hllsketch" and "thetasketch" that support sketch updates

What do I need to provide to contribute to the community

Although this plugin is very simple to implement, I hope that everyone can use this functionality without having to reinvent the wheel.
Do you have any other suggestions or requirements?
Where should the code be placed? Should I create a repository on GitHub myself and share it with you?
In addition to the code, I should probably provide some benchmarks. Which metrics do you care about most?

Please discuss the above questions with me. Thank you.

warkolm · June 17, 2020, 6:04am

Putting the code up somewhere is probably the best first step. GitHub or anything like that would be suitable.

Make sure you license it too, so people know whether they can contribute and what usage limitations there are.

nooneuse · June 17, 2020, 6:08am

Okay, I will put the code and license on GitHub first, and then we can discuss what to do next.

system · July 15, 2020, 6:08am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
