Can I contribute a DataSketches aggregation plugin for Elasticsearch? What should I provide?


I wrote a DataSketches aggregation plugin for Elasticsearch. I have tested it in several scenarios, and its performance meets my needs. I would like to share it and contribute it to the community so that Elasticsearch can support DataSketches.

Please let me briefly introduce my plugin:

es-datasketches aggregation plugin

This plugin provides a new aggregation type, which I call "hll_sketch". As the name suggests, it reads binary doc values, deserializes them as HllSketch objects, merges them, and finally computes the estimate of the merged sketch.
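The deserialize, merge, estimate sequence described above can be illustrated with a small self-contained sketch. This is not the plugin's code and not the DataSketches HllSketch: `ToyHll`, its byte layout, the register count, and the hash function are all assumptions made so the example runs with the JDK alone. It only shows the general idea that serialized sketches are merged register by register and the estimate is computed once at the end.

```java
import java.util.List;

/** Toy HyperLogLog; a stand-in for illustration, not the DataSketches HllSketch. */
final class ToyHll {
    private static final int LG_K = 12;           // assumed precision: 4096 registers
    private static final int K = 1 << LG_K;
    private final byte[] registers = new byte[K];

    /** 64-bit mixing function (SplitMix64 finalizer) used here as the hash. */
    private static long mix(long x) {
        long z = x + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    /** Adds one item: the top LG_K hash bits pick a register, the rest give the rank. */
    void update(long item) {
        long h = mix(item);
        int idx = (int) (h >>> (64 - LG_K));
        int rank = Math.min(Long.numberOfLeadingZeros(h << LG_K), 64 - LG_K) + 1;
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    /** Hypothetical serialized form: the raw register array. */
    byte[] toBytes() { return registers.clone(); }

    /** Rebuilds a sketch from the hypothetical serialized form. */
    static ToyHll fromBytes(byte[] bytes) {
        if (bytes.length != K) throw new IllegalArgumentException("unexpected sketch size");
        ToyHll s = new ToyHll();
        System.arraycopy(bytes, 0, s.registers, 0, K);
        return s;
    }

    /** Merging two sketches keeps the larger value of each register. */
    void merge(ToyHll other) {
        for (int i = 0; i < K; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    /** Standard HyperLogLog estimate with the small-range (linear counting) correction. */
    double estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / K);
        double raw = alpha * K * K / sum;
        if (raw <= 2.5 * K && zeros > 0) return K * Math.log((double) K / zeros);
        return raw;
    }

    /** Mirrors the aggregation: deserialize each binary value, merge, estimate once. */
    static double estimateUnion(List<byte[]> docValues) {
        ToyHll union = new ToyHll();
        for (byte[] v : docValues) union.merge(fromBytes(v));
        return union.estimate();
    }
}
```

Because merging only keeps per-register maxima, the order in which documents or shards are merged does not change the final estimate, which is what makes this kind of sketch a good fit for a distributed aggregation.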

What functions this plugin provides

  • Aggregates serialized HLL sketch binary data and returns the final estimate

  • No configuration required, can be used just after installation

How does my plugin work

The workflow of my plugin is standard:

  • My plugin registers an AggregationSpec with Elasticsearch at startup

  • When the user uses hll_sketch type of aggregation, my HllSketchAggregationBuilder is created

  • HllSketchAggregatorFactory constructs an HllSketchAggregator, which uses a LeafBucketCollector to collect doc values

  • Use InternalHllSketchAggregation for reduce phase

  • ResultReader gets the final result
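The collect and reduce steps listed above can be modeled without any Elasticsearch classes. The sketch below is only a simplified model under stated assumptions: `CollectReduceModel`, `Partial`, and the method names are hypothetical, and a plain `Set<Long>` stands in for a real sketch so the flow is easy to follow. It does not reproduce the plugin's classes or the Elasticsearch aggregation API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Simplified model of per-shard collection followed by a reduce step. */
final class CollectReduceModel {

    /** Shard-level partial result; a Set<Long> stands in for a real sketch. */
    static final class Partial {
        final Set<Long> state = new HashSet<>();

        void merge(Set<Long> other) {
            state.addAll(other);
        }
    }

    /** Collect phase: fold every matching document's value into one partial per shard. */
    static Partial collect(List<Set<Long>> shardDocValues) {
        Partial partial = new Partial();
        for (Set<Long> docValue : shardDocValues) {
            partial.merge(docValue);
        }
        return partial;
    }

    /** Reduce phase: merge the shard partials into a single state. */
    static Partial reduce(List<Partial> partials) {
        Partial reduced = new Partial();
        for (Partial p : partials) {
            reduced.merge(p.state);
        }
        return reduced;
    }

    /** Final step: turn the reduced state into the number returned to the user. */
    static long result(Partial reduced) {
        return reduced.state.size();
    }
}
```

The point of the split is that each shard only ships one compact partial result, and the final estimate is computed once, after all partials have been merged.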

How I tested this plugin

  • I used a private data set, which contains about 75,000,000 serialized hll sketches, distributed in 31-day indices.

  • I ran 40k random queries against a 10-node cluster, with each node running in a separate GCP container.

  • Each container has 32 CPU cores and 128 GB of memory.

  • My QPS testing framework is Locust. I kept increasing concurrency until the 95th-percentile (pct95) response time reached 1.5 s.

  • Compared with a Druid cluster on the same hardware configuration, Elasticsearch sustained about 2.7 times the throughput at a similar pct95 response time (Druid: 500 QPS; Elasticsearch with the plugin: 1350 QPS).
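The ramp-up procedure in the test setup (raise concurrency until pct95 reaches 1.5 s) can be sketched as follows. This is not the Locust configuration that was actually used: `P95Ramp`, `maxConcurrency`, the step size, and the `runAt` callback that returns measured latencies for a given concurrency are all hypothetical names introduced for illustration, and the percentile uses the nearest-rank definition as an assumption.

```java
import java.util.Arrays;
import java.util.function.IntFunction;

/** Illustrative ramp-up: raise concurrency until the 95th-percentile latency hits a limit. */
final class P95Ramp {

    /** Nearest-rank percentile of the given latencies, in milliseconds. */
    static double percentile(double[] latenciesMs, double pct) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /**
     * Increases concurrency in fixed steps and returns the highest level whose
     * p95 stayed below limitMs, or 0 if even the first step exceeded it.
     */
    static int maxConcurrency(IntFunction<double[]> runAt, int step, int maxUsers, double limitMs) {
        int best = 0;
        for (int users = step; users <= maxUsers; users += step) {
            double p95 = percentile(runAt.apply(users), 95.0);
            if (p95 >= limitMs) {
                break;
            }
            best = users;
        }
        return best;
    }
}
```

Reporting the throughput observed at the last concurrency level below the limit makes results from two systems comparable, since both are measured at roughly the same tail latency.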

What new features will be added in the future

  • Theta sketches support

  • Rounding type

  • New data types called "hllsketch" and "thetasketch" that support sketch updates

What do I need to provide to contribute to the community

  • Although this plugin is very simple to implement, I hope that everyone can use this function without having to reinvent the wheel

  • Do you have any other suggestions or requirements?

  • Where should the code be placed? Maybe I should create a repository on GitHub myself and share it with you?

  • In addition to the code, I should probably provide some benchmarks. Which metrics do we care about most?

Please discuss the above questions with me. Thank you.

Putting the code up somewhere is probably the best first step. GitHub or anything like that would be suitable.

Make sure you license it too, so people know if they can contribute, or what sort of usage limitations there are :slight_smile:

Okay, I will put the code and license on GitHub first, and then we can discuss what to do next :grinning:
