Frequency capped sum


(Jean Logeart) #1

I have indexed documents of the form:

{
  "device_id": "abc",
  "views": 123,
  + other criteria
}

I can compute:

How can I compute the sum of the views such that a given device can only account for a maximum of n views?

For example, if my docs are:

{"device_id": "a", "views": 3, ...}
{"device_id": "a", "views": 4, ...}
{"device_id": "a", "views": 1, ...}
{"device_id": "b", "views": 2, ...}
{"device_id": "c", "views": 6, ...}

If n is 5, the result should be 12: 5 for a (capped, even though its total is 8), 2 for b, and 5 for c (capped, since its total is 6).
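To pin down the definition, here is a minimal plain-Python sketch of the exact computation on the example above. It is not an Elasticsearch query, and the function name `capped_sum` is only illustrative: it sums views per device, caps each device total at n, then adds up the capped totals.

```python
from collections import defaultdict


def capped_sum(docs, n):
    """Sum views over all docs, letting each device contribute at most n."""
    totals = defaultdict(int)
    for doc in docs:
        # First accumulate the uncapped total per device.
        totals[doc["device_id"]] += doc["views"]
    # Then cap each device total at n and add the capped totals together.
    return sum(min(total, n) for total in totals.values())


docs = [
    {"device_id": "a", "views": 3},
    {"device_id": "a", "views": 4},
    {"device_id": "a", "views": 1},
    {"device_id": "b", "views": 2},
    {"device_id": "c", "views": 6},
]

print(capped_sum(docs, 5))  # 12
```

The cap has to be applied after the per-device totals are known, which is why a plain sum over documents is not enough.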

My indices contain ~500,000 distinct devices.

The result does not need to be exact and can be approximate within reasonable bounds.
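One way an approximate answer could look, sketched in plain Python under my own assumptions (this is not an existing Elasticsearch feature, and `in_sample`, `approx_capped_sum`, and `keep_one_in` are hypothetical names): keep only a hash-selected fraction of devices, compute the capped sum on that sample, and scale it back up. Because a device is either fully in or fully out of the sample, the per-device cap stays correct.

```python
import hashlib
from collections import defaultdict


def in_sample(device_id, keep_one_in):
    """Deterministically keep roughly 1 out of keep_one_in devices."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % keep_one_in == 0


def approx_capped_sum(docs, n, keep_one_in):
    """Estimate the capped sum from a hash-based sample of devices."""
    totals = defaultdict(int)
    for doc in docs:
        # All docs of a sampled device are kept, so its total is complete.
        if in_sample(doc["device_id"], keep_one_in):
            totals[doc["device_id"]] += doc["views"]
    sampled = sum(min(total, n) for total in totals.values())
    # Scale up by the inverse of the sampling rate.
    return keep_one_in * sampled
```

With `keep_one_in=1` every device is kept and the result is exact. Larger values trade accuracy for less work, and the error should shrink as the number of sampled devices grows.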

I do not mind writing my own script using a combination of techniques (HLL, Count-Min Sketch, Bloom filters, MinHash, ...).

