Compare aggregation results in Elasticsearch to find repeat users


Basically I want to do something like this

Currently I can find unique users within a date range, but how do I compare them against my whole corpus to find how many of them are new and how many are repeat users? This is my current query to find unique users within a time range:

{
  "from": 0,
  "size": 0,
  "query": {
    "filtered": {
      "query": { "query_string": { "analyze_wildcard": true, "query": "*" } },
      "filter": {
        "bool": {
          "must": [
            { "range": { "date_time": { "lte": 1468348199000, "format": "epoch_millis", "gte": 1468261800000 } } }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "cardinality_device_id": { "terms": { "field": "device_id" } }
  },
  "fields": [ "*", "_source" ]
}

Any help will be appreciated. Thanks


I think this is hard to calculate in Elasticsearch alone.
Run the terms aggregation twice: once over the whole corpus and once for the single day.
Then compare the two results in whatever programming language you are familiar with.
That is the easy way to do it.
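
A minimal sketch of that client-side comparison, assuming you have already pulled the "buckets" lists out of the two terms aggregation responses. The function name and the sample device IDs are made up for illustration. It treats a device as a repeat user when the whole corpus holds more documents for it than the day alone does, which also counts activity after the day unless the total query is capped at the end of the window. Note that a terms aggregation returns only the top 10 buckets by default, so both queries would need a larger "size" for this to be complete.

```python
def split_new_and_repeat(day_buckets, total_buckets):
    """Classify the device IDs seen in the day as new or repeat.

    Both arguments are 'buckets' lists from a terms aggregation response,
    i.e. lists of {"key": <device_id>, "doc_count": <int>}.
    """
    total_counts = {b["key"]: b["doc_count"] for b in total_buckets}
    new_ids, repeat_ids = [], []
    for bucket in day_buckets:
        device_id = bucket["key"]
        # More documents in the corpus than in the day alone means the
        # device was also active outside the day.
        if total_counts.get(device_id, 0) > bucket["doc_count"]:
            repeat_ids.append(device_id)
        else:
            new_ids.append(device_id)
    return new_ids, repeat_ids


# Hypothetical sample buckets, not real data.
day = [{"key": "dev-a", "doc_count": 3}, {"key": "dev-b", "doc_count": 1}]
total = [
    {"key": "dev-a", "doc_count": 10},
    {"key": "dev-b", "doc_count": 1},
    {"key": "dev-c", "doc_count": 4},
]
new_ids, repeat_ids = split_new_and_repeat(day, total)
print("new:", new_ids, "repeat:", repeat_ids)
```
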

What if the data set is too large?

One way to perform this type of user-centric analysis is to create a separate entity-centric index. This allows you to spread out the computation and prepare the data over time rather than do it all at query time, which can be expensive and complicated. If designed correctly, it should also be possible to use this entity-centric index directly in Kibana, and as it will contain summarised and aggregated information, it will generally perform and scale quite well.
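
As a rough illustration of the idea, the sketch below folds raw events into one summary document per device. The device_id and date_time fields come from the query in the question; the profile fields (first_seen, last_seen, event_count) and the function name are assumptions, and in practice each profile would be written back to the entity-centric index by a periodic job rather than kept in a dict.

```python
def update_profiles(profiles, events):
    """Fold a batch of raw events into per-device summary documents.

    profiles: dict mapping device_id -> profile document (updated in place)
    events:   iterable of {"device_id": ..., "date_time": <epoch millis>}
    """
    for event in events:
        device_id = event["device_id"]
        ts = event["date_time"]
        profile = profiles.get(device_id)
        if profile is None:
            # First time this device is seen: create its profile.
            profiles[device_id] = {
                "device_id": device_id,
                "first_seen": ts,
                "last_seen": ts,
                "event_count": 1,
            }
        else:
            # Known device: widen its activity range and bump the counter.
            profile["first_seen"] = min(profile["first_seen"], ts)
            profile["last_seen"] = max(profile["last_seen"], ts)
            profile["event_count"] += 1
    return profiles


# Hypothetical events, processed in two batches as a scheduled job might.
profiles = {}
update_profiles(profiles, [
    {"device_id": "dev-a", "date_time": 1468000000000},
    {"device_id": "dev-b", "date_time": 1468270000000},
])
update_profiles(profiles, [
    {"device_id": "dev-a", "date_time": 1468300000000},
])
print(profiles["dev-a"])
```

Because each batch only touches the profiles of the devices it contains, the work is spread over time instead of being repeated at query time.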

It depends on the scale of your problem. Let's say you're talking RTB scale; then entity-centric indexes and some batch processing are your main options.

For processing billions of signals we have the following rough breakdown:

  • Trail Collection (Audit log) with raw data and minimal indexed fields
  • CurrentProfile - a sliding time window index with verbose aggregation of pretty much everything we may care about
  • DeviceIdMapping - ID to ID mapping. Gives you cheap existence check among other things.
  • Profile - That is your longer lived

If you have your profile object index, then you can query & aggregate by creation timestamp, last action timestamp, or whatever makes sense to the app.
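
For example, with one profile document per device, the new versus returning split for a time window could be a single request. The sketch below only builds the request body; it assumes hypothetical first_seen and last_seen fields (epoch millis) on each profile and uses a filters aggregation with two named buckets. It has not been run against a cluster, so treat it as a starting point rather than a tested query.

```python
import json


def new_vs_returning_body(window_start, window_end):
    """Build a search body counting new and returning devices in a window.

    first_seen / last_seen are assumed profile fields, not part of the
    original mapping in the question.
    """
    in_window = {"gte": window_start, "lte": window_end}
    return {
        "size": 0,  # only the aggregation counts are needed
        "aggs": {
            "users": {
                "filters": {
                    "filters": {
                        # Created inside the window: a new device.
                        "new": {"range": {"first_seen": in_window}},
                        # Active inside the window but created before it.
                        "returning": {
                            "bool": {
                                "must": [
                                    {"range": {"last_seen": in_window}},
                                    {"range": {"first_seen": {"lt": window_start}}},
                                ]
                            }
                        },
                    }
                }
            }
        },
    }


# Same window as the range filter in the question.
body = new_vs_returning_body(1468261800000, 1468348199000)
print(json.dumps(body, indent=2))
```

The doc_count of each named bucket in the response would then be the number of new and returning devices for that window.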