Currently I can find unique users within a date range, but how do I compare that with my whole corpus to find how many of them are new and how many are returning? Current query to find unique users within a time range
I think this is hard to calculate in Elasticsearch alone.
Run a terms aggregation twice: once over the whole corpus and once over a single day.
Then compare the two results in a programming language you are familiar with.
That is an easy way to do it.
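The client-side comparison could look roughly like the sketch below. It assumes each terms aggregation returned every user (the `size` was large enough), that buckets have the usual `key` / `doc_count` shape, and that the corpus holds no events later than the day being examined. Under those assumptions a user whose total count is larger than their count for the day must have been seen before. The function name and the sample data are made up for illustration.

```python
def split_new_and_returning(day_buckets, total_buckets):
    """Split the users seen on one day into new and returning users.

    Both arguments are lists of terms-aggregation buckets, for example
    {"key": "alice", "doc_count": 2}.
    Assumption: the corpus contains no events after the day in question.
    """
    total_counts = {b["key"]: b["doc_count"] for b in total_buckets}
    new_users, returning_users = set(), set()
    for bucket in day_buckets:
        user = bucket["key"]
        # More events overall than on this day means earlier activity exists.
        if total_counts.get(user, bucket["doc_count"]) > bucket["doc_count"]:
            returning_users.add(user)
        else:
            new_users.add(user)
    return new_users, returning_users


# Hypothetical bucket lists standing in for the two aggregation responses.
day = [{"key": "alice", "doc_count": 2}, {"key": "bob", "doc_count": 1}]
total = [
    {"key": "alice", "doc_count": 5},
    {"key": "bob", "doc_count": 1},
    {"key": "carol", "doc_count": 3},
]
new_users, returning_users = split_new_and_returning(day, total)
```

With a very large number of users, pulling every bucket to the client gets expensive, which is where this approach stops being practical.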
One way to perform this type of user-centric analysis is to create a separate entity-centric index. This allows you to spread out the computation and prepare the data over time rather than do it all at query time, which can be expensive and complicated. If designed correctly, it should also be possible to use this entity-centric index directly in Kibana, and as it will contain summarised and aggregated information, it will generally perform and scale quite well.
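As a rough illustration of what "entity-centric" means here, the sketch below folds raw events into one summary document per user. The field names (`user_id`, `timestamp`, `first_seen`, `last_seen`, `event_count`) are assumptions, not something from the original data, and timestamps are assumed to be ISO-8601 UTC strings in one consistent format so that plain string comparison orders them correctly. In practice the resulting documents would be written to their own index by a periodic batch job.

```python
def build_user_profiles(events):
    """Fold raw events into one summary document per user.

    Each event is a dict with hypothetical fields "user_id" and "timestamp".
    Returns a dict mapping user_id to its profile document.
    """
    profiles = {}
    for event in events:
        user, ts = event["user_id"], event["timestamp"]
        profile = profiles.setdefault(
            user,
            {"user_id": user, "first_seen": ts, "last_seen": ts, "event_count": 0},
        )
        # Same-format ISO-8601 strings compare correctly as plain strings.
        profile["first_seen"] = min(profile["first_seen"], ts)
        profile["last_seen"] = max(profile["last_seen"], ts)
        profile["event_count"] += 1
    return profiles


# Made-up events for illustration only.
events = [
    {"user_id": "alice", "timestamp": "2024-01-03T10:00:00Z"},
    {"user_id": "bob", "timestamp": "2024-01-02T08:30:00Z"},
    {"user_id": "alice", "timestamp": "2024-01-01T09:00:00Z"},
]
profiles = build_user_profiles(events)
```

Once each user has a `first_seen` value, "new versus returning" becomes a simple range filter on the profile documents instead of a comparison across the raw events.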
It depends on the scale of your problem. Let's say you're talking RTB scale; then entity-centric indexes and some batch processing are your main options.
For processing billions of signals we have the following rough breakdown:
Trail Collection (Audit log) with raw data and minimal indexed fields
CurrentProfile - a sliding time window index with verbose aggregation of pretty much everything we may care about
DeviceIdMapping - ID to ID mapping. Gives you cheap existence check among other things.
Profile - That is your longer lived
If you have your profile object index, then you can query and aggregate by creation timestamp, last action timestamp, or whatever makes sense to the app.
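A query against such a profile index might be shaped like the sketch below. It only builds the request body as a Python dict and does not talk to a cluster. The field names `first_seen` and `last_seen` are hypothetical, and the "returning" count is only accurate if the range ends at the most recent data, because a single `last_seen` value cannot tell whether a user who was also active later was active inside the range.

```python
def new_vs_returning_query(start, end):
    """Build a search body that counts new and returning user profiles.

    Assumes one profile document per user with hypothetical date fields
    "first_seen" and "last_seen". The range is [start, end).
    """
    in_range = {"gte": start, "lt": end}
    return {
        "size": 0,  # only the aggregation counts are needed, no hits
        "aggs": {
            # Profiles created inside the range are new users.
            "new_users": {"filter": {"range": {"first_seen": in_range}}},
            # Profiles created earlier and last active inside the range.
            "returning_users": {
                "filter": {
                    "bool": {
                        "filter": [
                            {"range": {"first_seen": {"lt": start}}},
                            {"range": {"last_seen": in_range}},
                        ]
                    }
                }
            },
        },
    }


body = new_vs_returning_query("2024-01-01", "2024-01-08")
```

Each `filter` aggregation returns a `doc_count`, and because there is one document per user, that count is the number of users in each group.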