Cardinality computation

lokeshhctm · September 1, 2015, 5:44pm

We have avery large data set and hence we are storing data in elastic search with aggregated values with group by of a few columns. The problem now we are in, where we also need to do the cardinality.

Since we can not store each document separately and we store document with pre aggregation. How can we store some value along with aggression to figure out the cardinality later with this aggregated data.

Can we pre compute some value which can be stored with one data row and later that same column computation values returns cardinality based on filtered query but on aggregated data.

Igor_Motov · September 1, 2015, 6:26pm

Could you provide some sample data and desired output? It's not very clear to me how your data is structured and what you are trying to achieve there.

lokeshhctm · September 1, 2015, 7:56pm

We have several requests coming from different users and each user has different user ID.
we also return response for each incoming request. Since we have incoming more than 1 billion requests, we can not store each record in elastic search we actually do some aggregation of requests based on a few column like:
The request from FireFox browser like "50", from Internet explorer "100". so in elastic search data store like:

Firefox: 50
Internet Explorer: 100

But in this case we loose the users because we don't know number of unique user in these 150 requests. we have almost 600million users in every 3B requests. with each aggregated row, can we store some pre calculated number which can used to calculate the unique users with any combination like:

Overall across all aggregated rows or filter by either explorer.
We have all of these aggregated data by hour which allow user to select time time range as well.

Centarion · September 1, 2015, 8:55pm

I believe this is possible in the same way that storing a count of requests by browser is. I am not certain how you are doing your pre-aggregation, but if you can generate a count of unique users in addition to the counts of requests by browser then you can have a document like:

Firefox: 50
Internet Explorer: 100
Unique Users: 125

If you want to get the count of unique users by browser you can run a filtered query by browser and then do a sum aggregation on Unique Users, or do a terms aggregation on browser and then a sum Unique Users sub aggregation. To get the count in each time slice, use a sum aggregation on Unique Users as a sub aggregation in your date histogram aggregation.

Let me know if that answers your question or if I misunderstood (or got something wrong).

lokeshhctm · September 1, 2015, 9:17pm

But some might be in both browsers so in that case sum including both is wrong

Centarion · September 1, 2015, 9:39pm

Oops I do now realize I was mistaken to suggest a terms aggregation on browser, since you have each browser set up as a separate field, and each document likely contains requests from each browser.

The way I would implement this, if starting from scratch would be to group requests by browser in pre-aggregation, and then have a document like:

{
browser: firefox,
requests: 50,
unique_users: 35
}

and then send several documents per aggregation period.

With your current format, in order to get unique users by browser you are going to have to separate the count of unique users by browser in pre-aggregation and then store them in a field that is unique to the browser, either as a subfield for the browser fields or as an additional field like firefox_unique_users.

Using your current format, it will be hard to actually use filters to do separate browsers, since every doc contains data from each. However, you could still do a sum aggregation across the firefox_unique_users field or the firefox.users subfield.

lokeshhctm · September 2, 2015, 3:50am

Can't we store intermediate state of hyperloglog algorithm in another column? i mean bitset of users in that aggregated row and later while aggregation, elasticsearch or custom script read this bitsets based on filters and merge these bitsets to further computation in hyperloglog to finally figure out the cardinality.

Only this way i guess we need not to store all unique records in ES.

Centarion · September 2, 2015, 2:26pm

So, just to make sure I understand correctly, instead of computing the number of unique users in each slice, you want to use hyperLogLog++ to produce the intermediate hash and store that?

As far as I know this is possible (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_pre_computed_hashes). The only difference as far as I am aware is that you are using the hyperLogLog++ instead of a sum to compute your result.

I have no idea how the performance differs. I will say that I am currently using the cardinality aggregate to compute unique users, but in my data set every document is a single interaction. It works very well (I know it has constant time performance relative to number of documents).

I am still not certain how you plan to apply filters by browser without storing unique records for each browser/time-slice combo, or why you would want to, but I may just be missing the point. As I see it, if you want data specific to each browser, you are either going to have to use a field or sub-field for each browser to store that data or use a unique document per browser. If you do the first, you can easily get per-browser results by just aggregating counts on that unique field, if you do the second you can filter down to only matching documents and then compute. I may be missing something, but as far as I know filters only filter out documents, not parts of documents.

lokeshhctm · September 2, 2015, 2:34pm

You are correct, I am looking for something where i can store intermediate result of a aggregated bucket.
How can we compute that aggregated bucket value for distinct and store?

lokeshhctm · September 2, 2015, 2:40pm

The link you pasted doesn't describe the the single hash generation of multiple hashes of user ID and put that with each row. Can you please explain how can we do this?

Centarion · September 2, 2015, 2:50pm

I am not sure how this is going to help you accomplish your goal, though I do now realize the issue with using a sum on an already computed count. Yes, you will need to use a cardinality aggregate on the field in some form

This will not, as far as I understand, allow you to use filters to get results by browser without changing your data structure.

As far as how to pre-compute hashes of multiple userIds as an intermediate step to HyperLogLog++, I have no idea. I just pointed to that link to show that the cardinality agg can work on pre-indexed hashes. I don't think Elasticsearch supports this, you would have to dig in to the implementation of HyperLogLog++, or find a library in whatever language you are using (or maybe in a script on index time), to do that. On the other hand, I am not sure what the cardinality aggregation will do with such a hash of many values, it may or may not do what you want.

Honestly, I am not familiar with exactly how it works, especially not enough to make it do something like this. I wonder if the best solution isn't just to store the uniqueIds as an array field and ask for the cardinality of that, but I am not sure that works either.

lokeshhctm · September 2, 2015, 2:53pm

I will try to look in the algorithm to figure our how can be store something as intermediate result.

BTW thanks for the replies.

Centarion · September 2, 2015, 2:56pm

No problem. My one word of caution is that I would want to make sure that the cardinality aggregation will interpret your hash as an intermediate step composed of many unique values. From reading the thing I linked, it seems the reason to store hashes is to save computing them again at aggregation time.