Hi there,
We're using ES for web analytics and, so far, have loved the
experience. We create hourly indexes that contain only one type of "url"
document which has multiple metrics fields like "page_views". We've
recently begun looking into how to store more complex metrics that require
set arithmetic such as "unique views" or "unique visitors".
While the cardinality aggregation
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html)
is awesome, it seems like it'd be crazy for us to store all the user IDs
that we saw on certain URLs, even for an hour, as the number could grow
very large, very quickly. Just to clarify, this is the document schema I'm
saying would probably be silly:
{
  "url": "http://example.com/",
  "hour": "2014-05-31T03:00:00",
  "user_ids": [
    "e4c88ac4-ccc7-49e0-9a2e-34ab24420d2b",
    "252d0f6e-2e9d-487d-95f4-ac3d53cce977",
    "90b5d83b-44d6-4462-9f4b-3ab41e75143e",
    "b6c9d0f8-5e4f-4308-92eb-be68d7b06d78",
    "7a097ac1-7410-4918-a780-0020197d0b14"
  ],
  "metrics": {
    "page_views": 100
  }
}
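For context, this is roughly the query we'd run against that schema to get
unique visitors for one URL over a range of hours. It's only a sketch: the
function name and the "unique_visitors" aggregation name are made up for
illustration, and it assumes "url" is mapped so that an exact term match
works (e.g. not analyzed).

```python
import json


def unique_visitors_query(url, start, end):
    """Build a search body counting distinct user_ids for one URL in [start, end)."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {
            "bool": {
                "must": [
                    # Assumes "url" supports exact matching (hypothetical mapping).
                    {"term": {"url": url}},
                    {"range": {"hour": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            # "unique_visitors" is just a label chosen for this example.
            "unique_visitors": {"cardinality": {"field": "user_ids"}}
        },
    }


body = unique_visitors_query(
    "http://example.com/", "2014-05-31T00:00:00", "2014-05-31T06:00:00"
)
print(json.dumps(body, indent=2))
```

The body would be sent to the search endpoint across the relevant hourly
indexes; the catch is that it only works if every user ID is stored.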
Being fairly new to Lucene and ES, I don't really know what a massive (>
100K) user_ids array per document would do to ES/Lucene at indexing or
query time. In addition, although that structure would allow us to query
for hourly URLs that contained a certain user_id, that capability is
probably beyond our current scope. Precomputing the unique count per hour
doesn't help us either: per-hour unique counts can't simply be summed
(a user seen in two different hours would be counted twice), so they
can't give us unique users across a series of hours at query time.
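A tiny illustration of why precomputed hourly counts don't add up, using
made-up visitor names and plain Python sets in place of real data:

```python
# Hypothetical per-hour visitor sets for one URL (illustrative data only).
hourly_visitors = {
    "2014-05-31T03:00:00": {"alice", "bob", "carol"},
    "2014-05-31T04:00:00": {"bob", "carol", "dave"},
}


def summed_hourly_uniques(hours):
    """Add up precomputed per-hour unique counts (double-counts repeat visitors)."""
    return sum(len(users) for users in hours.values())


def true_uniques(hours):
    """Union the underlying sets, which is what a cross-hour query needs."""
    return len(set().union(*hours.values()))


print(summed_hourly_uniques(hourly_visitors))  # 6
print(true_uniques(hourly_visitors))           # 4
```

Whatever we store per hour therefore has to be mergeable (a set, or a
sketch of one), not just a number.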
I've been toying with two approaches in my head and wanted to get some
feedback:
- Find a way to store only the HLL object in ES but without the actual
array of distinct values. This way, we have the benefit of the cardinality
aggregations, but without storing the full set of user_ids. Is there a way
to do this?
- Store a binary blob which represents a custom HLL that we'll create
and index. Create a new aggregation that merges those binary objects (for
HLL registers this is a register-wise max, the analogue of a bitwise OR on
a bitmap), which would allow us to union the HLLs in the aggregation and
return that result.
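To show what the second approach would involve, here is a toy HyperLogLog
in Python. It is a sketch under my own assumptions, not ES's internal
implementation: the class name TinyHLL, the SHA-1 based hash, and the
default precision are all arbitrary choices. One detail worth noting is
that the union of two HLLs is a register-wise max rather than a literal
bitwise OR.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog: 2**p one-byte registers, each holding a max 'rank'."""

    def __init__(self, p=12):
        if p < 7:
            raise ValueError("the alpha constant used below assumes p >= 7")
        self.p = p                      # precision: more registers, less error
        self.m = 1 << p                 # number of registers
        self.registers = bytearray(self.m)

    def add(self, value):
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Union of two sketches: take the max of each pair of registers."""
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        merged = TinyHLL(self.p)
        merged.registers = bytearray(
            max(a, b) for a, b in zip(self.registers, other.registers)
        )
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting on empty registers).
            return self.m * math.log(self.m / zeros)
        return raw


# Two overlapping "hours" of made-up user IDs: 60,000 each, 100,000 distinct.
hour_a, hour_b = TinyHLL(), TinyHLL()
for i in range(0, 60000):
    hour_a.add("user-%d" % i)
for i in range(40000, 100000):
    hour_b.add("user-%d" % i)

union = hour_a.merge(hour_b)
print(round(union.count()))  # an estimate near 100000, not 120000
```

Here bytes(hll.registers) is the blob that would be stored per document
(4 KB at p=12, regardless of how many users were seen), and p is the
accuracy knob we'd like to control.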
I lean a little bit more toward solution #2, only because we'd prefer to
have the HLL's accuracy tunable instead of relying on ES defaults.
Would love to hear some thoughts on how to solve this kind of issue.
Mike