Cardinality Aggregation Hashes Only

Hi there,

We're using ES for web analytics purposes and so far, have loved the
experience. We create hourly indexes that contain only one type of "url"
document which has multiple metrics fields like "page_views". We've
recently begun looking into how to store more complex metrics that require
set arithmetic such as "unique views" or "unique visitors".

While the cardinality aggregation is
awesome, it seems like it'd be crazy for us to store all the user IDs that
we saw even for an hour on certain URLs as the number could grow to be very
large, very quickly. Just to clarify, this is the document schema I'm
saying would probably be silly:

"url": "",
"hour": "2014-05-31T03:00:00"
"user_ids": [
"metrics": {
"page_views": 100

Being fairly new to Lucene and ES, I don't really know what a massive (>
100K) user_ids array per document would do to ES/Lucene at indexing or
query time. In addition, although that structure would allow us to query
for hourly URLs that contained a certain user_id, it's probably beyond our
current scope. Precomputing the unique number per hour doesn't help us
when we want to perform aggregations at query time and know unique users
across a series of hours.
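
To make the non-additivity concrete, here's a tiny illustration (the user IDs are made up) of why precomputed per-hour unique counts can't simply be summed across hours:

```python
# Visitors seen in two consecutive hours; "u2" and "u3" appear in both.
hour_1 = {"u1", "u2", "u3"}
hour_2 = {"u2", "u3", "u4"}

naive_sum = len(hour_1) + len(hour_2)  # 6: double-counts the overlap
true_uniques = len(hour_1 | hour_2)    # 4: a set union is required

print(naive_sum, true_uniques)  # 6 4
```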

Toying around with two approaches in my head, and I wanted to get some feedback:

  1. Find a way to store only the HLL object in ES but without the actual
    array of distinct values. This way, we have the benefit of the cardinality
    aggregations, but without storing the full set of user_ids. Is there a way
    to do this?
  2. Store a binary blob which represents a custom HLL that we'll create
    and index. Create a new aggregation for a bitwise OR operation on that
    binary object which would allow us to union the HLLs in the aggregation and
    return that result.
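
If it helps frame option #2: for the standard dense HyperLogLog representation, unioning two sketches is a register-wise max rather than a literal bitwise OR (OR only applies to bitmap-style sketches). A minimal sketch, assuming both HLLs share the same precision (the register layout here is illustrative, not any particular library's serialization):

```python
def merge_hll(registers_a, registers_b):
    """Union of two HLL sketches: take the max of each register pair.

    Registers hold small integers (max observed leading-zero counts per
    bucket), so the merged sketch estimates the cardinality of the union.
    """
    if len(registers_a) != len(registers_b):
        raise ValueError("sketches must have the same precision")
    return [max(a, b) for a, b in zip(registers_a, registers_b)]

# Two toy 4-register sketches:
print(merge_hll([0, 3, 1, 2], [1, 2, 5, 0]))  # [1, 3, 5, 2]
```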

I lean a little more toward solution #2, only because we'd prefer to have
the HLL's accuracy tunable instead of relying on ES defaults.

Would love to hear some thoughts on how to solve this kind of issue.




I have a similar problem; would love to hear how you went about solving this.


Hi @photonic_world_2 . We ended up storing user IDs as an array but with a few important caveats in the mapping (note, we're still on Elasticsearch 1.7.2):

   "my_index": {
      "mappings": {
         "my_doc_type": {
            "properties": {
               "visitors": {
                  "type": "murmur3",
                  "index": "no",
                  "doc_values": true,
                  "fielddata": {
                     "format": "doc_values"

To walk you through these:

  • "type": "murmur3" ensures that values in this field pre-hashed using murmur3 and the result of that hash is stored in the visitors.hash field. This saves us from having to perform hashing at query time which significantly slows down cardinality aggregations.
  • "index": "no" specifies that we don't need this field searchable, so don't add it to the inverted index. If you need the ability to search for specific visitors in docs, you'll have to set this to not_analyzed, but be prepared to pay an indexing and disk penalty.
  • doc_values is critical for cardinality aggregations and is now the default for all properties in ES > 2.0.

One last thing: if you end up storing 1,000s of user IDs per doc, you'll also likely want to exclude this field from _source, as storing the JSON blob of a huge set of user IDs takes up a ton of disk. I think we were able to cut storage requirements in half or more by disabling source and the inverted index for this field.
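
In 1.x that exclusion is configured per type via `_source` excludes rather than per field. A hedged sketch of what the fragment might look like (the type name is a placeholder; parsed here just to show the shape):

```python
import json

# Hypothetical ES 1.x mapping fragment: keep doc_values for the murmur3
# field but drop the raw visitors array from the stored _source.
mapping = json.loads("""
{
  "my_doc_type": {
    "_source": {
      "excludes": ["visitors"]
    },
    "properties": {
      "visitors": {
        "type": "murmur3",
        "index": "no",
        "doc_values": true
      }
    }
  }
}
""")
print(mapping["my_doc_type"]["_source"]["excludes"])  # ['visitors']
```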

Hope that helps.

Thanks @msukmanowsky for the details. We have a similar mapping setup, but our requirement needs 2 cardinality aggregations on the same field. Even though these aggregations are on the same level (i.e. not sub-aggregations; see visitors under category_agg and visitors at the top level), I see the effects of combinatorial explosion. Trying to understand how and why :slight_smile:

"aggregations": {
"category_agg": {
"terms": {
"field": "category"
"aggregations": {
"visitors": {
"cardinality": {
"field": "visitors.hash",
"precision_threshold": 10000
"total_recipients": {
"value_count": {
"field": "visitors.hash"
"visitors": {
"cardinality": {
"field": "visitors.hash",
"precision_threshold": 10000