Cardinality: optimisation and memory footprint

Hello,

I would like to share two points: one on cardinality aggregations that are computed needlessly, and one on how to measure the impact of the precision_threshold value.

1. Cardinality aggregation could be skipped and return 1 directly

Currently (v6.2.1), when running a cardinality aggregation over buckets that impose a cardinality of 1 (because of a term query or a parent aggregation), we can see that HyperLogLogPlusPlus objects are instantiated.
I interpret this as the cardinality computation actually being performed, whereas an optimisation could skip it and return 1 directly.

Example:

  • I have an index called book_index, with the following data (keyword fields only):
    • {"title":"Fondation", "author":"Asimov"},
    • {"title":"Fondation et Empire", "author":"Asimov"},
    • {"title":"Seconde Fondation", "author":"Asimov"},
    • {"title":"1984", "author":"Orwell"},
    • {"title":"La ferme des animaux", "author":"Orwell"}.
  • When running the following request, HyperLogLogPlusPlus objects are instantiated (4 to be precise):
{
  "size": 0,
  "aggregations": {
    "agg_authors": {
      "terms": {
        "field": "author"
      },
      "aggregations": {
        "card_author": {
          "cardinality": {
            "field": "author"
          }
        }
      }
    }
  }
}
  • By construction, the expected cardinality is 1; nonetheless, it was computed:
{
  "took" : 70,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "agg_authors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Asimov",
          "doc_count" : 3,
          "card_author" : {
            "value" : 1
          }
        },
        {
          "key" : "Orwell",
          "doc_count" : 2,
          "card_author" : {
            "value" : 1
          }
        }
      ]
    }
  }
}
Curl commands
curl -XPUT '127.0.0.1:9200/book_index?pretty'         -H 'Content-Type: application/json' -d '{"settings":{"number_of_shards":1,"number_of_replicas":0},"mappings":{"_doc":{"properties":{"author":{"type":"keyword"},"title":{"type":"keyword"}}}}}'
curl -XPUT '127.0.0.1:9200/book_index/_doc/1?pretty'  -H 'Content-Type: application/json' -d '{"title":"Fondation", "author":"Asimov"}'
curl -XPUT '127.0.0.1:9200/book_index/_doc/2?pretty'  -H 'Content-Type: application/json' -d '{"title":"Fondation et Empire", "author":"Asimov"}'
curl -XPUT '127.0.0.1:9200/book_index/_doc/3?pretty'  -H 'Content-Type: application/json' -d '{"title":"Seconde Fondation", "author":"Asimov"}'
curl -XPUT '127.0.0.1:9200/book_index/_doc/4?pretty'  -H 'Content-Type: application/json' -d '{"title":"1984", "author":"Orwell"}'
curl -XPUT '127.0.0.1:9200/book_index/_doc/5?pretty'  -H 'Content-Type: application/json' -d '{"title":"La ferme des animaux", "author":"Orwell"}'
curl -XGET '127.0.0.1:9200/book_index/_search?pretty' -H 'Content-Type: application/json' -d '{"size":0,"aggregations":{"agg_authors":{"terms":{"field":"author"},"aggregations":{"card_author":{"cardinality":{"field":"author"}}}}}}'

In our case, users can filter the result (a graph of average values) on several fields. When they select a single value for a field, the cardinality (which is used to compute the average values) becomes useless.

Could such use cases be optimised (this would be great :) )? Or should we define a separate request for each case?
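
To illustrate the second option, here is a minimal sketch of what "one request for each case" could look like on the client side. The `build_body` helper and the `card` aggregation name are made up for this example, and it assumes a single-valued keyword field with at least one matching document, so that a single selected value implies a cardinality of 1:

```shell
# Hypothetical client-side helper (not part of Elasticsearch).
# Usage: build_body FIELD [VALUE]
# Assumes FIELD and VALUE contain no characters that need JSON escaping.
build_body() {
  field=$1
  value=${2:-}
  if [ -n "$value" ]; then
    # One value selected: the cardinality is 1 by construction,
    # so only filter and skip the cardinality aggregation.
    printf '{"size":0,"query":{"term":{"%s":"%s"}}}\n' "$field" "$value"
  else
    # No value selected: the cardinality has to be computed.
    printf '{"size":0,"aggregations":{"card":{"cardinality":{"field":"%s"}}}}\n' "$field"
  fi
}

build_body author Asimov
build_body author
```

The drawback is that every client has to carry this branching itself, which is why I would prefer the optimisation to happen on the Elasticsearch side if that is feasible.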

2. Impact of precision_threshold on the memory footprint

How can we measure the impact of the precision_threshold parameter on memory?
According to the documentation:

  • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
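
Applying that formula to a few threshold values (3000 being the default) gives the following orders of magnitude. This is only a back-of-the-envelope sketch: `estimate_bytes` is a helper name made up here, and I am assuming the formula applies per counter (one HyperLogLogPlusPlus object):

```shell
# Rough estimate only: applies the "about c * 8 bytes" rule quoted above
# to a single counter. Real usage may differ.
estimate_bytes() {
  echo $(( $1 * 8 ))
}

for c in 3000 100 20 5; do
  echo "precision_threshold=$c -> about $(estimate_bytes "$c") bytes"
done
```

If these numbers are right, each counter stays in the tens of kilobytes at most, which might simply be too small to stand out in a heap graph.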

I monitored our Elasticsearch servers with JMC (Java Mission Control) while running these cardinality aggregations with different precision_threshold values (the default of 3000, then 100, 20 and 5), but I could not see any difference.
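
For reference, here is a sketch of how the threshold can be varied on the book_index example from the first point. `card_body` is a hypothetical helper that only builds the request body; `precision_threshold` is the cardinality aggregation's own parameter:

```shell
# Hypothetical helper: prints the search body for a given precision_threshold.
card_body() {
  printf '{"size":0,"aggregations":{"agg_authors":{"terms":{"field":"author"},"aggregations":{"card_author":{"cardinality":{"field":"author","precision_threshold":%d}}}}}}\n' "$1"
}

# Print the body for each threshold that was compared.
for c in 3000 100 20 5; do
  card_body "$c"
done

# To send one of them (same host and index as in the example above):
#   curl -XGET '127.0.0.1:9200/book_index/_search?pretty' \
#        -H 'Content-Type: application/json' -d "$(card_body 100)"
```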

Any idea besides hprof?

Any opinion on this kind of optimisation?
