Terms aggregation query performance slow

caddala · July 14, 2016, 4:25pm

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here are the index settings and the mapping I use.

> {
>     "index": {
>         "analysis": {
>             "analyzer": {
>                 "prefix-test-analyzer": {
>                     "filter": "dotted",
>                     "tokenizer": "prefix-test-tokenizer",
>                     "type": "custom"
>                 }
>             },
>             "filter": {
>                 "dotted": {
>                     "patterns": [
>                         "([^.]+)"
>                     ],
>                     "type": "pattern_capture"
>                 }
>             },
>             "tokenizer": {
>                 "prefix-test-tokenizer": {
>                     "delimiter": ".",
>                     "type": "path_hierarchy"
>                 }
>             }
>         }
>     }
> }

> {
>     "metrics": {
>         "_routing": {
>             "required": true
>         },
>         "properties": {
>             "tenantId": {
>                 "type": "string",
>                 "index": "not_analyzed"
>             },
>             "unit": {
>                 "type": "string",
>                 "index": "not_analyzed"
>             },
>             "metric_name": {
>                 "index_analyzer": "prefix-test-analyzer",
>                 "search_analyzer": "keyword",
>                 "type": "string"
>             }
>         }
>     }
> }

The above analyzer produces the following terms for the metric name foo.bar.baz:

foo
bar
baz
foo.bar
foo.bar.baz
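
To make the term list above easier to check, here is a minimal Python sketch that approximates the analysis chain: a path_hierarchy-style split followed by a pattern_capture-style filter that keeps the original token. The function names are hypothetical and the code only imitates the resulting set of terms, not Elasticsearch's token positions or ordering.

```python
import re

def path_hierarchy(name, delimiter="."):
    """Emit every dotted prefix of the name, like the path_hierarchy tokenizer."""
    parts = name.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def pattern_capture(token, pattern=r"([^.]+)"):
    """Emit each capture-group match plus the original token."""
    return [m.group(1) for m in re.finditer(pattern, token)] + [token]

def analyze(name):
    """Approximate the distinct terms indexed for one metric name."""
    terms = []
    for token in path_hierarchy(name):
        for term in pattern_capture(token):
            if term not in terms:  # keep distinct terms only
                terms.append(term)
    return terms

print(analyze("foo.bar.baz"))
```

The result contains the same five terms listed above: foo, bar, baz, foo.bar and foo.bar.baz.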

If I have a bunch of metrics like the ones below:

a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z

I have to write a query to grab the nth level of tokens. In the example above:

> for level = 0, I should get [a, x]
> for level = 1, with 'a' as the first token I should get [b]
>                with 'x' as the first token I should get [y]
> for level = 2, with 'a.b' as the prefix I should get [c, m]
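
To pin down the expected behavior independently of Elasticsearch, here is a small in-memory Python sketch of the lookup I am after. `tokens_at_level` is a hypothetical helper, not part of any library; it just states the requirement in code.

```python
def tokens_at_level(metric_names, prefix=""):
    """Return the sorted distinct tokens that directly follow `prefix`."""
    lead = prefix.split(".") if prefix else []
    found = set()
    for name in metric_names:
        parts = name.split(".")
        # The name must start with the prefix tokens and have one more token.
        if parts[:len(lead)] == lead and len(parts) > len(lead):
            found.add(parts[len(lead)])
    return sorted(found)

metrics = ["a.b.c.d.e", "a.b.c.d", "a.b.m.n", "x.y.z"]

print(tokens_at_level(metrics))         # ['a', 'x']
print(tokens_at_level(metrics, "a"))    # ['b']
print(tokens_at_level(metrics, "x"))    # ['y']
print(tokens_at_level(metrics, "a.b"))  # ['c', 'm']
```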

I couldn't think of any way to do this other than a terms aggregation. To figure out the level 2 tokens under a.b, here is the query I came up with.

> time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
>       "size": 0,
>       "query": {
>         "term": {
>             "tenantId": "12345"
>         }
>       },
>       "aggs": {
>           "metric_name_tokens": {
>               "terms": {
>                   "field" : "metric_name",
>                   "include": "a[.]b[.][^.]*",
>                   "execution_hint": "map",
>                   "size": 0
>               }
>           }
>       }
>   }'
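
For an arbitrary prefix, the request body can be generated rather than written by hand. The sketch below is only an illustration: `build_query` is a hypothetical helper, it assumes a non-empty prefix, and it assumes tokens contain no regex metacharacters. It mirrors the query above field for field.

```python
import json

def build_query(tenant_id, prefix):
    """Build the terms-aggregation body for the tokens directly under `prefix`."""
    if not prefix:
        raise ValueError("this sketch only handles a non-empty prefix")
    # Match each literal dot with [.] and then allow exactly one more token.
    include = "[.]".join(prefix.split(".")) + "[.][^.]*"
    return {
        "size": 0,
        "query": {"term": {"tenantId": tenant_id}},
        "aggs": {
            "metric_name_tokens": {
                "terms": {
                    "field": "metric_name",
                    "include": include,
                    "execution_hint": "map",
                    "size": 0,
                }
            }
        },
    }

print(json.dumps(build_query("12345", "a.b"), indent=2))
```

For the prefix a.b this yields the same include pattern as the curl command above.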

This would result in the following buckets. I parse the output and grab [c, m] from there.

>     "buckets" : [ {
>          "key" : "a.b.c",
>          "doc_count" : 2
>        }, {
>          "key" : "a.b.m",
>          "doc_count" : 1
>      } ]
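
The parsing step is roughly the following. This is a sketch with a hypothetical helper name, and the sample response is trimmed to the part shown above, assuming the usual layout where buckets sit under the aggregation name inside "aggregations".

```python
def next_tokens(response, prefix):
    """Strip `prefix` and the following dot from each bucket key."""
    buckets = response["aggregations"]["metric_name_tokens"]["buckets"]
    return [bucket["key"][len(prefix) + 1:] for bucket in buckets]

# Trimmed sample response containing only the buckets from the output above.
sample_response = {
    "aggregations": {
        "metric_name_tokens": {
            "buckets": [
                {"key": "a.b.c", "doc_count": 2},
                {"key": "a.b.m", "doc_count": 1},
            ]
        }
    }
}

print(next_tokens(sample_response, "a.b"))  # ['c', 'm']
```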

So far so good. The query works great for most of the tenants (notice the tenantId term query above). For certain tenants that have large amounts of data (around 1 million), the performance is really slow. I am guessing the terms aggregation is what takes the time.

I am wondering whether a terms aggregation is the right choice for this kind of data, and I am also looking for other possible kinds of queries.
