I am indexing metric names in Elasticsearch. Metric names have the form foo.bar.baz.aux. Here are the index settings and the mapping I use:
> {
>   "index": {
>     "analysis": {
>       "analyzer": {
>         "prefix-test-analyzer": {
>           "filter": "dotted",
>           "tokenizer": "prefix-test-tokenizer",
>           "type": "custom"
>         }
>       },
>       "filter": {
>         "dotted": {
>           "patterns": [
>             "([^.]+)"
>           ],
>           "type": "pattern_capture"
>         }
>       },
>       "tokenizer": {
>         "prefix-test-tokenizer": {
>           "delimiter": ".",
>           "type": "path_hierarchy"
>         }
>       }
>     }
>   }
> }
> {
>   "metrics": {
>     "_routing": {
>       "required": true
>     },
>     "properties": {
>       "tenantId": {
>         "type": "string",
>         "index": "not_analyzed"
>       },
>       "unit": {
>         "type": "string",
>         "index": "not_analyzed"
>       },
>       "metric_name": {
>         "index_analyzer": "prefix-test-analyzer",
>         "search_analyzer": "keyword",
>         "type": "string"
>       }
>     }
>   }
> }
The index above creates the following terms for the metric name foo.bar.baz:
foo
bar
baz
foo.bar
foo.bar.baz
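
To make the term list above concrete, here is a rough Python approximation of what the analyzer chain does. It is not Elasticsearch code: `path_hierarchy`, `pattern_capture`, and `index_terms` are hypothetical helper names, and the sketch assumes that pattern_capture keeps the original token (its default) and that only the set of distinct terms matters, not positions.

```python
import re

def path_hierarchy(name, delimiter="."):
    """Approximate the path_hierarchy tokenizer: emit every dot-delimited prefix."""
    parts = name.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]

def pattern_capture(token, pattern=r"([^.]+)"):
    """Approximate the pattern_capture filter: keep the token and add each capture."""
    return [token] + re.findall(pattern, token)

def index_terms(name):
    """Distinct terms produced for one metric name (assumed behavior, see lead-in)."""
    terms = set()
    for token in path_hierarchy(name):
        terms.update(pattern_capture(token))
    return terms

print(sorted(index_terms("foo.bar.baz")))
# ['bar', 'baz', 'foo', 'foo.bar', 'foo.bar.baz']
```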
If I have a bunch of metrics like the ones below:
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above:
> for level = 0, I should get [a, x]
> for level = 1, with 'a' as first token I should get [b]
> with 'x' as first token I should get [y]
> for level = 2, with 'a.b' as the prefix I should get [c, m]
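
To pin down the expected results, here is a small in-memory Python sketch of the behavior I am after. It is only a specification of the desired output, not how I query Elasticsearch; `next_level_tokens` is a hypothetical name, and the level is implied by the number of tokens in the prefix.

```python
def next_level_tokens(metric_names, prefix=""):
    """Return the distinct tokens that directly follow `prefix` (empty prefix = level 0)."""
    prefix_parts = prefix.split(".") if prefix else []
    depth = len(prefix_parts)
    found = set()
    for name in metric_names:
        parts = name.split(".")
        # Keep names that start with the prefix and have at least one more token.
        if len(parts) > depth and parts[:depth] == prefix_parts:
            found.add(parts[depth])
    return sorted(found)

metrics = ["a.b.c.d.e", "a.b.c.d", "a.b.m.n", "x.y.z"]
print(next_level_tokens(metrics))         # ['a', 'x']
print(next_level_tokens(metrics, "a"))    # ['b']
print(next_level_tokens(metrics, "x"))    # ['y']
print(next_level_tokens(metrics, "a.b"))  # ['c', 'm']
```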
I couldn't think of any way to do this other than a terms aggregation. To find the level 2 tokens under a.b, here is the query I came up with:
> time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
>   "size": 0,
>   "query": {
>     "term": {
>       "tenantId": "12345"
>     }
>   },
>   "aggs": {
>     "metric_name_tokens": {
>       "terms": {
>         "field": "metric_name",
>         "include": "a[.]b[.][^.]*",
>         "execution_hint": "map",
>         "size": 0
>       }
>     }
>   }
> }'
This results in the following buckets. I parse the output and grab [c, m] from there:
> "buckets" : [ {
> "key" : "a.b.c",
> "doc_count" : 2
> }, {
> "key" : "a.b.m",
> "doc_count" : 1
> } ]
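
For reference, the client-side part looks roughly like the Python sketch below: build the `include` pattern for a prefix, then take the last token of each bucket key. The helper names are hypothetical, the sketch assumes tokens contain no regex metacharacters, and the `aggregations` wrapper around the buckets is assumed from the usual shape of a search response, since only the buckets are shown above.

```python
def include_pattern(prefix=""):
    """Build the terms-aggregation include regex for the level right below `prefix`."""
    tokens = prefix.split(".") if prefix else []
    # Join with "[.]" so the dot is matched literally, then allow one more token.
    return "[.]".join(tokens + ["[^.]*"])

def last_tokens(response, agg_name="metric_name_tokens"):
    """Extract the final token of every bucket key from a parsed search response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sorted({bucket["key"].rsplit(".", 1)[-1] for bucket in buckets})

# Sample response fragment, using the buckets shown above.
sample = {
    "aggregations": {
        "metric_name_tokens": {
            "buckets": [
                {"key": "a.b.c", "doc_count": 2},
                {"key": "a.b.m", "doc_count": 1},
            ]
        }
    }
}

print(include_pattern("a.b"))  # a[.]b[.][^.]*
print(last_tokens(sample))     # ['c', 'm']
```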
So far so good. The query works well for most tenants (notice the tenantId term query above). For certain tenants that have large amounts of data (around 1 million), it is really slow. I am guessing the terms aggregation is what takes the time.
I am wondering whether a terms aggregation is the right choice for this kind of data, and I am also looking for other kinds of queries that could work here.