Slow simple aggregation related to a not analyzed string


(Grégoire Leroy) #1

Hello,

I noticed that some queries/aggregations are very slow relative to their size on Elasticsearch 2.4.

curl -XPOST 'localhost:9201/logstash_netflow_client_v12-2016.11.11/_search?pretty' -d'
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "query": {
                "query_string": {
                  "query": "login:somelogin",
                  "analyze_wildcard": true
                }
              }
            },
            {
              "range": {
                "Timestamp": {
                  "gte": 1478873520648,
                  "lte": 1478877120648,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  }
}'

This request takes ~30 ms and returns 86 hits.

When I add this aggregation:

  "size": 0,
  "aggs": {
    "2": {
      "aggs": {
        "1": {
          "sum": {
            "field": "bits"
          }
        }
      },
      "terms": {
        "field": "talking",
        "size": 10,
        "order": {
          "1": "desc"
        }
      }
    }
  }

With this aggregation, the request takes ~10 s. Here, "talking" is a not_analyzed string and "bits" is a numeric value. When I aggregate on a numeric field (e.g. packets) instead of talking, it takes ~60 ms (which is expected).

When I switch to another string field (login), it takes only a few ms.

The format of login is content@content, whereas the format of talking is IP:PORT<->IP2:PORT2. Before, the separator was "↔" instead of "<->", but I suspected this character could be the cause of the slowness.

So, does anyone have an idea what the problem is here, or how I can troubleshoot it?

Thank you,
Regards,
Grégoire Leroy


(Grégoire Leroy) #2

Additionally, I confirm that I use doc values, so I really do not understand why I have an issue with this particular field, especially when I have so few hits.

Did I miss something?

Thank you,
Regards,
Grégoire


(Grégoire Leroy) #3

Hello,

Is there any additional information I can provide to make this issue clearer? I really do not understand why these small requests on not_analyzed strings, with doc_values, take so long.

Regards,
Grégoire Leroy


(Mark Harwood) #4

I imagine talking has many unique values that need to be evaluated - pretty memory intensive.
Can you use the cardinality agg just to let us know how many unique values you have here?
Also the same for IP:PORT on its own if you have that as a separate field.
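
As a rough sketch, a cardinality request body could look like the following. The field name talking comes from the question; the aggregation name unique_talking is made up for illustration. Sent to the index's _search endpoint without a query it counts across the whole index, and combined with the original query it counts only among the matching documents. Note that the cardinality agg returns an approximate count.

```json
{
  "size": 0,
  "aggs": {
    "unique_talking": {
      "cardinality": {
        "field": "talking"
      }
    }
  }
}
```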


(Mark Harwood) #5

Sorry. Just re-read the question and I see the problem agg example was missing the query from the first example. Ignore my last comment. You're saying the query is part of the request and it should only process 86 hits?

The problem is likely the fixed cost of loading "global ordinals", a more compact representation of high-cardinality values. That cost depends on the number of unique terms in the field, not on the number of hits. Rather than creating interim buckets keyed with the string IP1<->IP2, we use ordinals, which are the term's position in the sorted array of all unique terms. This saves space when there are large numbers of buckets. If you only have 86 hits then we may as well use the actual string values for labelling each bucket rather than messing around with ordinals. This behaviour is controlled using the execution_hint [1], which I suggest you change to map.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.0/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint
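
For illustration, here is roughly what the full request body from the first post could look like with the hint applied. This is a sketch that reuses the query, field names and aggregation names from the question unchanged; the only addition is "execution_hint": "map" inside the terms agg. It has not been run against the poster's index.

```json
{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": { "query": "*", "analyze_wildcard": true }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "query": {
                "query_string": { "query": "login:somelogin", "analyze_wildcard": true }
              }
            },
            {
              "range": {
                "Timestamp": {
                  "gte": 1478873520648,
                  "lte": 1478877120648,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "talking",
        "size": 10,
        "order": { "1": "desc" },
        "execution_hint": "map"
      },
      "aggs": {
        "1": { "sum": { "field": "bits" } }
      }
    }
  }
}
```

With map, buckets are keyed by the actual string values of the matching documents, so this mainly pays off when the query matches few documents relative to the field's overall cardinality.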


(Grégoire Leroy) #6

Hello,

Indeed, map seems to be the right execution_hint. Talking is a field with huge cardinality (tens of millions of values), but most requests only touch a small sample (tens of thousands of documents out of a few billion) with a cardinality of a few thousand.

Is there a way to set the execution_hint in Kibana, or should I use a proxy which intercepts the requests and adds the parameter on the fly?

Thank you very much,
Regards,
Grégoire


(Mark Harwood) #7

Normally the "advanced" section of a Kibana config tab lets you add arbitrary JSON which is merged with the terms agg that Kibana formulates.
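
As a sketch, and assuming the advanced JSON field accepts a plain object that is merged into the terms agg as described, the snippet to enter there would be something like:

```json
{ "execution_hint": "map" }
```

The exact label and location of that field may differ between Kibana versions, so it is worth checking the request Kibana actually sends to confirm the hint ends up inside the terms agg.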

