Slow simple aggregation related to a not analyzed string

gleroy · November 15, 2016, 1:00pm

Hello,

I noticed that some queries/aggregations are very slow relative to their size on Elasticsearch 2.4.

curl -XPOST 'localhost:9201/logstash_netflow_client_v12-2016.11.11/_search?pretty' -d'
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "*",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "query": {
                "query_string": {
                  "query": "login:somelogin",
                  "analyze_wildcard": true
                }
              }
            },
            {
              "range": {
                "Timestamp": {
                  "gte": 1478873520648,
                  "lte": 1478877120648,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  }
}'

This request takes ~30 ms and returns 86 hits.

When I add this aggregation:

  "size": 0,
  "aggs": {
    "2": {
      "aggs": {
        "1": {
          "sum": {
            "field": "bits"
          }
        }
      },
      "terms": {
        "field": "talking",
        "size": 10,
        "order": {
          "1": "desc"
        }
      }
    }
  }

The aggregation then takes ~10 s. Here, "talking" is a not-analyzed string and "bits" is a numeric value. When I run the same aggregation on a numeric field (e.g. packets) instead of talking, it takes ~60 ms (which is expected).

When I change it for another string field (login), it only takes a few ms.

The format of login is content@content, whereas the format of talking is IP:PORT<->IP2:PORT2. Previously the separator was "" instead of "<->", but I suspected that character could be the cause of the long response time.

So, does anyone have an idea what the problem is here, or how I can troubleshoot it?

Thank you,
Regards,
Grégoire Leroy

gleroy · November 17, 2016, 1:28pm

Additionally, I confirm that I use doc values, so I really do not understand why I have an issue with this particular field, especially with so few hits.

Did I miss something?

Thank you,
Regards,
Grégoire

gleroy · November 24, 2016, 10:17am

Hello,

Is there any additional information I can give about this issue to make the report more complete? I really do not understand why these small requests on not-analyzed strings, with doc_values, take so long.

Regards,
Grégoire Leroy

Mark_Harwood · November 24, 2016, 10:23am

I imagine talking has many unique values that need to be evaluated - pretty memory intensive.
Can you use the cardinality agg just to let us know how many unique values you have here?
Also the same for IP:PORT on its own if you have that as a separate field.
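
For reference, a minimal sketch of such a cardinality request body, to be sent to the same _search endpoint as in the first post. The aggregation name unique_talking is only a placeholder, and the count returned by the cardinality agg is approximate:

```json
{
  "size": 0,
  "aggs": {
    "unique_talking": {
      "cardinality": {
        "field": "talking"
      }
    }
  }
}
```

Running it once without a query gives the cardinality across the whole index. Adding the query from the first post gives the cardinality among the matching hits only.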

Mark_Harwood · November 24, 2016, 10:31am

Sorry. Just re-read the question and I see the problem agg example was missing the query from the first example. Ignore my last comment. You're saying this is part of the request and it should only process 86 hits?

The problem is likely the fixed cost of loading "global ordinals" - a more compact representation of high-cardinality values. Rather than creating interim buckets keyed with the string IP1<->IP2, we use ordinals, which are the term's position in the sorted array of all unique terms. This saves space when there are large numbers of buckets. If you only have 86 hits then we may as well use the actual string values for labelling each bucket rather than messing around with ordinals. This behaviour is controlled using the execution_hint [1], which I suggest you change to map.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.0/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint
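
As a rough sketch, assuming the aggregation from the original post is otherwise unchanged, the hint would sit inside the terms block like this. The query part of the request is omitted here and would stay as it was:

```json
{
  "size": 0,
  "aggs": {
    "2": {
      "terms": {
        "field": "talking",
        "size": 10,
        "execution_hint": "map",
        "order": {
          "1": "desc"
        }
      },
      "aggs": {
        "1": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
```

With map, buckets are built from the values of the matching documents only, which should suit a query that matches few documents. On queries that match a large share of the index, the default ordinals-based execution is likely to remain the better choice.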

gleroy · November 24, 2016, 5:20pm

Hello,

Indeed, map seems to be the right execution_hint. talking is a field with huge cardinality (tens of millions), but most requests only touch a small sample (tens of thousands of documents out of a few billion) with a cardinality of a few thousand.

Is there a way to set the execution_hint in Kibana, or should I use a proxy that intercepts the requests and adds the parameter on the fly?

Thank you very much,
Regards,
Grégoire

Mark_Harwood · November 24, 2016, 5:23pm

Normally the "advanced" section of a Kibana config tab lets you add arbitrary JSON which is merged with the terms agg that Kibana formulates.
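
Assuming the JSON input box of the terms aggregation in the visualization editor is what is meant here, the fragment to enter would be just the extra parameter:

```json
{
  "execution_hint": "map"
}
```

Kibana should then merge this into the terms aggregation it generates, so the resulting request carries execution_hint alongside field, size and order.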
