Nested aggregations are 5~10x slower in ES 6.x than in 5.6.x

I've prepared a test environment to try to find a way to fix this.

I have one machine with 16 cores and 64 GB of RAM, with ES 5.6.8 and ES 6.2.4; each ES instance has Xmx/Xms set to 30 GB.

I have only one index with 35,808,600 docs, 5 shards, codec: best_compression, _source: true, and no stored fields, in both ES versions.

The pri.store.size:
in ES 5.6.8: 18.48 GB
in ES 6.2.4: 12.20 GB
Why this difference?
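
For reference, pri.store.size here is the value reported by the _cat indices API; a minimal sketch of the call, with <index_name> as a placeholder:

GET /_cat/indices/<index_name>?v&h=index,docs.count,pri.store.size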

When performing the same aggregation:
ES 5.6.8 took 358~530 ms
ES 6.2.4 took 5200~12600 ms

Why does this happen? What changed in the major version that degrades performance this way?

I've compared all the cluster and index settings, and they are basically the same (using include_defaults=true to get the defaults too).
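
For reference, a sketch of the comparison calls (cluster and index settings with defaults included; <index_name> is a placeholder):

GET /_cluster/settings?include_defaults=true&flat_settings=true
GET /<index_name>/_settings?include_defaults=true&flat_settings=true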

The aggregation query is:

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "range": { "timestamp_utc": { "gt": "now-30d" } } },
            { "terms": { "element_id": [ "68C894", "BE6053", ... up to 1000 elements ] } }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "by_element": {
      "terms": { "field": "element_id", "size": 999999 },
      "aggs": {
        "by_topic": {
          "terms": { "field": "topic", "size": 999999 },
          "aggs": {
            "by_group": {
              "terms": { "field": "group", "size": 999999 },
              "aggs": {
                "by_type": {
                  "terms": { "field": "type", "size": 999999 },
                  "aggs": {
                    "by_sub_type": {
                      "terms": { "field": "sub_type", "size": 999999, "missing": "N/A" },
                      "aggs": {
                        "by_position": {
                          "terms": { "field": "position_name", "missing": "N/A", "size": 999999 },
                          "aggs": {
                            "by_position_id": {
                              "terms": { "field": "position_id", "missing": "N/A", "size": 999999 },
                              "aggs": {
                                "sent_sub_type": { "sum": { "field": "event_score" } }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Thanks in advance.

@thiago @Mark_Harwood @dadoonet @colings86 @mvg @jpountz guys, some help here please.

Read this and specifically the "Also be patient" part.

It's fine to answer on your own thread after 2 or 3 days (not including weekends) if you don't have an answer.

Please don't ping people directly in your thread if they haven't participated in the discussion yet.


Are you running both ES nodes, with a 30 GB heap each, on a single machine with 64 GB of RAM?

Sorry about that, but I've seen older posts related to aggregations go without any response in the past. :anguished:

Yes, currently I'm testing on one physical machine with these specs, but I also have two clusters, and the same thing happens as I described before.

OK, so regarding the disk space: it is expected that 6.x uses less storage, since it ships with Lucene 7, which handles sparse indices much better. See https://www.elastic.co/blog/minimize-index-storage-size-elasticsearch-6-0

About the long response times: your configuration will always deliver poor and unpredictable performance. A great part of Elasticsearch's performance relies on the OS-level filesystem cache. By running two JVMs with 30 GB heaps on a system with 64 GB of RAM, there won't be enough memory left for caching, and Elasticsearch performance is unpredictable in such an environment. See https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html#_give_less_than_half_your_memory_to_lucene
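
As a sketch, on a 64 GB machine running a single node, the heap settings in config/jvm.options could look like the following; the 26g value is an assumption, chosen only to stay well below half of the physical RAM and under ~32 GB so compressed object pointers remain enabled:

# config/jvm.options (the exact value is an assumption, not a tuned recommendation)
# Equal Xms/Xmx avoids heap resizing; the remaining ~38 GB stays available
# to the OS filesystem cache that Lucene relies on.
-Xms26g
-Xmx26g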

I've not run both Elasticsearch versions at the same time.

In any case, I have two clusters of 6 machines each, with NVMe volumes, 64 GB of RAM, and 16 cores, and the same thing happens.

I've tested from 6.0.0 to 6.2.4, going through all the micro versions, to see if the performance drop occurs in a specific version, and to my surprise the same thing happens in every 6.x version; it does not occur in any 5.x version.

This is related to a specific change between 5.x and 6.x, and I don't know which one; it's not a hardware or OS configuration issue.

Since you are using a fairly complex query there, it may be related to how many segments the index has. You could try running POST /<index_name>/_forcemerge?max_num_segments=1 and repeating the query to see if it's any better (depending on the index size, the force merge operation may take a while).
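
For example, keeping <index_name> as a placeholder, you can compare the segment count before and after with the _cat segments API:

GET /_cat/segments/<index_name>?v

POST /<index_name>/_forcemerge?max_num_segments=1

GET /_cat/segments/<index_name>?v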

If that still doesn't cut it, then I suggest that you install X-Pack and analyze the query performance using the Search Profiler in Kibana.
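
The Search Profiler is built on top of the search profile option, so you can also get the raw timing breakdown by adding "profile": true to the request body. A trimmed-down sketch, keeping only the range filter and the outermost terms aggregation from the query above (<index_name> is a placeholder):

GET /<index_name>/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "constant_score": {
      "filter": { "range": { "timestamp_utc": { "gt": "now-30d" } } }
    }
  },
  "aggs": {
    "by_element": { "terms": { "field": "element_id", "size": 999999 } }
  }
}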

Thanks thiago, this helps a lot; the times were reduced from 5~12 sec to 1.5~3 sec. I have a doubt: when I reindex several indices into one, isn't the data merged by default?

How can we keep max_num_segments=1 at index time? Is that possible?

What other things can I do to reach the same response times in 6.x as in 5.x?

Thanks in advance.

Elasticsearch will keep merging the index in the background while data is being indexed. To better understand what happens, check Mike McCandless's awesome blog post about it. The core issue here is not that the index isn't merging, but that (apparently) too many tiny segments are being created. Are you calling the refresh API externally/manually? Also, what's the refresh interval of the index?
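
As a sketch of what to check, with <index_name> as a placeholder and 30s as an example value only (a longer interval means fewer, larger segments between merges):

GET /<index_name>/_settings?include_defaults=true&filter_path=**.refresh_interval

PUT /<index_name>/_settings
{ "index": { "refresh_interval": "30s" } }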

That is not possible, due to how merging happens. The index can only reach a single segment by calling the force merge API.

At this point, the best option is to run the query in the Search Profiler to investigate further for more potential bottlenecks.

I've found the possible cause of the problems with these nested aggregations, specifically related to the jump to major version 6.x.

Where can I download ES 6.3.x? The URL in the documentation is broken:
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.3.0.tar.gz

In that version the problem is fixed.


Was this problem fixed in this release (6.3.0)?

