BucketCollector takes long time occasionally

Hi.

I execute a aggregation query which matches only less than 100 documents over the index which has 200m documents with 10 shards (1 replica)

Most of the time, the query is executed very quickly (less than 20ms) but sometimes it takes 300ms ~ a couple of seconds. I cannot figure out which one is a problem.

Before profiling, I thought Disk I/O would be the problem. But It turned out that BucketCollector took long time at that time.

As you can see below, all queries are ended within 10ms, but BucketCollector was took 286ms (As I said, Collector would collect only less than 100 documents)

My Questions

  1. The result of profiling is trustworthy? (sometimes time is greater than took)
  2. If BucketCollect is the root cause of lag, what can I do for improving performance? I hope you can give me recommendations.

Thanks in advanced.

   "profile": {
      "shards": [
         {
            "id": "[...][index-name][4]",
            "searches": [
               {
                  "query": [
                     {
                        "query_type": "ConstantScoreQuery",
                        "lucene": "ConstantScore(+field1:keyword1 +field2:keyword2 +(+*:* -field3:) +ConstantScore(_field_names:field3) +(+*:* -field4) +ConstantScore(_field_names:field4))",
                        "time": "8.855492000ms",
                        ...
                        },
                        "children": [
                           {
                              "query_type": "BooleanQuery",
                              "lucene": "+field1:keyword1 +field2:keyword2 +(+*:* -field3:) +ConstantScore(_field_names:field3) +(+*:* -field4) +ConstantScore(_field_names:field4)",
                              "time": "6.159800000ms",
                              ...
                              "children": [
                                 {
                                    "query_type": "TermQuery",
                                    "lucene": "field1:keyword1",
                                    "time": "0.3522230000ms",
                                    ...
                                 },
                                 {
                                    "query_type": "TermQuery",
                                    "lucene": "field2:keyword2",
                                    "time": "0.8661380000ms",
                                    ...
                                 },
                                 {
                                    "query_type": "BooleanQuery",
                                    "lucene": "+*:* -field3:",
                                    "time": "0.5188010000ms",
                                    ...
                                    "children": [
                                       {
                                          "query_type": "MatchAllDocsQuery",
                                          "lucene": "*:*",
                                          "time": "0.0008740000000ms",
                                          ...
                                       },
                                       {
                                          "query_type": "TermQuery",
                                          "lucene": "field3:",
                                          "time": "0.1813230000ms",
                                          ...
                                       }
                                    ]
                                 },
                                 ...
                                 ...
                              ]
                           }
                        ]
                     }
                  ],
                  "rewrite_time": 22178,
                  "collector": [
                     {
                        "name": "MultiCollector",
                        "reason": "search_multi",
                        "time": "286.5113430ms",
                        "children": [
                           {
                              "name": "TotalHitCountCollector",
                              "reason": "search_count",
                              "time": "0.01601100000ms"
                           },
                           {
                              "name": "BucketCollector: [[agg1, agg1_cnt]]",
                              "reason": "aggregation",
                              "time": "286.4594230ms"
                           }
                        ]
                     }
                  ]
               }
            ]
         }

Maybe provide a bit more information on the versions that you use, the amount of nodes and available memory. Could it be you run into memory issues? DO you allocate all memory at once? Did you check for Garbage Collect runs? Maybe system itself is out of memory?

Sorry for insufficient information.

  • ES 2.3.2
  • more than 10 nodes
  • no gc occurred at that time (no young and old gc, I'm using CMS gc)
  • data size is greater than memory
  • DO you allocate all memory at once? => No, I can't
  • executionHint: map
  • collect_mode: breadth_first

As you expect, Disk I/O is suspicious, if then I think searches should be slow. I'm wondering why BucketCollector is slow. I think BucketCollector is a CPU bound job because it aggregates filtered documents from disk.

Can BucketCollector generate Disk I/O?

Thanks.

Aggregations use doc_values, in the end these are files on disk put into the OS memory if I am correct. So it could be Disk I/O

Aggregations use doc_values, in the end these are files on disk put into the OS memory if I am correct. So it could be Disk I/O

That makes sense.

Thanks.