Keyword term aggregations much slower after 5.6 -> 6.2 upgrade

Hello,

I recently upgraded my ES cluster from 5.6 to 6.2, along with Kibana, Logstash, and Filebeat, which also went from 5.6 to 6.2. I had to change my Elasticsearch mapping JSON file because 6.2 rejected things that 5.6 accepted. Now, data indexed after the upgrade is extremely slow to load into a Kibana table that uses term aggregations on several keyword fields. If I go back to data that was indexed before the upgrade (still using all the upgraded software), it loads quickly like it used to.

In addition, if I apply a filter on one of the keyword columns by clicking the + magnifying glass in Kibana, it creates a query of the following type:

{
  "query": {
    "match": {
      "my_text_field_1.keyword": {
        "query": "exact_match_for_the_keyword_text",
        "type": "phrase"
      }
    }
  }
}

ES now responds with: "Cannot search on field [my_text_field_1.keyword] since it is not indexed."

Again, if I go back to data indexed before the upgrade, this query works.

It seems like the culprit is the mapping change, given that documents indexed before the upgrade behave as expected, while data indexed with the new mapping is slow and/or broken. The aggregations themselves still seem to work, since the table is built from term aggregations across my_text_field_1, 2, and 3.
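For what it's worth, an exact match on a keyword sub-field can also be written as a term query, which avoids analysis entirely; a minimal sketch, reusing the field and value from the example above (presumably it would hit the same "not indexed" error here, since the problem appears to be the mapping rather than the query shape):

{
  "query": {
    "term": {
      "my_text_field_1.keyword": "exact_match_for_the_keyword_text"
    }
  }
}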

Below is the relevant section of the new mapping.

{
    "template": "bla-logger-*",
    "settings": {
        "index": {
          "refresh_interval": "10s",
          "number_of_replicas": 1,
          "number_of_shards": 6
        }
    },
    "mappings" : {
      "_default_" : {
       "_source" : { "enabled" : false }
      },
      "doc" : {
        "properties" : {
          "@timestamp" :                   { "type" : "date"},

          "my_text_field_1" :              {
                                                "type": "text",
                                                "fields" : {
                                                    "keyword" : { "type" : "keyword", "eager_global_ordinals": true}
                                                }
                                           },

          "my_text_field_2" :              {
                                                "type": "text",
                                                "fields" : {
                                                    "keyword" : { "type" : "keyword", "eager_global_ordinals": true}
                                                }
                                           },

          "my_text_field_3" :              {
                                                "type": "text",
                                                "fields" : {
                                                    "keyword" : { "type" : "keyword", "eager_global_ordinals": true}
                                                }
                                           },

Here is the relevant section of the old mapping, which I know has contradictory/outdated stuff in it, but which worked better. Most of the changes I made were because ES 6 complained about things that ES 5 did not, so I had to figure out how to express them correctly.

"template": bla-logger-*",
"settings": {
    "index": {
      "refresh_interval": "10s",
      "number_of_replicas": 1,
      "number_of_shards": 6,
      "mapper.dynamic" : false
    }
},
"mappings" : {
  "_default_" : {
  "dynamic_templates": [
    {
      "strings": {
        "match": "*",
        "match_mapping_type": "string",
        "mapping":   { "type": "string",  "doc_values": true, "index": "not_analyzed" }
      }
    },
    {
      "floats": {
        "match": "*",
        "match_mapping_type": "floats",
        "mapping":   { "type": "float",  "doc_values": true, "index": "not_analyzed" }
      }
    }

   ],
   "_all" : { "enabled" : false },
   "_source" : { "enabled" : false }
  },
  "log" : {
    "properties" : {
      "@timestamp" :                   { "type" : "date", "doc_values": true },

  "my_text_field_1" :               {
                                        "type": "string",
                                        "fields" : {
                                            "text" :    { "type" : "text", "norms" : false },
                                            "keyword" : { "type" : "keyword",  "doc_values": true, "index": "not_analyzed", "eager_global_ordinals": true }
                                        }
                                   },

  "my_text_field_2" :                      {
                                        "type": "string",
                                        "fields" : {
                                            "text" :    { "type" : "text", "norms" : false },
                                            "keyword" : { "type" : "keyword",  "doc_values": true, "index": "not_analyzed", "eager_global_ordinals": true }
                                        }
                                   },

  "fmy_text_field_3" :                {
                                        "type": "string",
                                        "fields" : {
                                            "text" :    { "type" : "text", "norms" : false },
                                            "keyword" : { "type" : "keyword",  "doc_values": true, "index": "not_analyzed", "eager_global_ordinals": true}
                                        }
                                   },

The changes I made were as follows:

  1. Previously, I had mapper.dynamic set to false but I also specified some dynamic templates, which is nonsensical, and ES 6 told me so whereas ES 5 allowed it. I took out both the dynamic templates and the mapper.dynamic: false line, but ES still appears to be running with mapper.dynamic set to false.
  2. I took out all the string keywords and replaced them by text/keyword, at the behest of ES 6.
  3. I removed explicit settings of doc_values as I found that the default behavior in all cases was what I wanted.
  4. I removed explicit settings of the index keyword as I found that the default behavior in all cases was what I wanted (and ES 6 complained about my use of "not_analyzed").
  5. I removed the _all enabled: false line as the default behavior is now disabling it.
  6. Where I had previously used string mappings with fields that were text/keyword, I changed them to text mappings with a keyword field.

None of these changes look like obvious culprits to me... Can anyone see the issue here? Thanks for reading this far and let me know if any additional information would be helpful.
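In case it helps with troubleshooting, this is how I can pull the installed templates and the mapping that actually got applied to a post-upgrade index, to double-check whether the keyword sub-field ended up indexed (the index pattern below is just a placeholder matching the snippets above):

GET _template

GET bla-logger-*/_mapping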

Hi,

Is there anything else I can add here to help troubleshoot this? Has anyone else seen the same issue?

Please don't post images of text as they are hard to read and not searchable.

Instead, paste the text and format it with the </> icon. Check the preview window.

Could you update your question?

Hi,

I've updated the post as you asked. I will also note that I found a workaround, at least for the keyword query not working. Loading a large (~3000-row) table of keyword aggregations in Kibana is still much slower than it used to be.

As soon as I changed the field name from 'keyword' to 'raw' (I did this because it was the only visible difference I could find from the example in the docs), it worked as expected. This may be a bug in the new ES version wherein 'keyword' is now effectively a reserved field name, when it was not in 5.6.

To be clear, I changed, for example:

  "my_text_field_1" :               {
                                        "type": "text",
                                        "fields" : {
                                            "keyword" : { "type" : "keyword", "eager_global_ordinals": true}
                                        }
                                   },

to

  "my_text_field_1" :               {
                                        "type": "text",
                                        "fields" : {
                                            "raw" : { "type" : "keyword", "eager_global_ordinals": true}
                                        }
                                   },

Again, my keyword aggregation table (now aggregated on the .raw field instead of .keyword) is still quite slow compared to what it used to be. Loading 15 minutes of a 3000-row table now takes about 100 seconds, 30 of which appear to be just drawing the table in the browser (the pink loading bar disappears but the table has not yet appeared). This used to take under 20 seconds to query the data and draw the table for the same time period and number of rows.

I don't think that changing the name of a field drastically changes the performance.

To load 15 minutes of a 3000 row table now takes about 100 seconds, 30 of which appears to be just to draw the table in the browser

Where are you doing that? It sounds like you are talking about Kibana.
The important thing to look at is the response time of Elasticsearch itself.
For example, run a query from the Kibana Console and look at the took value in the response.

It should not change much between using keyword and raw.
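Something like this in the Console, for instance (the index pattern and field name here are just guesses based on your mapping); the took value at the top of the response is in milliseconds:

GET bla-logger-*/_search
{
  "size": 0,
  "aggs": {
    "top_terms": {
      "terms": {
        "field": "my_text_field_1.raw",
        "size": 4000
      }
    }
  }
}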

Thanks for your responses so far :).

Yes, changing the name of the keyword should not affect the performance and did not.

Changing the name of the keyword should not affect the query failures, but it did.

I thought these might be related issues while I was unable to resolve either of them, but they may be unrelated. I was able to resolve the query failure issue, and the resolution makes it look like there is a bug:

Issue 1: Naming a keyword field name 'keyword' causes queries that should succeed to fail (SOLVED)

When the keyword field for my_text_field_1 was named 'keyword', I ran this query:

{
  "query": {
    "match": {
      "my_text_field_1.keyword": {
        "query": "exact_match_for_the_keyword_text",
        "type": "phrase"
      }
    }
  }
}

It elicited this Elasticsearch response:

"Cannot search on field [my_text_field_1.keyword] since it is not indexed.".

In version 5.6 it worked, and in version 6.2.4 it did not.

Because I was out of ideas about what I had done wrong, I simply made the mapping look exactly like the example in the docs (plus adding eager global ordinals for speed). The only difference was changing the field name to 'raw'.

When the keyword field was named 'raw', I ran this query:

{
  "query": {
    "match": {
      "my_text_field_1.raw": {
        "query": "exact_match_for_the_keyword_text",
        "type": "phrase"
      }
    }
  }
}

And this worked.

Issue 2: Keyword aggregations are much slower than before (UNSOLVED)

Ok, I ran the query that the visualization uses (shown below, field names obscured for privacy).

I copied this query from the spy query tab in the visualization window. When I ran it directly in the Console, the query errored out and said it couldn't get a response from the server.

I ran it with curl (using the copy query as cURL function in Kibana) and the query hung for 11 minutes, so I stopped it. The ES cluster was unresponsive after I did this. After several minutes I was able to restart 2 of the 3 nodes using systemctl, but I did a hard power cycle on the third because systemctl was unable to stop Elasticsearch after 20 minutes. Afterwards the cluster came up. While the cluster was hung, the load in htop was very low and there was no unusual swapping. The ES data nodes have 32 GB of memory for the JVM and each machine has 64 GB. There is also a coordinating and master-eligible node (no data) that has 16 GB for the JVM out of 32 GB on the machine.

The large size value in the query below is just because I wanted all the rows displayed, allowing sorting on the number fields across all rows. There are not actually 1000000 rows, more like 3000.

So sadly I have no performance data outside of Kibana for you... but I can say ES is very unhappy with this query. Running the Kibana visualization never caused the cluster to freeze like this, but it was, as I've said, extremely slow.

I can say that this did, for whatever reason, cause the entire cluster to freeze up... I understand it's a large query, but it worked on 5.6: I could run this query and graph it in Kibana in around 20 seconds.

GET _search
    {
      "size": 0,
      "_source": {
        "excludes": []
      },
      "aggs": {
        "4": {
          "terms": {
            "field": "my_text_field_1.raw",
            "size": 1000000,
            "order": {
              "1": "desc"
            }
          },
          "aggs": {
            "1": {
              "sum": {
                "field": "my_num_field_1"
              }
            },
            "28": {
              "terms": {
                "field": "my_text_field_2.raw",
                "size": 10000000,
                "order": {
                  "6": "desc"
                }
              },
              "aggs": {
                "6": {
                  "sum": {
                    "field": "my_num_field_2"
                  }
                },
                "8": {
                  "terms": {
                    "field": "my_text_field_3.raw",
                    "size": 10000000,
                    "order": {
                      "6": "desc"
                    }
                  },
                  "aggs": {
                    "3": {
                      "terms": {
                        "field": "my_text_field_4.raw",
                        "size": 10000000,
                        "order": {
                          "_term": "desc"
                        }
                      },
                      "aggs": {
                        "1": {
                          "sum": {
                            "field": "my_num_field_1"
                          }
                        },
                        "6": {
                          "sum": {
                            "field": "my_num_field_3"
                          }
                        },
                        "7": {
                          "sum": {
                            "field": "my_num_field_4"
                          }
                        },
                        .... more of the same omitted for length .....
                        "36": {
                          "sum": {
                            "field": "my_num_field_21"
                          }
                        }
                      }
                    },
                    "6": {
                      "sum": {
                        "field": "my_num_field_22"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },
      "stored_fields": [
        "*"
      ],
      "script_fields": {},
      "docvalue_fields": [
        "@timestamp"
      ],
      "query": {
        "bool": {
          "must": [
            {
              "query_string": {
                "analyze_wildcard": true,
                "query": "*",
                "default_field": "*"
              }
            },
            {
              "range": {
                "@timestamp": {
                  "gte": 1525179591745,
                  "lte": 1525180491746,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "filter": [],
          "should": [],
          "must_not": []
        }
      }
    }
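If it helps, one follow-up I could try is a cut-down version of this request (a single terms agg with a small size) with "profile": true, to see where Elasticsearch spends its time. A rough sketch, reusing the placeholder field names from above:

GET _search
{
  "size": 0,
  "profile": true,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-15m"
      }
    }
  },
  "aggs": {
    "by_field_1": {
      "terms": {
        "field": "my_text_field_1.raw",
        "size": 4000
      }
    }
  }
}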

Changing the name of the keyword should not affect the query failures, but it did.

Can you reproduce that from scratch and provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem.

Keyword aggregations are much slower than before (UNSOLVED)

"size": 1000000 This is bad IMO.

Are you saying that you believe "size":1000000 is the problem?

This setting is unchanged from what I used in Elasticsearch 5.6.2, and it worked fine there. Moreover, if I change it to 4000, which is a safe upper bound on the number of rows today (since I do indeed want to show the entire system), it still doesn't work. The query works fine for tables with a small number of rows (even with size set to 1000000).

I'm reporting a serious change in behavior for large keyword aggregations like the one above, which worked fine before but now barely works at all: it takes several minutes to load when queried through Kibana's visualization, and it causes my cluster to hang (I even needed to hard power cycle one of the machines because systemctl could not shut the process down) when sent through Kibana Dev Tools or cURL.

Can you try without any size for all aggs?

When I remove the

"size": 4000,
"order": {
  "6": "desc"
}

for each agg, it appears to silently default the size to 10 and quickly returns a very incomplete set of data. I'm not sure how it picks the 10 keywords.

Just for kicks, I retried the cURL and Kibana Dev Tools queries with all term aggs set to size 4000, all ordered by "6": desc. I found that what appears to cause Elasticsearch to hang and stop responding to systemctl stop or SIGTERM is starting the Dev Tools query while Kibana has not yet finished loading the table. (I did this again today out of impatience and had to SIGKILL each of my data nodes when the cluster stopped working.)

If I let Kibana finish doing the size 4000 query, this takes 2 minutes 45 seconds for 15 minutes of data. If I run cURL or devtools on the size 4000 query, the query indeed completes and takes about 25 seconds.

So while I don't think ES is working perfectly (it hangs for at least 20 minutes in the above scenario, and probably would continue to hang, but I eventually SIGKILLed it), the main culprit may be Kibana. I will crosspost there.

By default a terms agg indeed returns the 10 most frequent terms.
Your test confirmed what I suspected:

"size": 1000000 This is bad IMO.

Maybe this would help: Terms aggregation | Elasticsearch Guide [8.11] | Elastic
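In particular, instead of a single terms agg with a huge size, the composite aggregation can page through all the buckets; a rough sketch (the index pattern and field names are assumptions based on your query), where the last bucket key of each page is passed back as "after" to fetch the next page:

GET bla-logger-*/_search
{
  "size": 0,
  "aggs": {
    "all_terms": {
      "composite": {
        "size": 1000,
        "sources": [
          { "field_1": { "terms": { "field": "my_text_field_1.raw" } } }
        ]
      },
      "aggs": {
        "sum_1": {
          "sum": { "field": "my_num_field_1" }
        }
      }
    }
  }
}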

Thanks for the article, I'll look through that today and see how I can integrate it with the Kibana dashboard JSON.

Just to be clear for anyone else who reads this, setting "size" to something much bigger than the number of terms does not appear to cause a performance decrease. In particular, in my example, setting size to 20000 results in a query that takes 22.6 seconds for 15 minutes of data, and setting size to 1000000 results in a query that takes 21.6 seconds for another 15-minute period. The query responses were the same size, which is to say that a size of 20000 was large enough to capture my full data set. A note on methodology: the 15-minute periods were non-overlapping so that caching did not affect the results, but the document counts were within 0.2% of each other, so they were of comparable size.

The main problem appears to be that the 6.2 update to Kibana replaced the old, fast implementation of the table with a new, inefficient one, and fixing that does not seem to be a priority, I guess: see my question here.

I appreciate all the help in troubleshooting.
