CPU load at 100% after 10-12 hours of usage

Hello, we're switching from ES 5.5 to 7.12. After the switch it works perfectly for about 10-12 hours, and then CPU load hits 100%.
Even a simple query such as GET /products/_search?size=0 takes 5-6 seconds.
Everything returns to normal within a couple of seconds when we switch back.

ES version: 7.12
number_of_shards: 18
number_of_replicas: 2
number of nodes: 9
index size: 267 GB
RAM: 32GB
Xms16g
Xmx16g

CPUs: 8
Number of documents: 9581754
50-80 documents indexed per second (no bulk requests used)

Mapping:

    {
      "products" : {
        "mappings" : {
          "dynamic" : "strict",
          "properties" : {
            "categories" : {
              "type" : "long"
            },
            "categoryId" : {
              "type" : "long"
            },
            "ean" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "features" : {
              "type" : "nested",
              "dynamic" : "strict",
              "properties" : {
                "categoryFeatureId" : {
                  "type" : "long"
                },
                "value" : {
                  "type" : "keyword",
                  "eager_global_ordinals" : true
                }
              }
            },
            "localizedData" : {
              "type" : "nested",
              "dynamic" : "strict",
              "properties" : {
                "category" : {
                  "type" : "text",
                  "fields" : {
                    "keyword" : {
                      "type" : "keyword"
                    }
                  }
                },
                "languageShortCode" : {
                  "type" : "keyword"
                },
                "lifeCycle" : {
                  "type" : "date",
                  "format" : "yyyy-MM-dd HH:mm:ss"
                },
                "supplier" : {
                  "type" : "text",
                  "fields" : {
                    "keyword" : {
                      "type" : "keyword"
                    }
                  }
                },
                "title" : {
                  "type" : "text"
                }
              }
            },
            "mpn" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              },
              "analyzer" : "autoCompleteAnalyzer"
            },
            "numericalFeatures" : {
              "type" : "nested",
              "dynamic" : "strict",
              "properties" : {
                "categoryFeatureId" : {
                  "type" : "long"
                },
                "value" : {
                  "type" : "float"
                }
              }
            },
            "price" : {
              "type" : "float"
            },
            "supplierId" : {
              "type" : "long"
            }
          }
        }
      }
    }

Two typical queries: the first calculates the filters (facets), the second fetches the filtered results.

    {
      "aggs": {
        "all": {
          "global": {},
          "aggs": {
            "main": {
              "filter": {
                "bool": {
                  "must": [
                    {
                      "nested": {
                        "path": "localizedData",
                        "query": {
                          "bool": {
                            "must": [
                              {
                                "term": {
                                  "localizedData.languageShortCode": {
                                    "value": "en"
                                  }
                                }
                              },
                              {
                                "bool": {
                                  "should": [
                                    {
                                      "range": {
                                        "localizedData.lifeCycle": {
                                          "lte": "2021-04-23 08:23:32"
                                        }
                                      }
                                    },
                                    {
                                      "bool": {
                                        "must_not": {
                                          "exists": {
                                            "field": "localizedData.lifeCycle"
                                          }
                                        }
                                      }
                                    }
                                  ]
                                }
                              }
                            ]
                          }
                        }
                      }
                    }
                  ]
                }
              },
              "aggs": {
                "general": {
                  "filter": {
                    "bool": {
                      "must": [
                        {
                          "nested": {
                            "path": "features",
                            "query": {
                              "bool": {
                                "must": [
                                  {
                                    "match": {
                                      "features.categoryFeatureId": "36303"
                                    }
                                  },
                                  {
                                    "match": {
                                      "features.value": "y"
                                    }
                                  }
                                ]
                              }
                            }
                          }
                        },
                        {
                          "bool": {
                            "should": [
                              {
                                "term": {
                                  "categoryId": {
                                    "value": 151,
                                    "boost": 1000
                                  }
                                }
                              },
                              {
                                "term": {
                                  "categories": {
                                    "value": 151
                                  }
                                }
                              }
                            ]
                          }
                        }
                      ]
                    }
                  },
                  "aggs": {
                    "price": {
                      "stats": {
                        "field": "price"
                      }
                    },
                    "cat": {
                      "terms": {
                        "field": "categories",
                        "size": 30
                      }
                    },
                    "man": {
                      "terms": {
                        "field": "supplierId",
                        "size": 30
                      }
                    },
                    "nested_features": {
                      "nested": {
                        "path": "features"
                      },
                      "aggs": {
                        "featureId": {
                          "terms": {
                            "field": "features.categoryFeatureId",
                            "size": 100
                          },
                          "aggs": {
                            "featureVal": {
                              "terms": {
                                "field": "features.value",
                                "size": 100
                              }
                            }
                          }
                        }
                      }
                    },
                    "nested_numericalFeatures": {
                      "nested": {
                        "path": "numericalFeatures"
                      },
                      "aggs": {
                        "featureId": {
                          "terms": {
                            "field": "numericalFeatures.categoryFeatureId",
                            "size": 100
                          },
                          "aggs": {
                            "featureVal": {
                              "stats": {
                                "field": "numericalFeatures.value"
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }

    {
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": [
                {
                  "nested": {
                    "path": "localizedData",
                    "query": {
                      "bool": {
                        "must": [
                          {
                            "term": {
                              "localizedData.languageShortCode": {
                                "value": "en"
                              }
                            }
                          },
                          {
                            "bool": {
                              "should": [
                                {
                                  "range": {
                                    "localizedData.lifeCycle": {
                                      "lte": "2021-04-23 08:26:21"
                                    }
                                  }
                                },
                                {
                                  "bool": {
                                    "must_not": {
                                      "exists": {
                                        "field": "localizedData.lifeCycle"
                                      }
                                    }
                                  }
                                }
                              ]
                            }
                          }
                        ]
                      }
                    }
                  }
                },
                {
                  "nested": {
                    "path": "features",
                    "query": {
                      "bool": {
                        "must": [
                          {
                            "match": {
                              "features.categoryFeatureId": "36303"
                            }
                          },
                          {
                            "match": {
                              "features.value": "y"
                            }
                          }
                        ]
                      }
                    }
                  }
                },
                {
                  "bool": {
                    "should": [
                      {
                        "term": {
                          "categoryId": {
                            "value": 151,
                            "boost": 1000
                          }
                        }
                      },
                      {
                        "term": {
                          "categories": {
                            "value": 151
                          }
                        }
                      }
                    ]
                  }
                }
              ]
            }
          },
          "functions": [
            {
              "filter": {
                "nested": {
                  "path": "features",
                  "query": {
                    "exists": {
                      "field": "features.categoryFeatureId"
                    }
                  }
                }
              },
              "weight": 50
            },
            {
              "filter": {
                "match": {
                  "quality": 2
                }
              },
              "weight": 20
            },
            {
              "filter": {
                "match": {
                  "quality": 1
                }
              },
              "weight": 10
            }
          ],
          "boost_mode": "sum"
        }
      }
    }
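
Both queries above embed a timestamp precise to the second (08:23:32 and 08:26:21), so the `localizedData.lifeCycle` range clause differs on almost every request. One thing that may be worth trying, offered as an assumption rather than a diagnosis, is rounding that timestamp on the client so repeated requests send an identical clause that Elasticsearch has a better chance of reusing from cache. The sketch below only builds the clause; `lifecycle_clause` and `round_to_minutes` are hypothetical names, not part of any client library.

```python
from datetime import datetime


def lifecycle_clause(now: datetime, round_to_minutes: int = 5) -> dict:
    """Build the lifeCycle filter from the queries above with a rounded-down timestamp.

    Hypothetical helper: rounding down moves the cutoff back by up to
    round_to_minutes, so very recent lifeCycle values are excluded for a while.
    """
    rounded = now.replace(
        minute=now.minute - now.minute % round_to_minutes,
        second=0,
        microsecond=0,
    )
    return {
        "bool": {
            "should": [
                {
                    "range": {
                        "localizedData.lifeCycle": {
                            # Same format as the mapping: yyyy-MM-dd HH:mm:ss
                            "lte": rounded.strftime("%Y-%m-%d %H:%M:%S")
                        }
                    }
                },
                {
                    "bool": {
                        "must_not": {
                            "exists": {"field": "localizedData.lifeCycle"}
                        }
                    }
                },
            ]
        }
    }


clause = lifecycle_clause(datetime(2021, 4, 23, 8, 23, 32))
print(clause["bool"]["should"][0])
```

Whether this helps depends on how much of the load comes from these clauses, so it would need to be measured on the cluster.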

show stats uploaded

Please don't post pictures of text. They are difficult to read, impossible to search and replicate (if it's code), and some people may not even be able to see them :)

Switch back to what exactly?

What is the output from the _cluster/stats?pretty&human API?

Thanks for the reply, Mark.

Sorry for the images. That was cheating on my part: I thought putting some text into an image was a good way around the character limit for posts.

By "switch back" I mean switching back to ES 5.5.

However, there is one more detail: it really depends on how heavily we're adding and updating documents. After we stop adding new documents, the cluster returns to its normal state. As I said earlier, we don't use bulk requests.

The average JSON size of a single document is 30-50 KB.

Output of _cluster/stats?pretty&human:

{
  "_nodes" : {
    "total" : 15,
    "successful" : 15,
    "failed" : 0
  },
  "cluster_name" : "fo-elastic-7",
  "cluster_uuid" : "iqK7WjT3Sly2xfpTanYxqA",
  "timestamp" : 1619501194396,
  "status" : "green",
  "indices" : {
    "count" : 10,
    "shards" : {
      "total" : 122,
      "primaries" : 43,
      "replication" : 1.8372093023255813,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 54,
          "avg" : 12.2
        },
        "primaries" : {
          "min" : 1,
          "max" : 18,
          "avg" : 4.3
        },
        "replication" : {
          "min" : 1.0,
          "max" : 2.0,
          "avg" : 1.3
        }
      }
    },
    "docs" : {
      "count" : 802738940,
      "deleted" : 156500494
    },
    "store" : {
      "size" : "852.3gb",
      "size_in_bytes" : 915170532682,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "3.1mb",
      "memory_size_in_bytes" : 3336504,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "949.5mb",
      "memory_size_in_bytes" : 995719865,
      "total_count" : 17039615,
      "hit_count" : 349049,
      "miss_count" : 16690566,
      "cache_size" : 9562,
      "cache_count" : 9562,
      "evictions" : 0
    },
    "completion" : {
      "size" : "69.7gb",
      "size_in_bytes" : 74904024220
    },
    "segments" : {
      "count" : 1546,
      "memory" : "69.7gb",
      "memory_in_bytes" : 74919351030,
      "terms_memory" : "69.7gb",
      "terms_memory_in_bytes" : 74910786972,
      "stored_fields_memory" : "951.8kb",
      "stored_fields_memory_in_bytes" : 974656,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "761.2kb",
      "norms_memory_in_bytes" : 779520,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "6.4mb",
      "doc_values_memory_in_bytes" : 6809882,
      "index_writer_memory" : "234.6mb",
      "index_writer_memory_in_bytes" : 246066754,
      "version_map_memory" : "10.6mb",
      "version_map_memory_in_bytes" : 11173852,
      "fixed_bit_set" : "342.2mb",
      "fixed_bit_set_memory_in_bytes" : 358903912,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 3,
          "index_count" : 2
        },
        {
          "name" : "completion",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 12,
          "index_count" : 4
        },
        {
          "name" : "float",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 39,
          "index_count" : 5
        },
        {
          "name" : "long",
          "count" : 16,
          "index_count" : 4
        },
        {
          "name" : "nested",
          "count" : 6,
          "index_count" : 2
        },
        {
          "name" : "object",
          "count" : 7,
          "index_count" : 2
        },
        {
          "name" : "text",
          "count" : 21,
          "index_count" : 5
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [
        {
          "name" : "edge_ngram",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [
        {
          "name" : "standard",
          "count" : 1,
          "index_count" : 1
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.12.0",
        "index_count" : 10,
        "primary_shard_count" : 43,
        "total_primary_size" : "284gb",
        "total_primary_bytes" : 304983067844
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 15,
      "coordinating_only" : 3,
      "data" : 9,
      "data_cold" : 0,
      "data_content" : 0,
      "data_frozen" : 0,
      "data_hot" : 0,
      "data_warm" : 0,
      "ingest" : 3,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 0,
      "voting_only" : 0
    },
    "versions" : [
      "7.12.0"
    ],
    "os" : {
      "available_processors" : 102,
      "allocated_processors" : 102,
      "names" : [
        {
          "name" : "Linux",
          "count" : 15
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Debian GNU/Linux 10 (buster)",
          "count" : 15
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 15
        }
      ],
      "mem" : {
        "total" : "352.2gb",
        "total_in_bytes" : 378177728512,
        "free" : "12.1gb",
        "free_in_bytes" : 13011701760,
        "used" : "340gb",
        "used_in_bytes" : 365166026752,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 205
      },
      "open_file_descriptors" : {
        "min" : 612,
        "max" : 910,
        "avg" : 784
      }
    },
    "jvm" : {
      "max_uptime" : "6.8d",
      "max_uptime_in_millis" : 592985374,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 15
        }
      ],
      "mem" : {
        "heap_used" : "113.2gb",
        "heap_used_in_bytes" : 121643914456,
        "heap_max" : "180gb",
        "heap_max_in_bytes" : 193273528320
      },
      "threads" : 997
    },
    "fs" : {
      "total" : "1.8tb",
      "total_in_bytes" : 2043039191040,
      "free" : "947.9gb",
      "free_in_bytes" : 1017861378048,
      "available" : "942.5gb",
      "available_in_bytes" : 1012068225024
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 15
      },
      "http_types" : {
        "security4" : 15
      }
    },
    "discovery_types" : {
      "zen" : 15
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 15
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}
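
A few ratios make the raw numbers above easier to read. The snippet below is plain arithmetic on values copied from that output; the dictionary layout and the `ratio` helper are only illustrative, and what counts as a healthy value is left open.

```python
# Values copied from the _cluster/stats output above.
stats = {
    "query_cache": {"total_count": 17039615, "hit_count": 349049},
    "docs": {"count": 802738940, "deleted": 156500494},
    "segments_memory_in_bytes": 74919351030,
    "heap_max_in_bytes": 193273528320,
}


def ratio(part: float, whole: float) -> float:
    """Return part / whole, or 0.0 when whole is zero."""
    return part / whole if whole else 0.0


# Share of query cache lookups that were hits.
hit_ratio = ratio(
    stats["query_cache"]["hit_count"],
    stats["query_cache"]["total_count"],
)

# Share of deleted documents among live plus deleted documents.
deleted_ratio = ratio(
    stats["docs"]["deleted"],
    stats["docs"]["count"] + stats["docs"]["deleted"],
)

# Segment memory relative to the cluster-wide maximum heap.
segments_share = ratio(
    stats["segments_memory_in_bytes"],
    stats["heap_max_in_bytes"],
)

print(f"query cache hit ratio:     {hit_ratio:.1%}")
print(f"deleted docs ratio:        {deleted_ratio:.1%}")
print(f"segments memory / heap max: {segments_share:.1%}")
```

Note that almost all of the segment memory is `terms_memory`, which is nearly the same size as the `completion` entry (69.7gb).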

That's far from ideal then. I would start there.

Thank you Mark,

Bulk requests of ~7 MB each increased our indexing speed.
Unfortunately, the problem has not gone away.
It now seems that the outages are unrelated (or only slightly related) to adding/modifying data.
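
For readers who want to try the same change, here is a minimal sketch of grouping documents into bulk request bodies of roughly 7 MB. It only builds the newline-delimited JSON that the `_bulk` endpoint expects; `bulk_bodies`, the sample documents, and the `products` index name as used here are illustrative assumptions, not the code actually used in this thread.

```python
import json
from typing import Iterable, Iterator, Tuple


def bulk_bodies(
    docs: Iterable[Tuple[str, dict]],
    index: str,
    max_bytes: int = 7 * 1024 * 1024,
) -> Iterator[str]:
    """Yield NDJSON bodies for the _bulk API, each about max_bytes at most.

    A single document larger than max_bytes still gets a body of its own.
    """
    lines, size = [], 0
    for doc_id, source in docs:
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        payload = json.dumps(source)
        # +2 for the newline that terminates each of the two lines.
        entry_size = len(action.encode()) + len(payload.encode()) + 2
        if lines and size + entry_size > max_bytes:
            yield "\n".join(lines) + "\n"
            lines, size = [], 0
        lines += [action, payload]
        size += entry_size
    if lines:
        # The bulk body must end with a newline.
        yield "\n".join(lines) + "\n"


# Hypothetical sample documents using fields from the mapping.
sample = [
    ("1", {"price": 9.99, "supplierId": 7}),
    ("2", {"price": 19.5, "supplierId": 7}),
    ("3", {"price": 4.25, "supplierId": 12}),
]

for body in bulk_bodies(sample, "products"):
    print(len(body.encode()), "bytes,", body.count("\n") // 2, "documents")
```

Each yielded body would be sent as a POST to `/_bulk` with the `Content-Type: application/x-ndjson` header.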

What we see now:
Roughly every hour, CPU usage on the data nodes climbs from 10% to 100%.
After reloading php-fpm, CPU load stays normal for 1 hour.
If we don't reload php-fpm, CPU load returns to normal after 1.5-2 hours.

Logs on the HTTP nodes show org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
curl /_cluster/health/?level=shards reports that everything is green and started:

"status":"green","primary_active":true,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0

Interestingly, after a reload it works perfectly for 1 hour. This makes me think the cause is some internal Elasticsearch process.

I would suggest upgrading to the latest 7.12.1; I've seen a few other topics with similar issues that an upgrade fixed.

Dear Mark,

Thanks a lot for your support. Unfortunately, this didn't help.
We're going to experiment with the mappings and reduce the number of nested objects.