High system load during search when total index size is large

smlbiobot · May 30, 2021, 5:48pm

I have a cluster setup where some nodes can experience high system load for a specific index.

Index

The index is time-based, and is named as such:

index-2021.03
index-2021.04
index-2021.05

In other words, I begin a new index with each month.

Index Settings

The index is set up with:

5 primaries
1 replica.

In other words, there are 10 shards total for the index.

Each index is approximately 500GB (that is, each shard is approximately 50GB).
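The shard arithmetic above can be written out as a small sketch. It assumes the 500GB figure is the on-disk size including the replica copies, since that is the reading under which each shard comes to roughly 50GB; the function name is made up for illustration.

```python
def shard_layout(primaries: int, replicas: int, index_size_gb: float) -> dict:
    """Return the number of shard copies and the approximate size of each.

    ASSUMPTION: index_size_gb includes replica copies, which is how the
    post arrives at ~50GB per shard for a 500GB index.
    """
    total_shards = primaries * (1 + replicas)
    return {
        "total_shards": total_shards,
        "gb_per_shard": index_size_gb / total_shards,
    }


layout = shard_layout(primaries=5, replicas=1, index_size_gb=500)
print(layout)  # {'total_shards': 10, 'gb_per_shard': 50.0}
```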

Data Nodes

There are 7 hot data nodes within the cluster. These are the only nodes that contain the indices mentioned above.

Config

Each data node has

32GB RAM
640GB SSD storage
Heap: 16GB
Number of shards per node: ~248

Search

The searches run aggregations against these indices, and every query is filtered: each one contains a bool filter with a fixed time range. (The rest of the query is omitted from this post because it is quite long and contains complex aggregations.)

{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-7d/h",
              "lte": "now/h",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

Search is made against index-*

Expectations

As I understand it, if I perform a search this way and each index only contains documents within that time range, then it shouldn’t matter how large the index is. My expectation is that the shards from the older indices will simply be ignored in these search queries because of the timestamp filter.
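One way to check this expectation rather than assume it is to look at the `_shards` section of the search response, which reports `total`, `successful`, `skipped` and `failed` counts. The sketch below only summarises that section. The sample response is hypothetical and was not measured on this cluster, and the helper name is made up for illustration.

```python
def shard_summary(response: dict) -> dict:
    """Summarise the `_shards` section of a search response."""
    shards = response["_shards"]
    total = shards["total"]
    skipped = shards.get("skipped", 0)
    return {"total": total, "skipped": skipped, "not_skipped": total - skipped}


# HYPOTHETICAL response fragment, not taken from the cluster in this thread.
sample_response = {
    "_shards": {"total": 15, "successful": 15, "skipped": 0, "failed": 0}
}
print(shard_summary(sample_response))
```

If `skipped` stays at 0 while searching index-*, the shards of the older indices are still being queried even though the time filter matches none of their documents.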

Reality

The system load was excessively high for many days. However, once I deleted some old indices, the system load returned to normal, e.g.:

DELETE index-2021.03
DELETE index-2021.04

Questions

What could contribute to this?
If I understand correctly, each search is performed against 5 shards at any time — though if there are more indices, am I searching against more shards because there are only 7 data nodes total?
Should I use more data nodes? (though that seems counterintuitive)
Should I split the index up into smaller indices? Your docs recommend 50GB per shard, which these shards meet.
Should I use beefier data nodes? Is a data node with twice the specs of what I am currently using going to be better than running 2 data nodes?
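On the question of how many shards a search touches, a rough sketch of the fan-out may help. A search against index-* is sent to one copy (primary or replica) of every shard in every matching index, before any shard-skipping optimisation is taken into account. The function name and the even spread across nodes are simplifying assumptions.

```python
def search_fanout(matching_indices: int, primaries_per_index: int, data_nodes: int) -> dict:
    """Estimate shard-level requests for one search across a wildcard pattern.

    ASSUMPTION: shard copies are spread evenly over the data nodes and no
    shard is skipped before the query phase.
    """
    shard_requests = matching_indices * primaries_per_index
    return {
        "shard_requests": shard_requests,
        "avg_per_node": shard_requests / data_nodes,
    }


# Three monthly indices (2021.03 to 2021.05), 5 primaries each, 7 hot nodes.
fanout = search_fanout(matching_indices=3, primaries_per_index=5, data_nodes=7)
print(fanout["shard_requests"])  # 15
```

Under these assumptions each additional monthly index adds 5 shard-level requests per search, regardless of the time filter.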

warkolm · May 31, 2021, 6:45am

What do the Monitoring stats show when you run these queries?
What about hot threads?

smlbiobot · May 31, 2021, 3:18pm

The monitoring stats exceed the norm. The normal load for any one node is under 8. The node has 16 cores and the load went up to 24, with CPU at 100%. As for the hot threads, I don’t know how to read the output properly. I wish there were docs.

Unfortunately I didn’t save them, but I believe the load is related to search, likely aggregations. These are complex queries. But the point remains that I don’t understand why the filter context does not seem to exclude the docs / indices, given that removing indices seems to relieve the load.
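For capturing hot threads the next time the load spikes, the nodes hot threads API is a plain GET request. The sketch below only builds the URL and makes no network call. The base URL is a placeholder, the helper name is made up, and the parameter values mirror the ones shown in the header of the hot threads output later in this thread.

```python
from urllib.parse import urlencode


def hot_threads_url(base_url: str, node: str = "", threads: int = 3,
                    interval: str = "500ms") -> str:
    """Build a URL for GET /_nodes/hot_threads, optionally for one node."""
    path = f"/_nodes/{node}/hot_threads" if node else "/_nodes/hot_threads"
    query = urlencode({
        "threads": threads,             # number of busiest threads to report
        "interval": interval,           # sampling interval
        "ignore_idle_threads": "true",  # hide threads that are just waiting
    })
    return f"{base_url}{path}?{query}"


# ASSUMPTION: the cluster is reachable at localhost:9200.
url = hot_threads_url("http://localhost:9200", node="esnode4")
print(url)
```

Saving the output of that request to a file while the load is high makes it possible to inspect it afterwards.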

Christian_Dahlqvist · May 31, 2021, 3:39pm

I would start with a simple part of the aggregation if possible and measure the speed of that and then gradually make it more complex while measuring performance. This will give you an idea whether it is cluster or query related as well as potentially which part is most expensive.
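One way to follow this approach is to generate a series of request bodies that add one top-level aggregation at a time, run each one, and compare the `took` value (in milliseconds) of each response. A minimal sketch that only builds the bodies; the helper name is made up, and the example request is a placeholder, not the real query.

```python
import copy


def progressive_bodies(full_body: dict) -> list:
    """Return (label, body) pairs that add one top-level aggregation at a time."""
    aggs = full_body.get("aggs") or full_body.get("aggregations") or {}
    base = {
        key: copy.deepcopy(value)
        for key, value in full_body.items()
        if key not in ("aggs", "aggregations")
    }
    base["size"] = 0  # time the query and aggregations only, return no hits

    steps = [("query only", base)]
    included = {}
    for name, definition in aggs.items():
        included[name] = definition
        body = copy.deepcopy(base)
        body["aggs"] = copy.deepcopy(included)
        steps.append((f"+ {name}", body))
    return steps


# HYPOTHETICAL request: the aggregation names are placeholders.
example = {
    "query": {"bool": {"filter": [
        {"range": {"timestamp": {"gte": "now-7d/h", "lte": "now/h"}}}
    ]}},
    "aggs": {
        "by_category": {"terms": {"field": "category.id"}},
        "by_name": {"terms": {"field": "product.name.keyword", "size": 10}},
    },
}

for label, body in progressive_bodies(example):
    print(label, sorted(body.get("aggs", {})))
```

The step at which `took` jumps points at the expensive aggregation.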

smlbiobot · May 31, 2021, 3:51pm

I mainly want to understand why:

Same query with time filter
Higher load: when there are more documents in the index (multiple indices)
Lower load: when there are fewer documents in the index (fewer indices)

The queries are time-filtered so theoretically speaking I should be able to have up to a year of records and the cost should be the same, no? If that’s not the case then please let me know.

Additionally / alternatively

Should I configure my indices differently? Should I configure my data nodes differently? For example:

Is it better to have beefier data nodes with more CPUs + RAM instead of more nodes with fewer CPUs + less RAM? Conventional wisdom suggests that beefier might be better, but I am concerned about node failure and about how easily I can expand data storage capacity gradually when the need arises.

smlbiobot · June 20, 2021, 1:39pm

@Christian_Dahlqvist I have just experienced this again. This comes and goes but my host insists that there’s nothing wrong with my servers, and so I hope that you can shed some light.

Here’s the hot threads output from one of the nodes that significantly exceeded its normal load:

https://gist.github.com/smlbiobot/4300885e86a0b2af4f8898da0fed9ac0

esnode4-hotthreads.txt

::: {esnode4}{LSKHTPRoTCOWAKJAYQB70Q}{q_eVAy06Roy5SkV4UlaFeA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}{ml.machine_memory=33728798720, ml.max_open_jobs=20, xpack.installed=true, box_type=hot, ml.enabled=true}
   Hot threads at 2021-06-20T13:34:49.390Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   83.4% (416.7ms out of 500ms) cpu usage by thread 'elasticsearch[esnode4][search][T#16]'
     4/10 snapshots sharing following 46 elements
       app//org.apache.lucene.util.PriorityQueue.downHeap(PriorityQueue.java:275)
       app//org.apache.lucene.util.PriorityQueue.updateTop(PriorityQueue.java:202)
       app//org.apache.lucene.index.OrdinalMap.<init>(OrdinalMap.java:261)
       app//org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)
       app//org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:147)

This file has been truncated.
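As a rough guide to reading this output: each block starts with a line giving the CPU share of one thread, and the thread name carries the node name and the thread pool. The stack frames below it show what the thread was doing; here `OrdinalMap.build` is Lucene's ordinal map construction, which Elasticsearch uses for global ordinals. A small sketch that pulls out the header lines, assuming they follow the format of the excerpt above; the regular expression and helper name are not part of any Elasticsearch tooling.

```python
import re

# Matches lines such as:
#   83.4% (416.7ms out of 500ms) cpu usage by thread 'elasticsearch[node][pool][T#n]'
HOT_THREAD = re.compile(
    r"(?P<pct>[\d.]+)% \([^)]*\) cpu usage by thread "
    r"'elasticsearch\[(?P<node>[^\]]+)\]\[(?P<pool>[^\]]+)\]"
)


def busiest_threads(hot_threads_text: str) -> list:
    """Return (cpu_percent, node, thread_pool) for each reported thread."""
    return [
        (float(m.group("pct")), m.group("node"), m.group("pool"))
        for m in HOT_THREAD.finditer(hot_threads_text)
    ]


# Header line copied from the excerpt above.
sample = (
    "   83.4% (416.7ms out of 500ms) cpu usage by thread "
    "'elasticsearch[esnode4][search][T#16]'"
)
print(busiest_threads(sample))  # [(83.4, 'esnode4', 'search')]
```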

smlbiobot · June 28, 2021, 4:28pm

Update: I have turned on the index slow search log and discovered that the slow query is doing a boolean search similar to this:

{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "category.id": {
                    "value": 70000000,
                    "boost": 1.0
                  }
                }
              },
              {
                "term": {
                  "category.id": {
                    "value": 70000001,
                    "boost": 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative": true,
            "minimum_should_match": "1",
            "boost": 1.0
          }
        },
        {
          "range": {
            "timestamp": {
              "from": "now-7d/h",
              "to": "now/h",
              "include_lower": true,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "product.name.keyword": {
              "value": "foobar",
              "boost": 1.0
            }
          }
        }
      ],
      "should": [
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "apple",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "orange",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "chocolate",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "hazelnut",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "salt",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "milk",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "sugar",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "coconut",
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "minimum_should_match": "6",
      "boost": 1.0
    }
  },
  "_source": {
    "includes": [
      "product"
    ]
  },
  "size": 0,
  "aggs": {
    "similar_products": {
      "terms": {
        "field": "product.name.keyword",
        "size": 10
      }
    }
  }
}

The actual search is extremely similar (but not related to food). It’s taking 8 seconds to perform this query but it seems pretty simple — should it take that long normally? Search is against about 150 million total documents, but the bool filtered documents within 7 days should return no more than 2 million results.
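For reference, the index slow search log is controlled by dynamic index settings. The sketch below builds a settings body that could be sent with `PUT /index-*/_settings`. The threshold values are placeholders, not the ones used on this cluster, and the helper name is made up.

```python
import json


def slowlog_settings(query_warn: str, query_info: str, fetch_warn: str) -> dict:
    """Build a body for the update index settings API (search slow log)."""
    return {
        "index.search.slowlog.threshold.query.warn": query_warn,
        "index.search.slowlog.threshold.query.info": query_info,
        "index.search.slowlog.threshold.fetch.warn": fetch_warn,
    }


# ASSUMED thresholds, chosen only for illustration.
body = slowlog_settings(query_warn="5s", query_info="2s", fetch_warn="1s")
print(json.dumps(body, indent=2))
```

Queries slower than a threshold are then written to the search slow log of the node that holds the shard, together with the request source.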

smlbiobot · June 28, 2021, 6:29pm

Update: I have found the culprit. It’s the terms aggregation part. However, why does it take 5-6 seconds to do this, and is there any way to speed it up?

"aggs": {
    "similar_products": {
      "terms": {
        "field": "product.name.keyword",
        "size": 10
      }
    }
  }

smlbiobot · June 28, 2021, 6:56pm

I have solved the issue. Mostly thanks to this blog post you wrote:

Specifically, I have enabled eager_global_ordinals on the field and lengthened the refresh interval. It is now working very well. If I run into issues again I will update this topic.
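For anyone landing here with the same problem, this is roughly what the two changes look like as request bodies. It is a sketch under assumptions: the existing mapping of product.name is assumed to be the dynamic-mapping default (text with a keyword sub-field), and the 30s refresh interval is a placeholder because the actual value is not stated above. The current mapping should be fetched with `GET /index-*/_mapping` first, so that the update restates it and only adds the flag.

```python
import copy
import json


def with_eager_global_ordinals(field_mapping: dict, subfield: str = "keyword") -> dict:
    """Return a copy of a field mapping with the flag set on one sub-field."""
    updated = copy.deepcopy(field_mapping)
    updated["fields"][subfield]["eager_global_ordinals"] = True
    return updated


# ASSUMED existing mapping of product.name (dynamic-mapping default).
existing_name_mapping = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

# Body for PUT /index-*/_mapping
mapping_body = {
    "properties": {
        "product": {
            "properties": {
                "name": with_eager_global_ordinals(existing_name_mapping)
            }
        }
    }
}

# Body for PUT /index-*/_settings (ASSUMED value; the default is 1s)
settings_body = {"index": {"refresh_interval": "30s"}}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(settings_body, indent=2))
```

With the flag set, global ordinals are rebuilt at refresh time instead of by the first search after a refresh, which is why a longer refresh interval goes hand in hand with it.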

Thanks — I know that this is mostly me adding updates to this topic, but I couldn’t have figured this out if not for that blog post!

system · July 26, 2021, 6:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
