High system load during search when total index size is large

I have a cluster where some nodes experience high system load for a specific set of indices.

Index

The indices are time-based and named as follows:

index-2021.03
index-2021.04
index-2021.05

In other words, I begin a new index with each month.

Index Settings

Each index is set up with

  • 5 primary shards
  • 1 replica of each primary

In other words, there are 10 shards in total per index.

Each index is approximately 500GB (that is, each shard is approximately 50GB).

Data Nodes

There are 7 hot data nodes within the cluster. These are the only nodes that contain the indices mentioned above.

Config

Each data node has

  • 32GB Ram
  • 640GB SSD storage
  • Heap: 16GB
  • Number of shards per node: ~248

Search

The searches perform aggregations against the indices, and they are filtered: each query contains a bool filter with a set time range. (The rest of the query is truncated from this post because it is quite long and contains complex aggregations.)

{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-7d/h",
              "lte": "now/h",
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

Search is made against index-*
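Since the indices are monthly, a 7-day query can only ever match documents in the current and previous month, so one option is to build the index list on the client side instead of searching index-*. A hedged sketch (the index names are illustrative):

GET index-2021.04,index-2021.05/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-7d/h",
              "lte": "now/h"
            }
          }
        }
      ]
    }
  }
}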

Expectations

As I understand it, if I perform a search this way and each index only contains documents within that time range, then it shouldn’t matter how large the index is. My expectation is that the shards from the older indices will simply be ignored in these search queries because of the timestamp filter.

Reality

The system load got excessively high for many days. However, once I deleted some old indices, the system load returned to normal, e.g.:

DELETE index-2021.03
DELETE index-2021.04

Questions

  • What could contribute to this?
  • If I understand correctly, each search is performed against 5 shards per index at any time. If there are more indices, am I searching against more shards even though there are only 7 data nodes in total?
  • Should I use more data nodes? (That seems counter-intuitive.)
  • Should I split the index up into smaller indices? Your docs recommend 50GB per shard, which these shards already meet.
  • Should I use beefier data nodes? Is one data node with twice the specs of what I am currently using going to be better than running 2 data nodes?

What do the Monitoring stats show when you run these queries?
What about hot threads?

The monitoring stats exceed the norm. The norm for any one node is a load under 8; the node has 16 cores and the load went up to 24, with CPU at 100%. As for the hot threads, I don't know how to read the output properly; I wish there were docs.
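For reference, the hot threads output discussed here comes from the nodes hot threads API; the threads parameter limits how many of the busiest threads are sampled per node:

GET _nodes/hot_threads?threads=3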

Unfortunately I didn’t save them, but I believe it’s related to search, likely the aggregations, as these are complex queries. The point remains that I don’t understand why the filter context does not seem to exclude the docs/indices, given that removing indices relieves the load.

I would start with a simple part of the aggregation if possible, measure its speed, and then gradually make it more complex while measuring performance. This will give you an idea of whether the problem is cluster-related or query-related, as well as which part is most expensive.
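To see where time is actually being spent, the search profile API can also help: adding "profile": true to the request body returns per-shard timings for the query and the aggregations. A minimal sketch using the time-range filter from above (the full aggregations would go in the same body):

GET index-*/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": "now-7d/h",
              "lte": "now/h"
            }
          }
        }
      ]
    }
  }
}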


I mainly want to understand why:

  • Same query with the same time filter
  • Higher load: when there are more documents across the indices (multiple indices)
  • Lower load: when there are fewer documents (fewer indices)

The queries are time-filtered, so theoretically I should be able to keep up to a year of records and the cost should stay the same, no? If that’s not the case, please let me know.

Additionally / alternatively

Should I configure my indices differently? Should I configure my data nodes differently? For example:

  • Is it better to have beefier data nodes with more CPUs + RAM instead of more nodes with fewer CPUs + RAM? Conventional wisdom suggests that beefier might be better, but I am concerned about node failure and also about the ease of expanding data storage capacity gradually as the need arises.

@Christian_Dahlqvist I have just experienced this again. This comes and goes but my host insists that there’s nothing wrong with my servers, and so I hope that you can shed some light.

Here’s the hot threads output from one of the nodes that significantly exceeded its normal load:

Update: I have turned on the search slow log and discovered that the slow query is doing a boolean search similar to this:

{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "category.id": {
                    "value": 70000000,
                    "boost": 1.0
                  }
                }
              },
              {
                "term": {
                  "category.id": {
                    "value": 70000001,
                    "boost": 1.0
                  }
                }
              }
            ],
            "adjust_pure_negative": true,
            "minimum_should_match": "1",
            "boost": 1.0
          }
        },
        {
          "range": {
            "timestamp": {
              "from": "now-7d/h",
              "to": "now/h",
              "include_lower": true,
              "include_upper": true,
              "boost": 1.0
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "product.name.keyword": {
              "value": "foobar",
              "boost": 1.0
            }
          }
        }
      ],
      "should": [
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "apple",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "orange",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "chocolate",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "hazelnut",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "salt",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "milk",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "sugar",
              "boost": 1.0
            }
          }
        },
        {
          "term": {
            "product.ingredients.key.keyword": {
              "value": "coconut",
              "boost": 1.0
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "minimum_should_match": "6",
      "boost": 1.0
    }
  },
  "_source": {
    "includes": [
      "product"
    ]
  },
  "size": 0,
  "aggs": {
    "similar_products": {
      "terms": {
        "field": "product.name.keyword",
        "size": 10
      }
    }
  }
}

The actual search is extremely similar (but not related to food). It takes 8 seconds to perform this query, yet it seems pretty simple; should it normally take that long? The search runs against about 150 million documents in total, but the bool filter for the last 7 days should match no more than 2 million documents.

Update: I have found the culprit: it’s the terms aggregation. However, why does it take 5-6 seconds, and is there any way to speed it up?

"aggs": {
  "similar_products": {
    "terms": {
      "field": "product.name.keyword",
      "size": 10
    }
  }
}
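One knob worth trying on a heavily filtered terms aggregation is execution_hint. By default, a terms aggregation on a keyword field uses global ordinals, which must be built over the whole shard rather than just the documents matching the filter; the "map" hint aggregates the matched documents' values directly, which can be cheaper when the filtered set is small. A hedged sketch (whether it actually helps depends on the field's cardinality and should be measured):

"aggs": {
  "similar_products": {
    "terms": {
      "field": "product.name.keyword",
      "size": 10,
      "execution_hint": "map"
    }
  }
}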

I have solved the issue, mostly thanks to this blog post you wrote:

Specifically, I have enabled eager_global_ordinals on the field and lengthened the refresh interval. It is now working very well. If I run into issues again, I will update this topic.
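For anyone landing here later: both changes can be applied to an existing index, since eager_global_ordinals is an updatable mapping parameter and refresh_interval is a dynamic index setting. A sketch, assuming product.name is a text field with a keyword sub-field (that mapping shape is an assumption inferred from the field name used in the query):

PUT index-2021.05/_mapping
{
  "properties": {
    "product": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          }
        }
      }
    }
  }
}

PUT index-2021.05/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}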

Thanks! I know this is mostly me adding updates to this topic, but I couldn’t have figured it out without that blog post.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.