Can't scale ElasticSearch cluster

Hello everyone!

Prerequisites

For the last year my team and I have been developing a project in which we decided to use ElasticSearch. In the early stages of development, we used a simple single-node cluster that more than satisfied our needs. We know ES scales very well vertically and that was fine for us. In the later stages, we even did stress testing and the results were excellent.

However, we are now entering production. To do this, we need to migrate large amounts of data from legacy systems. During the migration, we faced the need to increase the capacity of the cluster, but we were unable to.

A brief overview of the problem

Behavior on the initial cluster (9 data nodes): high CPU consumption on all data nodes (98-99%), medium-sized search queues (100-200 requests), while at a certain level of input load, the latency starts to grow, but RPS remains the same (the search queue is also growing).

Achieved RPS:

  • 2к for search
  • 300 for indexing

The behavior of the doubled cluster (18 data nodes, the number of shards remained the same, but now there is one shard on each node) has not changed in any way: the same high CPU consumption, the same RPS and the same search queue.

Load profile

In our migration process, we interact with one index in ES. We simultaneously perform both search operations and indexing operations. Our migration microservice processes data in parallel mode (at least 500 threads in total for all instances), when processing one message, it makes several search queries (about 5-10) and, based on the results of these queries, makes a decision to update an existing document or create a new one (in theory at the beginning of the process, we should mainly create documents, and by the end of the process we should mainly find). It is worth mentioning two most important features: first, we need to write new documents as quickly as possible, since the quality of the final migration depends on this (that is, we do not use bulk indexing and write in separate documents), and second, we do not have a storage period for documents (quantity documents during migration and in production - the value is monotonically increasing)

RPS targets:

  • 10k RPS for search
  • 2k RPS for indexing

(both search and indexing must occur simultaneously in the same index)

Index structure and types of queries

We use 100% structured search (with a couple of simple parsers). We have 4 business entities (say A, B, C, D), each of which has its own set of attributes.
Generalized form of mapping:

{
    "index-name": {
        "mappings": {
            "dynamic": "strict",
            "properties": {
                // entity A attributes
                "value-attribute": {
                    "properties": {
                        "id": {
                            "type": "long"
                        },
                        "value": {
                            "type": "keyword/long/date"
                        }
                    }
                },
                "value-array-attribute": {
                    "type": "nested",
                    "properties": {
                        "id": {
                            "type": "long"
                        },
                        "value": {
                            "type": "keyword/long/date"
                        }
                    }
                },
                "object-array-attribute": {
                    "type": "nested",
                    "properties": {
                        "id": {
                            "type": "long"
                        },
                        "value": {
                            "properties": {
                                "inner-object-array-attribute-1": {
                                    // any attributes
                                },
                                "inner-object-array-attribute-2": {
                                    // any attributes
                                }
                            }
                        }
                    }
                },
                ...
                "entity-B": {
                    "type": "nested",
                    "properties": {
                        // entity B attributes
                        ...
                        "entity-C": {
                            "type": "nested",
                            "properties": {
                                // entity C attributes (or value, or value-array, or object-array)
                                ...
                            }
                        },
                        "entity-D": {
                            "type": "nested",
                            "properties": {
                                // entity D attributes (or value, or value-array, or object-array)
                                ...
                            }
                        }
                    }
                }
            }
        }
    }
}

It looks complicated (there are many nested fields), but in reality we have no more than 10-15 attributes in each entity (almost all attributes are ordinary values), and we search only for the most important 2-3 attributes in each entity.
Generalized form of queries:

{
    "query": {
        "bool": {
            "filter": [
                {
                    "match": {
                        "attribute-name.value": "some-value"
                    }
                },
                ...
                {
                    "nested": {
                        "path": "entity-B",
                        "query": {
                            "bool": {
                                "filter": [
                                    {
                                        "match": {
                                            "entity-B.attribute-name.value": "some-value"
                                        }
                                    },
                                    ...
                                    {
                                        "nested": {
                                            "path": "entity-B.entity-C",
                                            "query": {
                                                "bool": {
                                                    "filter": [
                                                        {
                                                            "match": {
                                                                "entity-B.entity-C.attribute-name.value": "some-value"
                                                            }
                                                        }
                                                        ...
                                                    ]
                                                }
                                            }
                                        }
                                    ]
                                }
                            }
                        }
                    }
                ]
            }
        }
    }

Cluster structure and index settings.
Unfortunately, all of our cluster nodes are deployed on OpenStack virtual machines. We have 3 separate masters and 9 separate data nodes. We execute requests to data nodes through the NGINX balancer.

Characteristics of each node:

  • 24 core CPU
  • 256GB RAM (31GB for ES heap)
  • 2 TB SSD disk
  • ElasticSearch 7.6.2

Also on all nodes: swap is disabled, ulimit -n 65535 is set, security policies are disabled.

Index settings:

  • 9 primary shards (each about 80GB)
  • the relpication factor is 1
  • refresh interval is 5 seconds

Questions.

  1. Why did not a twofold increase in the cluster yield any results?
  2. If I apply the approach described in this article, will I be able to achieve some positive results?

I tried to provide as much information as possible, but if there are any clarifying questions, I will be glad to answer.

I would be grateful for any help.

P.S. Sorry for my english :slight_smile:

This sounds like a quite unusual usecase for Elasticsearch. Elasticsearch can scale well horisontally, but that requires a suitable workload. Scaling Elasticsearch for a pure search or indexing load is relatively easy. If you combine searching and indexing, especially if you have an update-heavy load, it gets trickier as the operations will be affecting eachother. I think I can see a number of issues with your scenario that will prevent you from effectively scaling out your work load:

  1. in order to scale read throughput you typically make sure your data is cached in the operating system page cache to as great extent as possible and then add search capacity by increasing the number of replica shards and as a result the number of searches that can be performed in parallel by the cluster. Increasing the number of replica shards however has a negative impact on indexing speed as more data need to be replicated and written for each indexing/update operation.
  2. When you index/update and query the same data at the same time the writes will affect the operating system page cache which queries rely on for good performance. There are also a number of internal data structures that will need to be rebuilt frequently, which will add overhead.
  3. Indexing individual documents into Elasticsearch is very inefficient compared to using bulk requests and does generally result in writes scaling badly. It adds a lot of overhead and can result in a lot of inefficient small segments. This is especially true of you are refreshing frequently in order to make sure that updates are visible for searches.
  4. When you are using nested documents, each nested object is behind the scenes stored as a separate document in Elasticsearch. When you update, add or change anything in the hierarchy the entire document is reindexed, which results in all sub-documents also being reindexed. As documents increase in size and complexity, updates will therefore get more and more expensive.

No, this does not seem appropriate as you have an update heavy load.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.