Can't scale ElasticSearch cluster

mxmtrms · September 19, 2021, 5:22pm

Hello everyone!

Prerequisites

For the last year my team and I have been developing a project in which we decided to use ElasticSearch. In the early stages of development, we used a simple single-node cluster that more than satisfied our needs. We know ES scales very well vertically and that was fine for us. In the later stages, we even did stress testing and the results were excellent.

However, we are now entering production. To do this, we need to migrate large amounts of data from legacy systems. During the migration, we faced the need to increase the capacity of the cluster, but we were unable to.

A brief overview of the problem

Behavior on the initial cluster (9 data nodes): high CPU consumption on all data nodes (98-99%), medium-sized search queues (100-200 requests), while at a certain level of input load, the latency starts to grow, but RPS remains the same (the search queue is also growing).

Achieved RPS:

2к for search
300 for indexing

The behavior of the doubled cluster (18 data nodes, the number of shards remained the same, but now there is one shard on each node) has not changed in any way: the same high CPU consumption, the same RPS and the same search queue.

Load profile

In our migration process, we interact with one index in ES. We simultaneously perform both search operations and indexing operations. Our migration microservice processes data in parallel mode (at least 500 threads in total for all instances), when processing one message, it makes several search queries (about 5-10) and, based on the results of these queries, makes a decision to update an existing document or create a new one (in theory at the beginning of the process, we should mainly create documents, and by the end of the process we should mainly find). It is worth mentioning two most important features: first, we need to write new documents as quickly as possible, since the quality of the final migration depends on this (that is, we do not use bulk indexing and write in separate documents), and second, we do not have a storage period for documents (quantity documents during migration and in production - the value is monotonically increasing)

RPS targets:

10k RPS for search
2k RPS for indexing

(both search and indexing must occur simultaneously in the same index)

Index structure and types of queries

We use 100% structured search (with a couple of simple parsers). We have 4 business entities (say A, B, C, D), each of which has its own set of attributes.
Generalized form of mapping:

{
    "index-name": {
        "mappings": {
            "dynamic": "strict",
            "properties": {
                // entity A attributes
                "value-attribute": {
                    "properties": {
                        "id": {
                            "type": "long"
                        },
                        "value": {
                            "type": "keyword/long/date"
                        }
                    }
                },
                "value-array-attribute": {
                    "type": "nested",
                    "properties": {
                        "id": {
                            "type": "long"
                        },
                        "value": {
                            "type": "keyword/long/date"
                        }
                    }
                },
                "object-array-attribute": {
                    "type": "nested",
                    "properties": {
                        "id": {
                            "type": "long"
                        },
                        "value": {
                            "properties": {
                                "inner-object-array-attribute-1": {
                                    // any attributes
                                },
                                "inner-object-array-attribute-2": {
                                    // any attributes
                                }
                            }
                        }
                    }
                },
                ...
                "entity-B": {
                    "type": "nested",
                    "properties": {
                        // entity B attributes
                        ...
                        "entity-C": {
                            "type": "nested",
                            "properties": {
                                // entity C attributes (or value, or value-array, or object-array)
                                ...
                            }
                        },
                        "entity-D": {
                            "type": "nested",
                            "properties": {
                                // entity D attributes (or value, or value-array, or object-array)
                                ...
                            }
                        }
                    }
                }
            }
        }
    }
}

It looks complicated (there are many nested fields), but in reality we have no more than 10-15 attributes in each entity (almost all attributes are ordinary values), and we search only for the most important 2-3 attributes in each entity.
Generalized form of queries:

{
    "query": {
        "bool": {
            "filter": [
                {
                    "match": {
                        "attribute-name.value": "some-value"
                    }
                },
                ...
                {
                    "nested": {
                        "path": "entity-B",
                        "query": {
                            "bool": {
                                "filter": [
                                    {
                                        "match": {
                                            "entity-B.attribute-name.value": "some-value"
                                        }
                                    },
                                    ...
                                    {
                                        "nested": {
                                            "path": "entity-B.entity-C",
                                            "query": {
                                                "bool": {
                                                    "filter": [
                                                        {
                                                            "match": {
                                                                "entity-B.entity-C.attribute-name.value": "some-value"
                                                            }
                                                        }
                                                        ...
                                                    ]
                                                }
                                            }
                                        }
                                    ]
                                }
                            }
                        }
                    }
                ]
            }
        }
    }

Cluster structure and index settings.
Unfortunately, all of our cluster nodes are deployed on OpenStack virtual machines. We have 3 separate masters and 9 separate data nodes. We execute requests to data nodes through the NGINX balancer.

Characteristics of each node:

24 core CPU
256GB RAM (31GB for ES heap)
2 TB SSD disk
ElasticSearch 7.6.2

Also on all nodes: swap is disabled, ulimit -n 65535 is set, security policies are disabled.

Index settings:

9 primary shards (each about 80GB)
the relpication factor is 1
refresh interval is 5 seconds

Questions.

Why did not a twofold increase in the cluster yield any results?
If I apply the approach described in this article, will I be able to achieve some positive results?

I tried to provide as much information as possible, but if there are any clarifying questions, I will be glad to answer.

I would be grateful for any help.

P.S. Sorry for my english

Christian_Dahlqvist · September 20, 2021, 4:17am

This sounds like a quite unusual usecase for Elasticsearch. Elasticsearch can scale well horisontally, but that requires a suitable workload. Scaling Elasticsearch for a pure search or indexing load is relatively easy. If you combine searching and indexing, especially if you have an update-heavy load, it gets trickier as the operations will be affecting eachother. I think I can see a number of issues with your scenario that will prevent you from effectively scaling out your work load:

in order to scale read throughput you typically make sure your data is cached in the operating system page cache to as great extent as possible and then add search capacity by increasing the number of replica shards and as a result the number of searches that can be performed in parallel by the cluster. Increasing the number of replica shards however has a negative impact on indexing speed as more data need to be replicated and written for each indexing/update operation.
When you index/update and query the same data at the same time the writes will affect the operating system page cache which queries rely on for good performance. There are also a number of internal data structures that will need to be rebuilt frequently, which will add overhead.
Indexing individual documents into Elasticsearch is very inefficient compared to using bulk requests and does generally result in writes scaling badly. It adds a lot of overhead and can result in a lot of inefficient small segments. This is especially true of you are refreshing frequently in order to make sure that updates are visible for searches.
When you are using nested documents, each nested object is behind the scenes stored as a separate document in Elasticsearch. When you update, add or change anything in the hierarchy the entire document is reindexed, which results in all sub-documents also being reindexed. As documents increase in size and complexity, updates will therefore get more and more expensive.

No, this does not seem appropriate as you have an update heavy load.

system · October 18, 2021, 4:17am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
ElasticSearch Bulk indexing is not scaling Elasticsearch	7	2977	July 5, 2017
Scaling: Cluster for speed or for size? Elasticsearch	6	399	July 6, 2017
Why does scaling Elastic Search on 2 nodes gives no performance gain? Elasticsearch	7	1183	July 5, 2017
Cluster optimization(indexing/query performace) Elasticsearch	4	348	July 6, 2017
Elastic Search Random Node High Load Elasticsearch	3	471	July 6, 2017

Can't scale ElasticSearch cluster

Related topics