Hello everyone!
Background
For the last year my team and I have been developing a project in which we decided to use Elasticsearch. In the early stages of development we used a simple single-node cluster, which more than satisfied our needs: Elasticsearch scales very well vertically, and that was enough for us. In the later stages we even ran stress tests, and the results were excellent.
However, we are now entering production. To do this, we need to migrate a large amount of data from legacy systems. During the migration we ran into the need to increase the capacity of the cluster, but we failed to get any benefit from doing so.
A brief overview of the problem
Behavior on the initial cluster (9 data nodes): high CPU consumption on all data nodes (98-99%) and medium-sized search queues (100-200 requests). Past a certain level of input load, latency starts to grow while RPS stays the same (and the search queue keeps growing).
Achieved RPS:
- 2k for search
- 300 for indexing
The behavior of the doubled cluster (18 data nodes; the number of shards stayed the same, so there is now one shard per node) did not change at all: the same high CPU consumption, the same RPS and the same search queue.
Load profile
During the migration we work with a single index in ES, performing search and indexing operations simultaneously. Our migration microservice processes data in parallel (at least 500 threads across all instances). To process one message it runs several search queries (about 5-10) and, based on their results, decides whether to update an existing document or create a new one (in theory, at the beginning of the process we should mostly be creating documents, and by the end mostly finding them). Two features are worth highlighting. First, we need new documents to be written as quickly as possible, since the quality of the final migration depends on it, so we do not use bulk indexing and write documents one at a time. Second, documents have no retention period: their count grows monotonically, both during the migration and in production.
RPS targets:
- 10k RPS for search
- 2k RPS for indexing
(both search and indexing must occur simultaneously in the same index)
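The per-message flow described above boils down to calls like the following (the index name, field names and document ids here are made up for illustration):

```
# one of the ~5-10 lookup queries per message
GET /index-name/_search
{
  "size": 1,
  "query": {
    "bool": {
      "filter": [
        { "match": { "value-attribute.value": "some-value" } }
      ]
    }
  }
}

# depending on the hits: update the existing document...
POST /index-name/_update/existing-doc-id
{ "doc": { /* changed fields */ } }

# ...or immediately index a new one (no bulk)
PUT /index-name/_doc/new-doc-id
{ /* full document */ }
```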
Index structure and types of queries
We use 100% structured search (with a couple of simple parsers). We have 4 business entities (say A, B, C, D), each of which has its own set of attributes.
Generalized form of the mapping:
```
{
  "index-name": {
    "mappings": {
      "dynamic": "strict",
      "properties": {
        // entity A attributes
        "value-attribute": {
          "properties": {
            "id": { "type": "long" },
            "value": { "type": "keyword/long/date" }
          }
        },
        "value-array-attribute": {
          "type": "nested",
          "properties": {
            "id": { "type": "long" },
            "value": { "type": "keyword/long/date" }
          }
        },
        "object-array-attribute": {
          "type": "nested",
          "properties": {
            "id": { "type": "long" },
            "value": {
              "properties": {
                "inner-object-array-attribute-1": {
                  // any attributes
                },
                "inner-object-array-attribute-2": {
                  // any attributes
                }
              }
            }
          }
        },
        ...
        "entity-B": {
          "type": "nested",
          "properties": {
            // entity B attributes
            ...
            "entity-C": {
              "type": "nested",
              "properties": {
                // entity C attributes (value, value-array, or object-array)
                ...
              }
            },
            "entity-D": {
              "type": "nested",
              "properties": {
                // entity D attributes (value, value-array, or object-array)
                ...
              }
            }
          }
        }
      }
    }
  }
}
```
It looks complicated (there are many nested fields), but in practice each entity has no more than 10-15 attributes (almost all of them plain values), and we search only on the 2-3 most important attributes of each entity.
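For illustration, a document under this mapping might look like this (all values are made up):

```
{
  "value-attribute": { "id": 1, "value": "some-keyword" },
  "value-array-attribute": [
    { "id": 2, "value": "2020-01-01" },
    { "id": 3, "value": "2020-02-01" }
  ],
  "entity-B": [
    {
      "entity-C": [ { "id": 4, "value": "some-keyword" } ],
      "entity-D": [ { "id": 5, "value": 123 } ]
    }
  ]
}
```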
Generalized form of the queries:
```
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "attribute-name.value": "some-value"
          }
        },
        ...
        {
          "nested": {
            "path": "entity-B",
            "query": {
              "bool": {
                "filter": [
                  {
                    "match": {
                      "entity-B.attribute-name.value": "some-value"
                    }
                  },
                  ...
                  {
                    "nested": {
                      "path": "entity-B.entity-C",
                      "query": {
                        "bool": {
                          "filter": [
                            {
                              "match": {
                                "entity-B.entity-C.attribute-name.value": "some-value"
                              }
                            }
                            ...
                          ]
                        }
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
```
Cluster structure and index settings
Unfortunately, all of our cluster nodes are deployed on OpenStack virtual machines. We have 3 dedicated master nodes and 9 dedicated data nodes, and we route requests to the data nodes through an NGINX load balancer.
Characteristics of each node:
- 24 CPU cores
- 256GB RAM (31GB for the ES heap)
- 2TB SSD
- Elasticsearch 7.6.2
Also, on all nodes swap is disabled, ulimit -n is set to 65535, and security policies are disabled.
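On each node this amounts to something like the following (a sketch of standard Elasticsearch host preparation, not our exact provisioning scripts):

```
# disable swap for the current boot
# (also remove swap entries from /etc/fstab to make it persistent)
sudo swapoff -a

# raise the open-file limit for the Elasticsearch process
ulimit -n 65535

# in elasticsearch.yml: lock the heap in RAM so it cannot be swapped out
# bootstrap.memory_lock: true
```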
Index settings:
- 9 primary shards (each about 80GB)
- the replication factor is 1
- refresh interval is 5 seconds
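In API terms, these settings correspond to an index created roughly like this (the real index of course also carries the mapping shown above):

```
PUT /index-name
{
  "settings": {
    "number_of_shards": 9,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  }
}
```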
Questions
- Why did doubling the cluster not yield any improvement?
- If I apply the approach described in this article, will I be able to achieve some positive results?
I have tried to provide as much information as possible, but if you have any clarifying questions, I will be glad to answer them.
I would be grateful for any help.
P.S. Sorry for my English.