Performance of using Elasticsearch to search for people

Problem Context

I am trying to use Elasticsearch to implement user search in a platform. However, searches are taking too long, usually around 100 ms to 200 ms. I am trying to achieve something closer to a 40 ms maximum.

My cluster has 4 data nodes and 3 dedicated master nodes.

I created my users index (around 60 million documents) with the default number of shards (5), 1 replica, and the following mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "name_ar_analyzer": {
          "tokenizer": "standard",
          "filter": ["arabic_normalization", "trim"]
        },
        "name_full_ar_analyzer": {
          "tokenizer": "keyword",
          "filter": ["arabic_normalization", "trim"]
        },
        "name_en_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "trim"]
        },
        "name_full_en_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "asciifolding", "trim"]
        }
      }
    }
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "full": {
            "type": "text",
            "analyzer": "name_full_en_analyzer"
          }
        },
        "analyzer": "name_en_analyzer"
      },
      "picture": {
        "type": "text",
        "index": false
      },
      "@timestamp": {
        "type": "date"
      }
    }
  }
}

I am using this search query:

{
  "query": {
    "bool": {
      "must": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "John Do",
                  "fuzziness": "auto"
                }
              }
            },
            "boost": 1
          }
        }
      ],
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {
                "name.full": {
                  "query": "John Do"
                }
              }
            },
            "boost": 10
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {
                "name": {
                  "query": "John Do",
                  "fuzziness": "auto",
                  "operator": "and"
                }
              }
            },
            "boost": 5
          }
        },
        {
          "constant_score": {
            "filter": {
              "match_phrase_prefix": {
                "name": {
                  "query": "John Do",
                  "slop": 1,
                  "max_expansions": 5
                }
              }
            },
            "boost": 10
          }
        }
      ]
    }
  }
}

The reasoning behind this query is as follows:

  • Everything is wrapped in a "constant_score" to bypass TF-IDF scoring, because when you search for "john", it should intuitively match people whose names are "John", not "John John John".

  • The first should clause boosts exact matches on the name.full field, because for the query "John do", a user named "John do" is a better match than results named "john John" or "John".

  • The second should clause exists because for the query "John Doe", documents matching both terms are more important than documents matching only one of them, like "John smith" or "Jane doe".

  • The third should clause (match_phrase_prefix) allows us to get results like "John Doe" for the queries "John D" or "John Do".

So basically, full exact matches and matched phrases (even if incomplete) are more important than random documents that contain one matching term multiple times.
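As a side note for anyone reproducing this setup: you can sanity-check what the custom analyzers actually emit with the _analyze API (the index name users here is an assumption; substitute your own):

```
POST /users/_analyze
{
  "analyzer": "name_en_analyzer",
  "text": "John Doe"
}
```

With the standard tokenizer plus lowercase/asciifolding this should return the tokens john and doe, while name_full_en_analyzer (keyword tokenizer) returns the single token john doe, which is why name.full only matches the full name.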

Any advice?

  • Is the query itself a reason for the slow performance? Can it be written in a better way that achieves the wanted results?
  • Is there anything in the index settings that should be changed for better performance?
  • Is the cluster suited for such a performance expectation?

While not directly answering any of your questions, I'd recommend using something like the Profile API to get an idea of what your query is actually spending its time on, and where you might have bottlenecks or options for improvement.

(Note: I don't deal with Elasticsearch queries like you're running, so I can't really answer any of your questions directly, but I think the Profile API is probably a good starting point. Kibana has a nice UI for this as well if you don't want to read the raw JSON: Profile queries and aggregations | Kibana Guide [8.3] | Elastic)
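For reference, profiling is enabled per request by adding "profile": true to the search body (index name users assumed):

```
GET /users/_search
{
  "profile": true,
  "query": {
    "match": { "name": "John Do" }
  }
}
```

The response then includes a profile section with a per-shard breakdown of how long each query component spent rewriting, building scorers, and scoring documents, which should show whether the fuzzy clauses dominate.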

Another note: you didn't mention which Elasticsearch version you're on, but you do say:

with the default number of shards (5), with 1 replica

This hasn't been the default for a while (7.0 changed the default to 1 primary shard). I'd also recommend looking into upgrading to the latest version of Elasticsearch, as there have been a lot of performance improvements.

But a note (that I believe is correct): in theory you can have 1 shard per CPU on your data nodes, so if you have something like 4 CPUs per data node, you could in theory scale to something like 16 shards (4 nodes × 4 CPUs) to get increased throughput on the index. (Just make sure the disk layer is actually able to handle this.)
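Since the primary shard count is fixed at index creation, experimenting with different counts means creating a new index and reindexing into it; a rough sketch, with users_v2 as a placeholder name:

```
PUT /users_v2
{
  "settings": {
    "number_of_shards": 16,
    "number_of_replicas": 1
  }
}

POST /_reindex
{
  "source": { "index": "users" },
  "dest": { "index": "users_v2" }
}
```

You'd then repeat the benchmark against users_v2 before committing to the new layout.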

Yes, unfortunately I am required to stay on version 7.10. I hope this is not the reason for the performance issues. I will definitely be trying the Profile API, thank you for that suggestion!

Per Elasticsearch, you can have approximately 20 shards per GB of heap per node.
You can also add a replica: with 5 shards and 2 replicas this can help increase search speed. Be careful increasing it beyond 2 replicas, though, as that can cause performance issues.
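Unlike the primary shard count, the replica count can be changed on a live index with a settings update (index name users assumed):

```
PUT /users/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
```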

TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.

Maybe this will help as well: Tune for search speed.

Thank you for the tip about sharding. Can you please confirm whether the following is true, though?

If shards get queried in parallel threads inside a data node, then having 4 data nodes, each with 2 vCPUs, means no more than 8 shards in the cluster can be searched at the same time.

So is there any benefit in having more than 8 shards, or would that just be oversharding? I know that more than 20 shards per GB of heap is oversharding (each node here has 16 GB RAM, so I guess it's fine), but what is oversharding from a CPU / thread perspective?

Some additional context would be useful.

Which version of Elasticsearch are you using?

What is the size of the index you are querying?

What is the size of other data held on the nodes?

Are you adding, deleting or updating data in the index while you are querying?

What is the specification of the nodes in terms of CPU, RAM, heap size and type of storage used?

Is there any other load on the cluster at the same time you are running your queries?

More shards allow more work to be done in parallel but at the same time adds overhead in terms of coordination, so increasing shards may not necessarily improve performance.

I wrote the blog post Kevin linked to, and it provides a guideline for the maximum number of shards to hold on a node in older versions of Elasticsearch. It was created because a lot of users with index-heavy use cases using time-based indices ended up extremely oversharded, which causes stability and performance problems. It does not say that you should aim to have a certain number of shards per node.

Thank you for your reply.

I'm using version 7.10 (unfortunately I cannot upgrade).

The index's total size is 5.4 GB, with a primaries size of 2.7 GB. It has about 60 million documents.

Currently it's my only index, but if this works out, I will be adding more indices of similar size.

Unfortunately there are always write / delete operations on the indices, so I need to be able to query the index while this happens in the background.

The nodes have 2 vCPUs and 16 GB RAM, using NVMe SSD storage. I have 3 dedicated master nodes & 4 data nodes.

My search queries & the updates to the index are the only load on this cluster.

I hope this helps.

What is your heap size set to?

How much is this index expected to grow over time?

Heap size is 8 GB on the data nodes and 5 GB on the master nodes.

For a search use case it is important that you benchmark and make decisions based on the full data set the cluster will be required to hold and the full query load it must support, as this will affect latencies and settings.

2.7GB in primary shards is not very large in Elasticsearch terms so I would recommend changing the index to a single primary shard and initially 3 replicas so that all nodes hold a copy of the shard. As each node will at this point hold around 2.7GB of data it should fit in the operating system page cache, which Elasticsearch relies on for performance.
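Since the primary shard count can't be changed in place, one way to get from 5 primaries down to 1 on 7.x is the _shrink API. A minimal sketch, with users_shrunk and data-node-1 as placeholder names; the source index first needs a write block and a copy of every shard on one node:

```
PUT /users/_settings
{
  "index.routing.allocation.require._name": "data-node-1",
  "index.blocks.write": true
}

POST /users/_shrink/users_shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 3,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}
```

Reindexing into a freshly created 1-shard index works too if taking a write block on the live index is not acceptable.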

As you run queries the data should get cached and latencies drop. Each query will be served by a single thread on a single data node, as you have 1 primary shard. This will give you a baseline for performance. As only one thread is busy with each query, this will allow you to serve a reasonable number of concurrent queries. If you are expecting a large number of concurrent queries, having a single primary shard per index is likely optimal unless the indices grow too large, especially as you have few CPU cores per node.

As you add indices and data to the cluster, the data per node will at some point exceed the size of the operating system page cache, which will result in more disk I/O. You seem to have fast disks, which is good, but it is still generally slower than reading from the page cache.

At this point there are a few things you can do. It is generally recommended to use the smallest heap you can get away with without causing long or frequent GC, so if your heap is not strained (check monitoring) you may reduce it on the data nodes, which will give more space to the page cache. You may also reduce the number of replicas per index from 3 to 1. This means only 2 nodes hold a copy of each shard, but you can now fit twice the amount of data into the cluster. This may also lead to improved cache hit ratio.

If, on the other hand, you are only expecting occasional queries and do not expect high concurrency (or if query latencies are still too high even with all data in the cache), a higher number of primary shards may be appropriate, especially as you are using fuzziness, which can be CPU intensive and add latency. In this case I would recommend slowly increasing the primary shard count and benchmarking at every step to see what is optimal.

While you do this, also use the Profile API to profile your queries. I would not be surprised if the fuzziness is adding significantly to the latency seen.


The difference between 40ms and 100ms is basically imperceptible to an end-user so my guess is you have some other reason why it needs to be that low. Once you start getting sub 100ms in your efforts to increase speed, you're going to have to really look at the specs of your nodes as well. Also make sure you are truly profiling just the time it takes the cluster to give results and that you aren't also adding network lag. I say this because I've made that mistake in the past.

You can get sub-100 ms, but you're going to want to use NVMe drives with a high IOPS rating, a large amount of RAM for page caching, etc.

You're getting into response times where hardware specs play a huge role. It is one thing to have a 20 second response time and wanting to get below 1 second but a different matter entirely when you request to go from 100-200 ms to 40ms.


I just got it to return responses in about 70 ms by removing fuzziness in the second should clause. But now I need to keep at it until it gives me an average response time of 40 ms.

I think I will be revising my instances as well. They do use NVMes already; I think having 8 vCPUs in total in the cluster is a bit low, though.

I'll also be playing around with the shards. I tried 4 primaries with replicas = 1. I will be trying an index with 1 primary shard and 3 replicas to see if performance will be better.

Also, every use case is different. I've found when we index Reddit and other large social media platforms that it really does pay dividends to just experiment with different settings in a dev environment. For instance, there are pros and cons to reducing / increasing the number of shards per index -- so you may want to experiment there and try to keep the number of variables that change between each test to a minimum.

Also, you might "theoretically" achieve sub 40ms queries in your use case -- but it might not hold up in production where demands are going to be radically different from your testing environment. One thing I did was develop a way to duplicate production requests and funnel them into a dev environment so I could truly gauge how changes would affect production. If you have some type of SLA where you need the queries to be under X ms, it would probably be worth it for you to develop a mechanism where you can divert a copy of production queries into your dev / testing environment to see if your < 40ms design holds up under production demands.

Good luck!

I am thankful to everyone who gave me advice on this thread!

I was able to reach 20 ms - 40 ms times (these include network time) by:

  • stopping continuous indexing and limiting it to an hourly basis
  • disabling auto refresh of the index and manually triggering it after a bulk index
  • configuring the index with 1 primary shard & 3 replicas (since I have 4 data nodes)
  • setting fuzziness to 'auto:4,8' as opposed to having it always equal to 1 or 2

Hope this helps anyone who runs into something similar.
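For anyone implementing the same refresh changes, the relevant calls look roughly like this (index name users assumed):

```
PUT /users/_settings
{
  "index": {
    "refresh_interval": "-1"
  }
}

POST /users/_bulk
{ "index": { "_id": "1" } }
{ "name": "John Doe" }

POST /users/_refresh
```

And on the fuzziness setting: with "fuzziness": "auto:4,8", terms shorter than 4 characters must match exactly, terms of 4-7 characters allow one edit, and terms of 8 or more allow two, which avoids expensive fuzzy expansion on very short terms.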

Congrats on hitting your target! I have a question though about your level of replication.

Did you try one replica and then two before going to three? If so, were your search times worse with just one or two levels of replication? I know replication allows for more concurrent searches against the same indexes because other nodes can run searches against replica shards which should theoretically give more performance under heavy loads, but was this necessary to reach your stated times / goal?

3 replicas per primary shard is a high level of redundancy but also incurs a storage penalty of requiring around twice the amount of space compared to having just one replica configured. Then again, your cluster can survive basically over half of its nodes going down and still be able to search 100% of the data which is a very high level of redundancy.

Anyway, kudos! It is always cool to hear from people who set ambitious goals with ES and find a way to realize those goals in production. I think everyone here who has done work in devops knows very well that the theoretical models we create in dev seldom translate to real-world performance in production. :wink:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.