Identifying/fixing latency (95th percentile between 1-4.5 seconds)

jader201 · July 17, 2019, 4:33am

We have had Elasticsearch [1] running in a production environment for a while, and started doing some analysis on performance. Unfortunately, our performance seems to be very hit or miss.

Our 50th percentile is usually around 500ms, but from there, it gets more erratic, with 90th ranging from 1-3 seconds, 95th from 1-4.5 seconds, and 99th all over the place, from 2 to almost 20 seconds! And as far as load, it's usually between 0.5 and 2 requests per second, so nothing ridiculous.

The cluster is made up of 3 nodes (deployed on Kubernetes stateful sets) with 64GB memory & 500GB storage each. The configuration is 3 shards, 2 replicas, and 9 indexes. Each index varies in number of docs and width, but here's the breakdown:

index	pri	rep	docs.count	docs.deleted	store.size	pri.store.size
1	3	2	501	0	4mb	1.3mb
2	3	2	517752	2413	1.8gb	630.9mb
3	3	2	6755	107	23.5mb	7.9mb
4	3	2	201879	60	749.7mb	251.1mb
5	3	2	50	40	3mb	1mb
6	3	2	1457952	10796	3.7gb	1.2gb
7	3	2	3012	2	8.1mb	2.7mb
8	3	2	82035	14303	949mb	309.2mb
9	3	2	3211492	123674	5.8gb	1.9gb

We have an ngram analyzer/filter that used to have min_gram: 2 and max_gram: 12, but I reduced max_gram to 6 in hopes that that had something to do with our latency, but it didn't seem to help much (this happened n-6 days on the graph above).

A lot of our requests are querying across _all indexes, but we also have requests querying individual indexes with similar performance. In spot tests, I'm seeing took come back ranging between a few ms, to over 1 second with the same query, within a few seconds of each other. I'm seeing this both on _all and individual indexes.

So, my questions:

Is what I'm seeing above seem pretty normal, or does something look off?
If something is off, where should I start? Node/shard/replica configuration? How data is getting indexed (i.e. ngram)? Node CPU/memory allocation?
Is there something I can do to further instrument (or maybe already instrumented that I could dig more into) that would help explain what's going on?
What other information can I provide here that would be helpful?

[1]

version: {
    number: "6.4.2",
    build_flavor: "default",
    build_type: "tar",
    build_hash: "04711c2",
    build_date: "2018-09-26T13:34:09.098244Z",
    build_snapshot: false,
    lucene_version: "7.4.0",
    minimum_wire_compatibility_version: "5.6.0",
    minimum_index_compatibility_version: "5.0.0"
}

Christian_Dahlqvist · July 17, 2019, 5:53am

It seems you are updating and or deleting data. Do the spikes correlate with indexing load?

jader201 · July 17, 2019, 4:03pm

I broke this down by hours over the past 7 days, and then overlayed the relative number of documents that are being indexed during each hour.

While there is possibly some correlation between the small spikes every other hour, the larger spikes in latency usually do not correlate with indexing. Particularly, if you look between 7/15 and 7/17, there are several big spikes in latency, but the indexing remains consistent. In fact, between 7/16 and 7/17, there was a bug the prevented a lot of documents getting indexed, and search performance was still very spiky during this time.

Conversely, if you look at one of the really big spike in indexing around 7/12, there was a small correlation to moderate latency spiking, but not near the extent seen in 7/16 or 7/17. The other big index spike on 7/11 has a bit closer correlation, with a decent spike in latency during that time, so it seems like indexing can affect latency, but it seems like some of the more dramatic spikes with latency is not correlated to indexing.

Christian_Dahlqvist · July 17, 2019, 4:24pm

Anything around long GC or merging in the logs that correlates/

jader201 · July 17, 2019, 9:30pm

Thanks for the lead.

Looking over the logs (unfortunately, just the last ~24 hours), I am seeing a lot of [young] entries (about 180 over the span of an hour) around the time of spikes. Most of the durations are between 700-1000ms (avg = 743ms). If it matters, almost all of the are [INFO] level severity, with about 4 [WARN] level severities.

After that ~1 hour window, there are almost no [young] GC entries for the next several hours. There are a total of about 9 [young] entries, and those could also possibly tie to other spikes. But there are also windows of what I consider relatively high 90th/95th latencies that are missing [young] entries.

Does this sound like it could be the cause? If so, what can I do about it?

jader201 · July 19, 2019, 4:26pm

Wanted to bump this, in case @Christian_Dahlqvist or any others have some tips on what I can do further to isolate the latency.

Also, does anyone know if ES offers one-time incident support outside of their subscription model? I'm assuming not, since I can't find any mentions of support outside of their subscription model, but thought I wouldn't hurt to ask here.

We need to figure out this latency, as search times even over 1 second aren't a great UX, let alone 5-20 seconds. And according to our metrics, 10+% of our requests are 1 second or more.

Christian_Dahlqvist · July 20, 2019, 8:34am

Another aspect to consider is also the queries. How uniform and controlled are they? Is there any chance some queries could include something more expensive, e.g. a leading or trailing wildcard clause supplied by a user? Have you set up the slow log to capture the slow queries?

jader201 · July 20, 2019, 3:03pm

Thanks again for your help, @Christian_Dahlqvist.

I have a job set up to track the latency of a single consistent query every 60 seconds, and results can definitely be mixed, though not sure N is big enough yet to correlate anything. Also, looking over our logs, almost all searches are a simple search term (no wildcards), so seems like most queries are pretty basic. So that leads me to think it's not specific to longer queries, but would like to rule this out by capturing more N.

And thanks for the tip on slow log -- I'll definitely look into setting this up, and report back.

Christian_Dahlqvist · July 20, 2019, 3:17pm

These indices are very small. Have you tried setting them up with a single primary shard instead of 3?

jader201 · July 26, 2019, 3:47pm

@Christian_Dahlqvist just another note of thanks for your guidance. It's been helpful, and I'm starting to get a clearer picture of what's going on thanks to your tips.

So I moved from 3 shard/2 replica configuration to 1 shard/2 replica config, and things are indeed looking a little better. But still seeing 99th between 1.5-3 seconds.

Interestingly enough, I also set up slow logging, and have found that there are some discrepancies between our request latency and what's showing up in slow logs, suggesting the bottleneck is somewhere else (either network or other work being done beyond querying).

So I'm setting up more tracing on our endpoints to continue to hone in on the source of the latency. But feel like we're getting closer.

Thanks again!

system · August 23, 2019, 3:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Search latency & index latency for elasticsearch Elasticsearch	2	6252	November 5, 2018
Running into Elasticsearch high search latency 5-10s issue in production Elasticsearch	13	3341	July 5, 2017
Elasticsearch high latency Elasticsearch	16	4729	June 8, 2023
Gradual increase in latency Elasticsearch	4	545	July 5, 2017
ES performance reliablity issue Elasticsearch	7	542	October 9, 2020

Identifying/fixing latency (95th percentile between 1-4.5 seconds)

Related topics