Identifying/fixing latency (95th percentile between 1-4.5 seconds)

We have had Elasticsearch [1] running in a production environment for a while, and started doing some analysis on performance. Unfortunately, our performance seems to be very hit or miss.

Our 50th percentile is usually around 500ms, but from there, it gets more erratic, with 90th ranging from 1-3 seconds, 95th from 1-4.5 seconds, and 99th all over the place, from 2 to almost 20 seconds! And as far as load, it's usually between 0.5 and 2 requests per second, so nothing ridiculous.

The cluster is made up of 3 nodes (deployed on Kubernetes stateful sets) with 64GB memory & 500GB storage each. The configuration is 3 shards, 2 replicas, and 9 indexes. Each index varies in number of docs and width, but here's the breakdown:

index pri rep docs.count docs.deleted store.size pri.store.size
1 3 2 501 0 4mb 1.3mb
2 3 2 517752 2413 1.8gb 630.9mb
3 3 2 6755 107 23.5mb 7.9mb
4 3 2 201879 60 749.7mb 251.1mb
5 3 2 50 40 3mb 1mb
6 3 2 1457952 10796 3.7gb 1.2gb
7 3 2 3012 2 8.1mb 2.7mb
8 3 2 82035 14303 949mb 309.2mb
9 3 2 3211492 123674 5.8gb 1.9gb

We have an ngram analyzer/filter that used to have min_gram: 2 and max_gram: 12, but I reduced max_gram to 6 in hopes that that had something to do with our latency, but it didn't seem to help much (this happened n-6 days on the graph above).

A lot of our requests are querying across _all indexes, but we also have requests querying individual indexes with similar performance. In spot tests, I'm seeing took come back ranging between a few ms, to over 1 second with the same query, within a few seconds of each other. I'm seeing this both on _all and individual indexes.

So, my questions:

  1. Is what I'm seeing above seem pretty normal, or does something look off?

  2. If something is off, where should I start? Node/shard/replica configuration? How data is getting indexed (i.e. ngram)? Node CPU/memory allocation?

  3. Is there something I can do to further instrument (or maybe already instrumented that I could dig more into) that would help explain what's going on?

  4. What other information can I provide here that would be helpful?

[1]

version: {
    number: "6.4.2",
    build_flavor: "default",
    build_type: "tar",
    build_hash: "04711c2",
    build_date: "2018-09-26T13:34:09.098244Z",
    build_snapshot: false,
    lucene_version: "7.4.0",
    minimum_wire_compatibility_version: "5.6.0",
    minimum_index_compatibility_version: "5.0.0"
}

It seems you are updating and or deleting data. Do the spikes correlate with indexing load?

I broke this down by hours over the past 7 days, and then overlayed the relative number of documents that are being indexed during each hour.

While there is possibly some correlation between the small spikes every other hour, the larger spikes in latency usually do not correlate with indexing. Particularly, if you look between 7/15 and 7/17, there are several big spikes in latency, but the indexing remains consistent. In fact, between 7/16 and 7/17, there was a bug the prevented a lot of documents getting indexed, and search performance was still very spiky during this time.

Conversely, if you look at one of the really big spike in indexing around 7/12, there was a small correlation to moderate latency spiking, but not near the extent seen in 7/16 or 7/17. The other big index spike on 7/11 has a bit closer correlation, with a decent spike in latency during that time, so it seems like indexing can affect latency, but it seems like some of the more dramatic spikes with latency is not correlated to indexing.

Anything around long GC or merging in the logs that correlates/

Thanks for the lead.

Looking over the logs (unfortunately, just the last ~24 hours), I am seeing a lot of [young] entries (about 180 over the span of an hour) around the time of spikes. Most of the durations are between 700-1000ms (avg = 743ms). If it matters, almost all of the are [INFO] level severity, with about 4 [WARN] level severities.

After that ~1 hour window, there are almost no [young] GC entries for the next several hours. There are a total of about 9 [young] entries, and those could also possibly tie to other spikes. But there are also windows of what I consider relatively high 90th/95th latencies that are missing [young] entries.

Does this sound like it could be the cause? If so, what can I do about it?

Wanted to bump this, in case @Christian_Dahlqvist or any others have some tips on what I can do further to isolate the latency.

Also, does anyone know if ES offers one-time incident support outside of their subscription model? I'm assuming not, since I can't find any mentions of support outside of their subscription model, but thought I wouldn't hurt to ask here.

We need to figure out this latency, as search times even over 1 second aren't a great UX, let alone 5-20 seconds. And according to our metrics, 10+% of our requests are 1 second or more.

Another aspect to consider is also the queries. How uniform and controlled are they? Is there any chance some queries could include something more expensive, e.g. a leading or trailing wildcard clause supplied by a user? Have you set up the slow log to capture the slow queries?

Thanks again for your help, @Christian_Dahlqvist.

I have a job set up to track the latency of a single consistent query every 60 seconds, and results can definitely be mixed, though not sure N is big enough yet to correlate anything. Also, looking over our logs, almost all searches are a simple search term (no wildcards), so seems like most queries are pretty basic. So that leads me to think it's not specific to longer queries, but would like to rule this out by capturing more N.

And thanks for the tip on slow log -- I'll definitely look into setting this up, and report back.

These indices are very small. Have you tried setting them up with a single primary shard instead of 3?

@Christian_Dahlqvist just another note of thanks for your guidance. It's been helpful, and I'm starting to get a clearer picture of what's going on thanks to your tips.

So I moved from 3 shard/2 replica configuration to 1 shard/2 replica config, and things are indeed looking a little better. But still seeing 99th between 1.5-3 seconds.

Interestingly enough, I also set up slow logging, and have found that there are some discrepancies between our request latency and what's showing up in slow logs, suggesting the bottleneck is somewhere else (either network or other work being done beyond querying).

So I'm setting up more tracing on our endpoints to continue to hone in on the source of the latency. But feel like we're getting closer.

Thanks again!