Seamingly sporadic increases in indexing latency

rex-remind · January 14, 2021, 6:19pm

We have a setup where Flink is writing to Elasticsearch, and at sporadic moments latency increases due to volume of IOPS. Looking for suggestions.

Attached are some graphs, when IOPS are high, indexing latency is high, and the backpressure means Flink sends far less bulk indexing requests. This in turn dramatically drops our indexing rate. It would appear that certain requests require a lot more IOPS. What suggestions are there to tune or adjust hardware for this scenario?

Thanks

Marked times ~7:00 to demonstrate that lower IOPS == low latency == more requests == dramatically higher indexing rate

warkolm · January 14, 2021, 11:54pm

Is there anything in your Elasticsearch logs at the time you are seeing this?
Can you also check hot_threads?

rex-remind · January 15, 2021, 1:01am

Body limit is too high to post hot_threads. How should I best interpret what I have, or is there a subsection I should post?

warkolm · January 15, 2021, 1:02am

Just use gist/pastebin/etc and link here.

rex-remind · January 15, 2021, 1:08am

gist.github.com

https://gist.github.com/rex-remind101/b6b1a8d1b82d601d343203b0ea44add2

gistfile1.txt

# Attempt 1

[hadoop@ip-10-128-4-195 ~]$ curl <host>/_nodes/hot_threads
::: {741248536f8a624a46358affbf3940d9}{lunuJX_4QM-QSGxUuAK6aQ}{K6N5Dqy9TUOjP5v5nWfonw}{x.x.x.x}{x.x.x.x:9300}{dir}{distributed_snapshot_deletion_enabled=true, zone=us-east-1d, cross_cluster_transport_address=x.x.x.x, }
   Hot threads at 2021-01-15T00:58:24.674Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   76.1% (380.4ms out of 500ms) cpu usage by thread 'elasticsearch[741248536f8a624a46358affbf3940d9][refresh][T#2]'
     5/10 snapshots sharing following 36 elements
       app//org.apache.lucene.index.PointValuesWriter$1.intersect(PointValuesWriter.java:89)
       app//org.apache.lucene.index.PointValuesWriter$MutableSortingPointValues.intersect(PointValuesWriter.java:203)

This file has been truncated. show original

warkolm · January 15, 2021, 1:11am

You've got fair bit of activity on refresh threads. What sort of storage is this on?

Also, what's app//[AMAZON INTERNAL] about?

rex-remind · January 15, 2021, 1:21am

AWS - EBS gp2 volumes. SSD.
Guessing amazon has mixed in some special sauce.

warkolm · January 15, 2021, 1:22am

Is this the aws service?

rex-remind · January 15, 2021, 1:24am

refresh interval is set to 1s because plan is to soon launch to users and we need low latency searches. It is expected to have similar read and write volume since users will search, and then perform some action that will then change the what's in ES.

rex-remind · January 15, 2021, 1:24am

Yes, it is AWS service.

warkolm · January 15, 2021, 1:26am

Ok thanks. You will need to ask the then sorry, they run a custom fork and we don't know what changes they've made around how they provide it.

rex-remind · January 15, 2021, 1:27am

Hmm. The fact that you immediately pointed me to hot_threads and they did not when I asked for help gives me less confidence in them, but much in you. Assuming this wasn't AWS, what path do you think I should follow?

warkolm · January 15, 2021, 1:29am

I would change my storage type and see if that helps.

You can also try out https://www.elastic.co/cloud/ and see if that performs better. We provide a fair bit more on top of the core Elasticsearch product as well.

rex-remind · January 15, 2021, 1:30am

Thanks

Christian_Dahlqvist · January 15, 2021, 8:14am

I would not be surprised if those patterns were caused by occasional larger merges, which can be I/O intensive.

rex-remind · January 15, 2021, 6:31pm

We now believe latency increases are related to the garbage collector collecting garbage (at least this appears to be 1 significant factor). We're on ES 7.8, anything recommendations?

rex-remind · January 15, 2021, 8:16pm

May be jumping the gun on garbage collection. The graphs line up in some places but not others. Are there any good recommendations for dealing with large merges?

warkolm · January 15, 2021, 8:26pm

Faster storage is really the only option.

rex-remind · January 15, 2021, 10:10pm

Ok, will definitely go with that route.

Just noticed something else though, looking at latency metrics per node, only 2 of our nodes have high latency, and cpu is higher on those 2 nodes (but wouldn't say high). Our Flink job has a parallelism of 2. I assumed ES load balances write bulk requests, is this not the case?

Shard allocation between shards looks relatively even, though two shards are slightly larger than the rest

<index>       4     r      STARTED 30349634  8.5gb x.x.x.x  aff85c512e914917c29faa78df1c6831
<index>       4     p      STARTED 30349634  8.3gb x.x.x.x  55d01cfcbccb60959a00b0ad437a9b36
<index>       1     r      STARTED 30348482  8.6gb x.x.x.x   f17701275c28068a719b2be5e29e5d06
<index>       1     p      STARTED 30348482  8.5gb x.x.x.x  55d01cfcbccb60959a00b0ad437a9b36
<index>       3     p      STARTED 30345095 10.4gb x.x.x.x  d21c192794d2d51cbb2b5951fdcb18fe
<index>       3     r      STARTED 30345095 10.9gb x.x.x.x 6cf4a3474c179f2dc2721d120eb380a3
<index>       2     r      STARTED 30340807  9.9gb x.x.x.x   f17701275c28068a719b2be5e29e5d06
<index>       2     p      STARTED 30340807  9.9gb x.x.x.x 6cf4a3474c179f2dc2721d120eb380a3
<index>       0     p      STARTED 30344671    8gb x.x.x.x c802556d0108779bb85051acac16b771
<index>       0     r      STARTED 30344671  8.2gb x.x.x.x  d21c192794d2d51cbb2b5951fdcb18fe

warkolm · January 16, 2021, 5:21am

Once the node that gets it parses the request, it sends data to each relevant shard in the cluster. It doesn't take (eg) half of the request and pass it to another node straight away.

Topic		Replies	Views
Huge performance degradation during bulk indexing Elasticsearch	8	3483	May 16, 2019
Tune for indexing speed Elasticsearch	11	1551	January 2, 2023
Elasticsearch 7.17.10 indexing bottleneck on i3.2xlarge and d3.2xlarge nodes in EKS Elasticsearch	53	1590	June 22, 2023
Running into Elasticsearch high search latency 5-10s issue in production Elasticsearch	13	3341	July 5, 2017
Identifying/fixing latency (95th percentile between 1-4.5 seconds) Elasticsearch	10	1588	August 23, 2019

Seamingly sporadic increases in indexing latency

Related topics