Elasticsearch: Increasing Write Rejections

Hello,
I have a 20-node Platinum-licensed ELK cluster (version 7.6.1).

The setup is as follows:

16 data nodes --> 64 GB memory each // 32 GB allocated as JVM heap (Xms = Xmx) --> 8 TB disk each
3 master nodes --> 32 GB memory each
1 ML node --> 64 GB memory

My problem: even though I have gone through most of the index tuning advisories and thread pool write rejection cases, my rejections have been increasing for a while; two nodes have nearly a 5% rejection rate.

[screenshot: thread pool stats]

My indexing rates are as shown below. The spikes are huge, but I ingest data from databases, syslog, and log files, forwarded directly to Elasticsearch via Fluent Bit or Logstash from various other services, and all sources behave differently.

My most-used index template is below, so most of my indexes have 4 primary shards and 1 replica.
The refresh interval might be better at a higher value, I guess.

> {
>   "main_temp" : {
>     "order" : 0,
>     "index_patterns" : [
>       "csm_siem_record_*",
>       "csm_siem_restored_*"
>     ],
>     "settings" : {
>       "index" : {
>         "codec" : "best_compression",
>         "refresh_interval" : "5s",
>         "analysis" : {
>           "filter" : {
>             "ntk_asciifolding" : {
>               "type" : "asciifolding",
>               "preserve_original" : "false"
>             },
>             "ntk_turkce_lowercase" : {
>               "type" : "lowercase",
>               "language" : "turkish"
>             }
>           },
>           "analyzer" : {
>             "raw_log" : {
>               "filter" : [
>                 "ntk_turkce_lowercase",
>                 "ntk_asciifolding"
>               ],
>               "type" : "custom",
>               "tokenizer" : "classic"
>             }
>           }
>         },
>         "number_of_shards" : "4",
>         "auto_expand_replicas" : "0-1",
>         "query" : {
>           "default_field" : "raw_record"
>         }
>       }
>     },
>     "mappings" : {
>       "dynamic_templates" : [
>         {
>           "notanalyzed" : {
>             "mapping" : {
>               "type" : "keyword"
>             },
>             "match_mapping_type" : "string",
>             "match" : "*"
>           }
>         }
>       ],
>       "date_detection" : false,
>       "properties" : {
>         "received_bytes" : {
>           "type" : "long"
>         },
>         "destination_port" : {
>           "type" : "integer"
>         },
>         "jvbrtt" : {
>           "type" : "short"
>         },
>         "sent_packets" : {
>           "type" : "long"
>         },
>         "responseData" : {
>           "ignore_above" : 8191,
>           "type" : "keyword"
>         },
>         "translated_source_ip" : {
>           "type" : "ip"
>         },
>         "raw_record" : {
>           "analyzer" : "raw_log",
>           "type" : "text"
>         },
>         "packets" : {
>           "type" : "long"
>         },
>         "session_time" : {
>           "type" : "long"
>         },
>         "source_ip" : {
>           "type" : "ip"
>         },
>         "sent_bytes" : {
>           "type" : "long"
>         },
>         "download" : {
>           "type" : "short"
>         },
>         "duration" : {
>           "type" : "long"
>         },
>         "destination_ip" : {
>           "type" : "ip"
>         },
>         "translated_destination_ip" : {
>           "type" : "ip"
>         },
>         "translated_source_port" : {
>           "type" : "integer"
>         },
>         "date_time" : {
>           "format" : "strict_date_optional_time||epoch_millis",
>           "type" : "date"
>         },
>         "translated_destination_port" : {
>           "type" : "integer"
>         },
>         "source_port" : {
>           "type" : "integer"
>         },
>         "connectionquality" : {
>           "type" : "float"
>         },
>         "responsetime" : {
>           "type" : "keyword"
>         },
>         "issuccessful" : {
>           "type" : "keyword"
>         },
>         "requestData" : {
>           "ignore_above" : 8191,
>           "type" : "keyword"
>         },
>         "received_packets" : {
>           "type" : "long"
>         },
>         "severity" : {
>           "type" : "integer"
>         },
>         "responsecode" : {
>           "type" : "keyword"
>         },
>         "coordinate" : {
>           "type" : "geo_point"
>         },
>         "destination_ip_coordinate" : {
>           "type" : "geo_point"
>         },
>         "count" : {
>           "type" : "long"
>         },
>         "maxframeheight" : {
>           "type" : "integer"
>         },
>         "translated_ip" : {
>           "type" : "ip"
>         },
>         "upload" : {
>           "type" : "short"
>         },
>         "resp_time" : {
>           "type" : "integer"
>         },
>         "message" : {
>           "analyzer" : "raw_log",
>           "type" : "text"
>         },
>         "isservererror" : {
>           "type" : "keyword"
>         },
>         "sign_time" : {
>           "format" : "strict_date_optional_time||epoch_millis",
>           "type" : "date"
>         },
>         "device_ip" : {
>           "type" : "ip"
>         },
>         "source_ip_coordinate" : {
>           "type" : "geo_point"
>         },
>         "upload__1" : {
>           "type" : "short"
>         },
>         "download__1" : {
>           "type" : "short"
>         },
>         "bytes" : {
>           "type" : "long"
>         },
>         "starttime" : {
>           "type" : "keyword"
>         },
>         "lastn" : {
>           "type" : "short"
>         },
>         "email_subject" : {
>           "analyzer" : "raw_log",
>           "type" : "text"
>         }
>       }
>     },
>     "aliases" : { }
>   }
> }
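For what it's worth, refresh_interval is a dynamic setting, so it can also be raised on already-created indices without touching the template. A sketch of the request (the index name is just an example):

```
PUT csm_siem_record_2020.06.01/_settings
{
  "index": {
    "refresh_interval": "15s"
  }
}
```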

I would be really glad to hear any further recommendations.

Hi @Tombal

As a Platinum customer you are entitled to support on these kinds of issues.

Have you filed a support ticket? Elastic Support is not just break/fix; they can certainly take a look at this with you, help diagnose, and make suggestions.

That would be my first suggestion.


Also, a quick note: you say you set the JVM heap to 32 GB.
That is actually not optimal; the heap needs to be set somewhere between 26 and 30 GB, otherwise you are no longer using compressed object pointers, which will actually make the system less performant.
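In case you want to verify, whether compressed oops are actually in use can be checked per node via the nodes info API:

```
GET _nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers
```

Each node should report "true" there once the heap is back under the threshold.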

See here

I'm not saying that's the source of your problem, but it's certainly not optimal.

Thanks Stephen, I will have a look.

There have been improvements in this area in recent versions so it might be a good idea to upgrade.

This is also on my schedule... I need to make sure every job currently running is fully compatible with newer versions.

By the way, I made some adjustments on the worst-affected nodes, and after a final restart and adding the index_buffer_size parameter, the cluster went green within 4 hours. Now the rejection rate is 0.01%, even lower than 1 in 10,000. That is becoming more acceptable.

Raising the queue size beyond 500 didn't make things any better for me, so I lowered it to 400.
The buffer size parameter wasn't set before, so it defaulted to 10%, I guess.

thread_pool.write.queue_size: 400
indices.memory.index_buffer_size: 40%
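For anyone tracking the same issue, the per-node rejection counters can be watched with the cat thread pool API, e.g.:

```
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed
```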

I hope this helps some other people...

That's not really a good idea. Take a look at Any idea what these errors mean version 2.4.2 - #5 by jasontedor for a good overview of why

You are right, the higher queue makes it worse. It doesn't fix the problem; it is still there.
So I lowered it, and maybe I'll remove the setting as well. I'm trying every possible combination.

My other data nodes don't have this setting in elasticsearch.yml.
I would be glad to get further advice about any other handy settings.

Thanks Warkolm...

Just to give some brief info in case it helps others:

My final numbers at the moment:


thread_pool.write.queue_size: 250
indices.memory.index_buffer_size: 30%

(higher percentages like 40% or 50% seemed to lag the servers a bit and didn't make much difference for me; Kibana started to lose connection with the nodes. I'm not sure if it is related, but _cat/nodes started to take over 1.5 minutes)

Index template and active index changes:

increased "index.refresh_interval" from 5s to 15s; some of my data has to be searchable close to real time.

added

"index.translog.durability" : "async"

(the default is to fsync and commit after every request; with async it fsyncs every 5 seconds by default, if I'm not mistaken. I'll watch it for a while and may change it to 10 seconds later)
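Since durability is a dynamic setting, it can be applied to already-open indices too, along these lines (the index pattern is just an example):

```
PUT csm_siem_record_*/_settings
{
  "index.translog.durability": "async"
}
```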

I also changed one index template for daily indexes that don't hold much data.
My default was "number_of_shards": 4 with "auto_expand_replicas": "0-1"; on these I lowered it to "number_of_shards": 3.

My retention period varies by index: either 15 days, 2 months, or up to 2 years (month-based indexes). I lowered the shard count for the near-2-year indices with the Shrink API.
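For reference, a shrink roughly follows the usual pattern: block writes, collocate the shards on one node, then shrink to a factor of the original shard count. A sketch (index and node names are examples only):

```
PUT csm_siem_record_2019.01/_settings
{
  "index.routing.allocation.require._name": "data-node-01",
  "index.blocks.write": true
}

POST csm_siem_record_2019.01/_shrink/csm_siem_record_2019.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null
  }
}
```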

@Tombal

Thanks for sharing your updates

Perhaps another setting you can look at is this: increasing flush_threshold_size helped on a heavy-write use case I had.

My settings on this particular index (Windows event logs) include:

    },
    "refresh_interval": "30s",
    "number_of_shards": "4",
    "translog": {
      "flush_threshold_size": "2gb"
    },
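If it helps, flush_threshold_size is dynamic too, so it can be trialled on existing indices without a template change; the index pattern below is illustrative:

```
PUT csm_siem_record_*/_settings
{
  "index.translog.flush_threshold_size": "1gb"
}
```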

Thanks Steph...
I'll slightly increase refresh_interval and flush_threshold_size step by step to see the difference (starting with 20s and 1gb).

My current rates at the moment:
there is only one node with a rejection rate of 0.2%; the other nodes are either fine or at around 1 in 100,000. This rejecting node holds shards of some heavily indexed indices. So I guess I'm nearly done with my problems; some slight improvements should make it OK.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.