Elasticsearch: Increasing Write Rejections

Hello,
I have a 20-node Platinum-licensed ELK cluster (version 7.6.1).

The setup is as follows:

16 data nodes --> 64 GB memory each // 32 GB allocated as JVM heap (Xms = Xmx) --> 8 TB disk each
3 master nodes --> 32 GB memory each
1 ML node --> 64 GB memory

My problem: even though I have gone through most of the index tuning advisories and thread pool write rejection cases, my rejections have been increasing for a while; two nodes have nearly a 5% rejection rate.

[screenshot: thread pool stats]

My indexing rates are as shown below. The spikes are huge, but I ingest data from databases, syslog, and log files, forwarded directly to Elasticsearch via Fluent Bit or Logstash from various other services, and all sources behave differently.

My most-used index template is below, so most of my indexes have 4 primary shards and 1 replica.
The refresh interval might be better at a higher value, I guess.

> {
>   "main_temp" : {
>     "order" : 0,
>     "index_patterns" : [
>       "csm_siem_record_*",
>       "csm_siem_restored_*"
>     ],
>     "settings" : {
>       "index" : {
>         "codec" : "best_compression",
>         "refresh_interval" : "5s",
>         "analysis" : {
>           "filter" : {
>             "ntk_asciifolding" : {
>               "type" : "asciifolding",
>               "preserve_original" : "false"
>             },
>             "ntk_turkce_lowercase" : {
>               "type" : "lowercase",
>               "language" : "turkish"
>             }
>           },
>           "analyzer" : {
>             "raw_log" : {
>               "filter" : [
>                 "ntk_turkce_lowercase",
>                 "ntk_asciifolding"
>               ],
>               "type" : "custom",
>               "tokenizer" : "classic"
>             }
>           }
>         },
>         "number_of_shards" : "4",
>         "auto_expand_replicas" : "0-1",
>         "query" : {
>           "default_field" : "raw_record"
>         }
>       }
>     },
>     "mappings" : {
>       "dynamic_templates" : [
>         {
>           "notanalyzed" : {
>             "mapping" : {
>               "type" : "keyword"
>             },
>             "match_mapping_type" : "string",
>             "match" : "*"
>           }
>         }
>       ],
>       "date_detection" : false,
>       "properties" : {
>         "received_bytes" : {
>           "type" : "long"
>         },
>         "destination_port" : {
>           "type" : "integer"
>         },
>         "jvbrtt" : {
>           "type" : "short"
>         },
>         "sent_packets" : {
>           "type" : "long"
>         },
>         "responseData" : {
>           "ignore_above" : 8191,
>           "type" : "keyword"
>         },
>         "translated_source_ip" : {
>           "type" : "ip"
>         },
>         "raw_record" : {
>           "analyzer" : "raw_log",
>           "type" : "text"
>         },
>         "packets" : {
>           "type" : "long"
>         },
>         "session_time" : {
>           "type" : "long"
>         },
>         "source_ip" : {
>           "type" : "ip"
>         },
>         "sent_bytes" : {
>           "type" : "long"
>         },
>         "download" : {
>           "type" : "short"
>         },
>         "duration" : {
>           "type" : "long"
>         },
>         "destination_ip" : {
>           "type" : "ip"
>         },
>         "translated_destination_ip" : {
>           "type" : "ip"
>         },
>         "translated_source_port" : {
>           "type" : "integer"
>         },
>         "date_time" : {
>           "format" : "strict_date_optional_time||epoch_millis",
>           "type" : "date"
>         },
>         "translated_destination_port" : {
>           "type" : "integer"
>         },
>         "source_port" : {
>           "type" : "integer"
>         },
>         "connectionquality" : {
>           "type" : "float"
>         },
>         "responsetime" : {
>           "type" : "keyword"
>         },
>         "issuccessful" : {
>           "type" : "keyword"
>         },
>         "requestData" : {
>           "ignore_above" : 8191,
>           "type" : "keyword"
>         },
>         "received_packets" : {
>           "type" : "long"
>         },
>         "severity" : {
>           "type" : "integer"
>         },
>         "responsecode" : {
>           "type" : "keyword"
>         },
>         "coordinate" : {
>           "type" : "geo_point"
>         },
>         "destination_ip_coordinate" : {
>           "type" : "geo_point"
>         },
>         "count" : {
>           "type" : "long"
>         },
>         "maxframeheight" : {
>           "type" : "integer"
>         },
>         "translated_ip" : {
>           "type" : "ip"
>         },
>         "upload" : {
>           "type" : "short"
>         },
>         "resp_time" : {
>           "type" : "integer"
>         },
>         "message" : {
>           "analyzer" : "raw_log",
>           "type" : "text"
>         },
>         "isservererror" : {
>           "type" : "keyword"
>         },
>         "sign_time" : {
>           "format" : "strict_date_optional_time||epoch_millis",
>           "type" : "date"
>         },
>         "device_ip" : {
>           "type" : "ip"
>         },
>         "source_ip_coordinate" : {
>           "type" : "geo_point"
>         },
>         "upload__1" : {
>           "type" : "short"
>         },
>         "download__1" : {
>           "type" : "short"
>         },
>         "bytes" : {
>           "type" : "long"
>         },
>         "starttime" : {
>           "type" : "keyword"
>         },
>         "lastn" : {
>           "type" : "short"
>         },
>         "email_subject" : {
>           "analyzer" : "raw_log",
>           "type" : "text"
>         }
>       }
>     },
>     "aliases" : { }
>   }
> }
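For what it's worth, refresh_interval is a dynamic setting, so it can also be raised on already-created indices without touching the template. A sketch of the request (the index name is just an example):

```
PUT csm_siem_record_2020.06.01/_settings
{
  "index": {
    "refresh_interval": "15s"
  }
}
```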

I would be really glad to hear any further recommendations.

Hi @Tombal

As a Platinum customer you are entitled to support on these kinds of issues.

Have you filed a support ticket? Elastic Support is not just break/fix; they can certainly take a look at this with you, help diagnose, and make suggestions.

That would be my first suggestion.


Also, a quick note: you say you set the JVM heap to 32 GB.
That is actually not optimal; the heap needs to be set somewhere between 26 and 30 GB, otherwise you are no longer using compressed object pointers, which will actually make the system less performant.
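In case you want to verify, whether compressed oops are actually in use can be checked per node via the nodes info API:

```
GET _nodes/jvm?filter_path=nodes.*.jvm.using_compressed_ordinary_object_pointers
```

Each node should report "true" there once the heap is back under the threshold.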

See here

I'm not saying that's the source of your problem, but it's certainly not optimal.

Thanks Stephen, I will have a look.

There have been improvements in this area in recent versions so it might be a good idea to upgrade.

This is also on my schedule... I need to make sure every job currently running is fully compatible with newer versions.

By the way, I made some adjustments on the worst-affected nodes, and after a final restart and adding the index_buffer_size parameter, the cluster went green within 4 hours. Now the rejection rate is 0.01%, even lower than 1 in 10,000. That is becoming more acceptable.

Raising the queue size beyond 500 didn't make things any better for me, so I lowered it to 400.
The buffer size parameter wasn't set before, so it defaulted to 10%, I guess.

thread_pool.write.queue_size: 400
indices.memory.index_buffer_size: 40%
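For anyone tracking the same issue, the per-node rejection counters can be watched with the cat thread pool API, e.g.:

```
GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed
```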

I hope this helps some other people...

That's not really a good idea. Take a look at Any idea what these errors mean version 2.4.2 - #5 by jasontedor for a good overview of why

You are right, the higher queue makes it worse. It doesn't fix the problem; it is still there.
So I lowered it, and maybe I'll remove the setting as well. I'm trying every possible combination.

My other data nodes don't have this setting in elasticsearch.yml.
I would be glad to get further advice about any other handy settings.

Thanks Warkolm...

Just to give some brief info in case it helps others:

My final numbers at the moment:


thread_pool.write.queue_size: 250
indices.memory.index_buffer_size: 30%

(higher percentages like 40% or 50% seemed to lag the servers a bit and didn't make much difference for me; Kibana started to lose connection with the nodes. I'm not sure if it is related, but _cat/nodes started to take over 1.5 minutes)

Index template and active index changes:

increased "index.refresh_interval" from 5s to 15s; some of my data has to be searchable close to real time.

added

"index.translog.durability" : "async"

(the default is to fsync and commit after every request; with async it fsyncs every 5 seconds by default, if I'm not mistaken. I'll watch it for a while and may change it to 10 seconds later)
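Since durability is a dynamic setting, it can be applied to already-open indices too, along these lines (the index pattern is just an example):

```
PUT csm_siem_record_*/_settings
{
  "index.translog.durability": "async"
}
```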

I also changed one index template for daily indexes that don't hold much data.
My default was "number_of_shards": 4 with "auto_expand_replicas": "0-1"; on these I lowered it to "number_of_shards": 3.

My retention period varies by index: either 15 days, 2 months, or up to 2 years (month-based indexes). I lowered the shard count for the near-2-year indices with the Shrink API.
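For reference, a shrink roughly follows the usual pattern: block writes, collocate the shards on one node, then shrink to a factor of the original shard count. A sketch (index and node names are examples only):

```
PUT csm_siem_record_2019.01/_settings
{
  "index.routing.allocation.require._name": "data-node-01",
  "index.blocks.write": true
}

POST csm_siem_record_2019.01/_shrink/csm_siem_record_2019.01-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null
  }
}
```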

@Tombal

Thanks for sharing your updates

Perhaps another setting you can look at is this: increasing flush_threshold_size helped on a heavy-write use case I had.

My settings on this particular index (Windows event logs) include:

    },
    "refresh_interval": "30s",
    "number_of_shards": "4",
    "translog": {
      "flush_threshold_size": "2gb"
    },
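If it helps, flush_threshold_size is dynamic too, so it can be trialled on existing indices without a template change; the index pattern below is illustrative:

```
PUT csm_siem_record_*/_settings
{
  "index.translog.flush_threshold_size": "1gb"
}
```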

Thanks Steph...
I'll slightly increase refresh_interval and flush_threshold_size step by step to see the difference (starting with 20s and 1gb).

My current rates at the moment:
there is only one node with a rejection rate of 0.2%; the other nodes are either fine or at around 1 in 100,000. This rejecting node holds shards of some heavily indexed indices. So I guess I'm nearly done with my problems; some slight improvements should make it OK.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.