Elasticsearch CPU/load and search thread pool queues running high

Our Elasticsearch cluster's search queues spike constantly and CPU/load keeps climbing. I'm mostly looking for optimization advice, not a single definitive answer.

The main indices are constantly being updated via bulk requests, with a 300s refresh interval.
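
For reference, the refresh interval is applied per index, roughly like this (the index name is just a placeholder):

PUT /main-index/_settings
{
  "index": {
    "refresh_interval": "300s"
  }
}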

CPU usage

07:37:48 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:37:53 PM     all     74.38      0.00      1.27      0.00      0.06     24.29

Load

19:38:26 up 18 days, 8:08, 3 users, load average: 20.09, 16.88, 16.08

Memory

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            119          30           1           0          88          88
Swap:             0           0           0

Biggest indices

512 shards, avg shard size 3-4 GB, total size 4.1 TB

64 shards, avg shard size 3 GB, total size 408 GB
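
(Shard and index sizes like the above can be pulled from the cat APIs, for anyone wanting to reproduce, e.g.:)

GET _cat/indices?v&s=store.size:desc&h=index,pri,rep,pri.store.size,store.size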

Kibana/X-pack monitoring (screenshots omitted)

Thread pool

id                     host          name   active queue rejected  completed
GSGumFBBRZS0bQ0aoVMsF 10.201.20.89  search     25    89        0  827820578
0sjcw6fRQDmKjMABkBoFM 10.201.20.195 search      5     0        0 1301852541
ORc35jDwSAa23gYKwqxGA 10.201.20.116 search     12     0        0  830988743
mjBEn8OETA-vQGrhK0W8X 10.201.20.61  search     16     2       61 1258009280
0TPNIdweSPunxXpehP0WK 10.201.20.169 search     23     3        0  876063395
kHG_ZXegTBOo5UaWFCEH0 10.201.20.98  search     12     1      294  930992459
BCCfQTZVR96dsfgYguRdD 10.201.20.235 search     25    22      113  878374510
T_YjqiNhTyOZ16DX3YqwC 10.201.20.254 search     25    20      157  902678205
WsDzh87ET6mFZFIUM-KOw 10.201.20.166 search     15     2        0  869113390
pnQCehRdTy6Brq7NAHEXG 10.201.20.148 search      3     0        0  409167893
uj8dWi4JSnyPXRzjA0xJm 10.201.20.60  search     21     1        0   49490497
B5Z4mY2nTFWv8xNqgl1O_ 10.201.20.19  search     11     0        0  358465691
DFWMHq_hRcaheA6EbYwAy 10.201.20.213 search     25   129        0   49533248
0J3CoHe0RK-g42BHVOAso 10.201.20.18  search     16     0      125 1306944101
seW_iSbwRUahXXRGmHBhL 10.201.20.105 search      5     0        0   58982461
Mxuo0aZaTVKCu4DA7pguY 10.201.20.57  search     20     3       11  871736991
nV0jd9_9SBazFJBMcW6fN 10.201.20.107 search     25    17      257 1301609159
lPRTFUWLSoqfy5WyNVIhq 10.201.20.228 search     25    24        0   58576149
dViqleV_TRCMCjY_10drG 10.201.20.219 search     11     0        0  911538587
Qu52J_ZuTgO5VK_jb7hyH 10.201.20.252 search     10     0      757 1337745981
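
(The table above is the output of the cat thread pool API, along these lines:)

GET _cat/thread_pool/search?v&h=id,host,name,active,queue,rejected,completed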

Hot threads:
https://gist.githubusercontent.com/ofrivera/60ed0849318e78cee56ea8b39d50c57b/raw/cc75d1eabc03f02598cd2219f6876e066e3640a6/gistfile1.txt

Specs

AWS i3.4xlarge
20 nodes (all data nodes)
120 GB RAM
16 vCPUs

Config file elastic

bootstrap.memory_lock: true
cluster.name: elasticsearch-prod
discovery.ec2.host_type: private_ip
discovery.ec2.tag.ESCluster: elasticsearch-prod
discovery.zen.hosts_provider: ec2
http.cors.allow-origin : "*"
http.cors.enabled : true
indices.fielddata.cache.size: "50%"
network.host: "_ec2_"
node.data: true
node.ingest: true
node.master: true
node.name: elasticsearch-prod-node-i-00f084a45088465db
path.data: /elasticsearch/data
path.logs: /var/log/elasticsearch
xpack.security.enabled: false
http.port: 9200

Config file JVM

-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djdk.io.permissionsUseCanonicalPath=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j.skipJansi=true
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+DisableExplicitGC
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-Xms26g
-Xmx26g
-Xss1m
-server

Health

{
  "cluster_name" : "elasticsearch-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 20,
  "active_primary_shards" : 629,
  "active_shards" : 1258,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
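
(For reference, that is the output of the cluster health API:)

GET _cluster/health?pretty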

OS
CentOS Linux release 7.5.1804 (Core)
3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

RPM:
elasticsearch-6.4.2.rpm
Details:
elasticsearch-6.4.2-1.noarch

JAVA:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

Any insight greatly appreciated!

If CPU is spiking, you can use the hot_threads API to find out what parts of the code are causing the high load.

Put that into a gist and link it over here, so we can take a look!
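
For reference, that is just a GET against the nodes API, something like:

GET _nodes/hot_threads?threads=3&interval=500ms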

Sure, this is how the hot_threads API output looked when I posted this topic:

https://gist.githubusercontent.com/ofrivera/60ed0849318e78cee56ea8b39d50c57b/raw/cc75d1eabc03f02598cd2219f6876e066e3640a6/gistfile1.txt

This is how it looks now:

https://gist.githubusercontent.com/ofrivera/ed1548081df0483edbf0b35d05121e7a/raw/f78ea3dab75ac9dae5049a2fb30f6e217cd59738/gistfile1.txt

Thanks!

After a quick peek it looks as if regular search operations are the cause - you just seem to do a lot of searching. I saw some scoring involved, some aggregations, etc.

Maybe there is a chance to simplify those queries? Maybe you can find out which queries are potentially causing this spike? The search slow log could help here.
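
The slow log thresholds are set per index; as a rough example (index name and thresholds are placeholders):

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}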

I also saw some dynamic scripting involved; maybe that can be made less dynamic by indexing more data (if that is really the culprit - it could also be another query).
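
As a generic sketch of what I mean (made-up index and field names, not your actual query): instead of scoring with a script at search time, e.g.

GET /my-index/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "foo" } },
      "script_score": {
        "script": { "source": "doc['likes'].value * 2" }
      }
    }
  }
}

you could precompute the value at index time and score on the stored field:

GET /my-index/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "foo" } },
      "field_value_factor": { "field": "boosted_likes" }
    }
  }
}

That trades query-time work for index-time work, which is usually the cheaper side when search volume is high.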

Thanks! I'll have a look at slow queries and try to find a pattern from there.
