Elasticsearch CPU/load high with search thread pool queues high


#1

Our Elasticsearch cluster's search queues spike constantly and CPU/load keeps increasing. I'm mostly looking for advice on optimization rather than a single answer.

The main indices are constantly being updated via bulk requests, with a 300s refresh interval.
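For context, that interval is set per index through the dynamic index settings API; a minimal sketch (the index name here is just a placeholder):

PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "300s"
  }
}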

CPU usage

07:37:48 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:37:53 PM     all     74.38      0.00      1.27      0.00      0.06     24.29

Load

19:38:26 up 18 days, 8:08, 3 users, load average: 20.09, 16.88, 16.08

Memory

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            119          30           1           0          88          88
Swap:             0           0           0

Biggest indices

512 shards, avg shard size 3-4 GB, total size 4.1 TB

64 shards, avg shard size 3 GB, total size 408 GB

Kibana/X-pack monitoring (screenshot not included)

Thread pool

id                     host          name   active queue rejected  completed
GSGumFBBRZS0bQ0aoVMsF 10.201.20.89  search     25    89        0  827820578
0sjcw6fRQDmKjMABkBoFM 10.201.20.195 search      5     0        0 1301852541
ORc35jDwSAa23gYKwqxGA 10.201.20.116 search     12     0        0  830988743
mjBEn8OETA-vQGrhK0W8X 10.201.20.61  search     16     2       61 1258009280
0TPNIdweSPunxXpehP0WK 10.201.20.169 search     23     3        0  876063395
kHG_ZXegTBOo5UaWFCEH0 10.201.20.98  search     12     1      294  930992459
BCCfQTZVR96dsfgYguRdD 10.201.20.235 search     25    22      113  878374510
T_YjqiNhTyOZ16DX3YqwC 10.201.20.254 search     25    20      157  902678205
WsDzh87ET6mFZFIUM-KOw 10.201.20.166 search     15     2        0  869113390
pnQCehRdTy6Brq7NAHEXG 10.201.20.148 search      3     0        0  409167893
uj8dWi4JSnyPXRzjA0xJm 10.201.20.60  search     21     1        0   49490497
B5Z4mY2nTFWv8xNqgl1O_ 10.201.20.19  search     11     0        0  358465691
DFWMHq_hRcaheA6EbYwAy 10.201.20.213 search     25   129        0   49533248
0J3CoHe0RK-g42BHVOAso 10.201.20.18  search     16     0      125 1306944101
seW_iSbwRUahXXRGmHBhL 10.201.20.105 search      5     0        0   58982461
Mxuo0aZaTVKCu4DA7pguY 10.201.20.57  search     20     3       11  871736991
nV0jd9_9SBazFJBMcW6fN 10.201.20.107 search     25    17      257 1301609159
lPRTFUWLSoqfy5WyNVIhq 10.201.20.228 search     25    24        0   58576149
dViqleV_TRCMCjY_10drG 10.201.20.219 search     11     0        0  911538587
Qu52J_ZuTgO5VK_jb7hyH 10.201.20.252 search     10     0      757 1337745981
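The table above is the sort of output the _cat thread pool API returns; a call along these lines should give the same columns:

GET /_cat/thread_pool/search?v&h=id,host,name,active,queue,rejected,completed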

Hot threads:
https://gist.githubusercontent.com/ofrivera/60ed0849318e78cee56ea8b39d50c57b/raw/cc75d1eabc03f02598cd2219f6876e066e3640a6/gistfile1.txt

Specs

AWS i3.4xlarge
20 nodes (all data nodes)
120 GB RAM per node
16 vCPUs per node

Config file elastic

bootstrap.memory_lock: true
cluster.name: elasticsearch-prod
discovery.ec2.host_type: private_ip
discovery.ec2.tag.ESCluster: elasticsearch-prod
discovery.zen.hosts_provider: ec2
http.cors.allow-origin : "*"
http.cors.enabled : true
indices.fielddata.cache.size: "50%"
network.host: "_ec2_"
node.data: true
node.ingest: true
node.master: true
node.name: elasticsearch-prod-node-i-00f084a45088465db
path.data: /elasticsearch/data
path.logs: /var/log/elasticsearch
xpack.security.enabled: false
http.port: 9200

Config file JVM

-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djdk.io.permissionsUseCanonicalPath=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j.skipJansi=true
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+DisableExplicitGC
-XX:+HeapDumpOnOutOfMemoryError
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-Xms26g
-Xmx26g
-Xss1m
-server

Health

{
  "cluster_name" : "elasticsearch-prod",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 20,
  "number_of_data_nodes" : 20,
  "active_primary_shards" : 629,
  "active_shards" : 1258,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

OS
CentOS Linux release 7.5.1804 (Core)
3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

RPM:
elasticsearch-6.4.2.rpm
Details:
elasticsearch-6.4.2-1.noarch

JAVA:
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

Any insight greatly appreciated!


(Alexander Reelsen) #2

If CPU is spiking, you can use the hot_threads API to find out what parts of the code are causing the high load.

Put that into a gist and link it over here, so we can take a look!
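For reference, a minimal call looks something like this (the threads/interval parameters are optional and shown here with their usual defaults):

GET /_nodes/hot_threads?threads=3&interval=500ms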


#3

Sure, this is how the hot_threads API output looked when I posted this topic:

https://gist.githubusercontent.com/ofrivera/60ed0849318e78cee56ea8b39d50c57b/raw/cc75d1eabc03f02598cd2219f6876e066e3640a6/gistfile1.txt

This is how it looks now:

https://gist.githubusercontent.com/ofrivera/ed1548081df0483edbf0b35d05121e7a/raw/f78ea3dab75ac9dae5049a2fb30f6e217cd59738/gistfile1.txt

Thanks!


(Alexander Reelsen) #4

After a quick peek it looks as if regular search operations are the cause - you just seem to do a lot of searching. I saw some scoring involved, some aggregations, etc.

Maybe there is a chance to simplify those queries? Maybe you can find out which queries are potentially causing this spike? The search slow log could help here.
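For reference, the search slow log is enabled per index via dynamic settings; a rough sketch (index name and thresholds are only examples, tune them to your actual latencies):

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.level": "info"
}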

I saw there is some dynamic scripting involved; maybe that can be made less dynamic by indexing more data (if that is really the culprit, it could also be another query).
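One way to make it less dynamic is to precompute the scripted value at index time, for example with an ingest pipeline; a rough sketch with made-up field names:

PUT /_ingest/pipeline/precompute-total
{
  "description": "Precompute a value at index time instead of scripting it at search time",
  "processors": [
    {
      "script": {
        "source": "ctx.total = ctx.price * ctx.quantity"
      }
    }
  ]
}

Documents indexed with ?pipeline=precompute-total then carry a plain total field that queries can filter, sort or aggregate on without running a script per hit.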


#5

Thanks! I'll have a look at slow queries and try to find a pattern from there.