Low performance while searching

Hi,

We are experiencing poor search performance in our Kubernetes ES cluster. It is a 5-node cluster with the following specifications per node:

  • ES_JAVA_OPTS: -Xms4096m -Xmx4096m
  • CPU request: 500m
  • CPU limit: 8
  • Memory request: 8Gi
  • Memory limit: 8Gi

Our cluster is running on AWS EC2 i3.4xlarge instances which provide:

  • Networking Performance: Up to 10 Gigabit
  • Storage: 2 x 1.9 TB NVMe SSD

We are using index lifecycle management policies to roll over when an index reaches 125 GiB (we create indices with 5 primary shards, following the best practice of keeping shards at a reasonable size, around 25 GiB each).
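
For illustration, the rollover part of such a policy looks roughly like this (the policy name is a placeholder and the other phases are omitted; this is a sketch, not our exact configuration):

PUT _ilm/policy/<policy_name>
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "125gb"
          }
        }
      }
    }
  }
}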

We ingest around 125 GiB per day and keep about 2 TB of data stored. The data is fairly balanced across the nodes:

   shards disk.indices   node
     18      373.7gb    node-0
     17      463.6gb    node-1
     18      417.4gb    node-2
     18      311.1gb    node-3
     18      326.8gb    node-4

All nodes have the same roles (master-eligible and data).

When searching over the last 7 days of data, the query takes around 1 minute. However, when trying to retrieve the last 15 days, we get a timeout after 2 or 4 minutes (I don't understand why the timeout varies, since the configuration is always the same).

How could we improve the performance? What metrics could we look at to know why our queries are slow?

The resource usage when performing the queries is as follows:

CPU: the nodes do not exceed 5 CPUs, even when there is CPU available on the instance.

JVM: [graph]
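
For reference, CPU and heap usage per node can also be checked directly from Elasticsearch, for example:

GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent,ram.percent

Here cpu and load_1m reflect OS-level usage on each node, and heap.percent shows how full the JVM heap is.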

Best regards.

What else is running on the hosts? How much CPU are those pods using? What does the query you are running look like? Which version of Elasticsearch are you using?


Hi Christian,

Thank you for your prompt reply.

On those hosts we run other containerized applications, including other ES clusters. There is enough CPU available on the instances; the one with the highest usage is at 50%:

The CPU used by the pods when only indexing is the following:

And the CPU used when indexing and searching is the one I showed before:

We are using Elasticsearch 7.7.0, and the query performed is the following (some headers and specific values have been hidden because they contain sensitive data):

curl 'https://<ip>:<port>/internal/search/es' \
--data-binary '{"params":{"preference":1604046642498,"index":"<index>","body":{"version":true,"size":"100","sort":[{"timestamp":{"order":"desc","unmapped_type":"boolean"}}],"aggs":{"2":{"date_histogram":{"field":"timestamp","fixed_interval":"<hours>","time_zone":"<time_zone>","min_doc_count":1}}},"stored_fields":["*"],"script_fields":{},"docvalue_fields":[{"field":"<field1>","format":"date_time"},{"field":"<field2>","format":"date_time"},{"field":"<field3>","format":"date_time"},{"field":"<field4>","format":"date_time"},{"field":"<field5>","format":"date_time"},{"field":"<field6>","format":"date_time"},{"field":"<field7>","format":"date_time"},{"field":"<field8>","format":"date_time"},{"field":"<field9>","format":"date_time"},{"field":"<field10>","format":"date_time"},{"field":"<field11>","format":"date_time"},{"field":"timestamp","format":"date_time"}],"_source":{"excludes":["@timestamp"]},"query":{"bool":{"must":[],"filter":[{"match_all":{}},{"range":{"timestamp":{"gte":"2020-10-15T07:31:35.811Z","lte":"2020-10-30T08:31:35.811Z","format":"strict_date_optional_time"}}}],"should":[],"must_not":[]}},"highlight":{"pre_tags":["@kibana-highlighted-field@"],"post_tags":["@/kibana-highlighted-field@"],"fields":{"*":{}},"fragment_size":2147483647}},"rest_total_hits_as_int":true,"ignore_unavailable":true,"ignore_throttled":true,"timeout":"240000ms"},"serverStrategy":"es"}' \
--compressed
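
If it helps with diagnosing, a stripped-down version of the same query can be run with the profile API to see where the time goes per shard (this is only a sketch; it reuses the same range filter and date histogram, with the hits and highlighting removed):

GET <index>/_search
{
  "profile": true,
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "timestamp": { "gte": "2020-10-15T07:31:35.811Z", "lte": "2020-10-30T08:31:35.811Z", "format": "strict_date_optional_time" } } }
      ]
    }
  },
  "aggs": {
    "2": {
      "date_histogram": { "field": "timestamp", "fixed_interval": "<hours>", "min_doc_count": 1 }
    }
  }
}

The profile output breaks the time down per shard and per query and aggregation component.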

Best regards.

Can you get some information about the thread pool sizes, e.g. GET /_cat/thread_pool?v&h=id,name,size,active,core,max,pool_size?


Hi Christian,

We had to add two nodes to our ES cluster, so we currently have a 7-node cluster. Search performance seems to be quite good after this change.

This is the output from the thread pool query:

GET /_cat/thread_pool?v&h=id,name,size,active,core,max,pool_size

id                     name                      size active core max pool_size
LiKwe1Z0RgaMUOGWRByB9Q ad-threadpool                2      0                  0
LiKwe1Z0RgaMUOGWRByB9Q analyze                      1      0                  0
LiKwe1Z0RgaMUOGWRByB9Q fetch_shard_started                 0    1  16         1
LiKwe1Z0RgaMUOGWRByB9Q fetch_shard_store                   0    1  16         1
LiKwe1Z0RgaMUOGWRByB9Q flush                               0    1   4         1
LiKwe1Z0RgaMUOGWRByB9Q force_merge                  1      0                  0
LiKwe1Z0RgaMUOGWRByB9Q generic                             0    4 128        10
LiKwe1Z0RgaMUOGWRByB9Q get                          8      0                  8
LiKwe1Z0RgaMUOGWRByB9Q listener                     4      0                  0
LiKwe1Z0RgaMUOGWRByB9Q management                          1    1   5         5
LiKwe1Z0RgaMUOGWRByB9Q refresh                             0    1   4         4
LiKwe1Z0RgaMUOGWRByB9Q search                      13      0                 13
LiKwe1Z0RgaMUOGWRByB9Q search_throttled             1      0                  0
LiKwe1Z0RgaMUOGWRByB9Q snapshot                            0    1   4         1
LiKwe1Z0RgaMUOGWRByB9Q sql-worker                   8      0                  0
LiKwe1Z0RgaMUOGWRByB9Q warmer                              0    1   4         0
LiKwe1Z0RgaMUOGWRByB9Q write                        8      0                  8
Y0SOM_rgSxy1jA8GRoWdfg ad-threadpool                2      0                  0
Y0SOM_rgSxy1jA8GRoWdfg analyze                      1      0                  0
Y0SOM_rgSxy1jA8GRoWdfg fetch_shard_started                 0    1  16         1
Y0SOM_rgSxy1jA8GRoWdfg fetch_shard_store                   0    1  16         1
Y0SOM_rgSxy1jA8GRoWdfg flush                               0    1   4         1
Y0SOM_rgSxy1jA8GRoWdfg force_merge                  1      0                  0
Y0SOM_rgSxy1jA8GRoWdfg generic                             0    4 128        12
Y0SOM_rgSxy1jA8GRoWdfg get                          8      0                  8
Y0SOM_rgSxy1jA8GRoWdfg listener                     4      0                  0
Y0SOM_rgSxy1jA8GRoWdfg management                          1    1   5         5
Y0SOM_rgSxy1jA8GRoWdfg refresh                             0    1   4         4
Y0SOM_rgSxy1jA8GRoWdfg search                      13      0                 13
Y0SOM_rgSxy1jA8GRoWdfg search_throttled             1      0                  0
Y0SOM_rgSxy1jA8GRoWdfg snapshot                            0    1   4         1
Y0SOM_rgSxy1jA8GRoWdfg sql-worker                   8      0                  0
Y0SOM_rgSxy1jA8GRoWdfg warmer                              0    1   4         0
Y0SOM_rgSxy1jA8GRoWdfg write                        8      0                  8
GU7-3wyfRvii3XOu0af1XA ad-threadpool                2      0                  0
GU7-3wyfRvii3XOu0af1XA analyze                      1      0                  0
GU7-3wyfRvii3XOu0af1XA fetch_shard_started                 0    1  16         1
GU7-3wyfRvii3XOu0af1XA fetch_shard_store                   0    1  16         1
GU7-3wyfRvii3XOu0af1XA flush                               0    1   4         2
GU7-3wyfRvii3XOu0af1XA force_merge                  1      0                  0
GU7-3wyfRvii3XOu0af1XA generic                             0    4 128        11
GU7-3wyfRvii3XOu0af1XA get                          8      0                  8
GU7-3wyfRvii3XOu0af1XA listener                     4      0                  0
GU7-3wyfRvii3XOu0af1XA management                          1    1   5         5
GU7-3wyfRvii3XOu0af1XA refresh                             0    1   4         4
GU7-3wyfRvii3XOu0af1XA search                      13      0                 13
GU7-3wyfRvii3XOu0af1XA search_throttled             1      0                  0
GU7-3wyfRvii3XOu0af1XA snapshot                            0    1   4         1
GU7-3wyfRvii3XOu0af1XA sql-worker                   8      0                  0
GU7-3wyfRvii3XOu0af1XA warmer                              0    1   4         4
GU7-3wyfRvii3XOu0af1XA write                        8      0                  8
uhd9NaGSR1KvhcCRUf7-jg ad-threadpool                2      0                  0
uhd9NaGSR1KvhcCRUf7-jg analyze                      1      0                  0
uhd9NaGSR1KvhcCRUf7-jg fetch_shard_started                 0    1  16         1
uhd9NaGSR1KvhcCRUf7-jg fetch_shard_store                   0    1  16         1
uhd9NaGSR1KvhcCRUf7-jg flush                               0    1   4         1
uhd9NaGSR1KvhcCRUf7-jg force_merge                  1      0                  0
uhd9NaGSR1KvhcCRUf7-jg generic                             0    4 128        11
uhd9NaGSR1KvhcCRUf7-jg get                          8      0                  8
uhd9NaGSR1KvhcCRUf7-jg listener                     4      0                  0
uhd9NaGSR1KvhcCRUf7-jg management                          1    1   5         5
uhd9NaGSR1KvhcCRUf7-jg refresh                             0    1   4         3
uhd9NaGSR1KvhcCRUf7-jg search                      13      0                 13
uhd9NaGSR1KvhcCRUf7-jg search_throttled             1      0                  0
uhd9NaGSR1KvhcCRUf7-jg snapshot                            0    1   4         1
uhd9NaGSR1KvhcCRUf7-jg sql-worker                   8      0                  0
uhd9NaGSR1KvhcCRUf7-jg warmer                              0    1   4         0
uhd9NaGSR1KvhcCRUf7-jg write                        8      0                  8
0zxhTft8QE2jrYxmIfcl8g ad-threadpool                2      0                  0
0zxhTft8QE2jrYxmIfcl8g analyze                      1      0                  0
0zxhTft8QE2jrYxmIfcl8g fetch_shard_started                 0    1  16         1
0zxhTft8QE2jrYxmIfcl8g fetch_shard_store                   0    1  16         1
0zxhTft8QE2jrYxmIfcl8g flush                               0    1   4         1
0zxhTft8QE2jrYxmIfcl8g force_merge                  1      0                  0
0zxhTft8QE2jrYxmIfcl8g generic                             0    4 128         9
0zxhTft8QE2jrYxmIfcl8g get                          8      0                  8
0zxhTft8QE2jrYxmIfcl8g listener                     4      0                  0
0zxhTft8QE2jrYxmIfcl8g management                          1    1   5         5
0zxhTft8QE2jrYxmIfcl8g refresh                             0    1   4         4
0zxhTft8QE2jrYxmIfcl8g search                      13      0                 13
0zxhTft8QE2jrYxmIfcl8g search_throttled             1      0                  0
0zxhTft8QE2jrYxmIfcl8g snapshot                            0    1   4         1
0zxhTft8QE2jrYxmIfcl8g sql-worker                   8      0                  0
0zxhTft8QE2jrYxmIfcl8g warmer                              0    1   4         1
0zxhTft8QE2jrYxmIfcl8g write                        8      0                  8
ya-p5-h7S0iMMWhFRewJIA ad-threadpool                2      0                  0
ya-p5-h7S0iMMWhFRewJIA analyze                      1      0                  0
ya-p5-h7S0iMMWhFRewJIA fetch_shard_started                 0    1  16         1
ya-p5-h7S0iMMWhFRewJIA fetch_shard_store                   0    1  16         1
ya-p5-h7S0iMMWhFRewJIA flush                               0    1   4         1
ya-p5-h7S0iMMWhFRewJIA force_merge                  1      0                  0
ya-p5-h7S0iMMWhFRewJIA generic                             0    4 128        12
ya-p5-h7S0iMMWhFRewJIA get                          8      0                  8
ya-p5-h7S0iMMWhFRewJIA listener                     4      0                  0
ya-p5-h7S0iMMWhFRewJIA management                          1    1   5         3
ya-p5-h7S0iMMWhFRewJIA refresh                             0    1   4         4
ya-p5-h7S0iMMWhFRewJIA search                      13      0                 13
ya-p5-h7S0iMMWhFRewJIA search_throttled             1      0                  0
ya-p5-h7S0iMMWhFRewJIA snapshot                            0    1   4         1
ya-p5-h7S0iMMWhFRewJIA sql-worker                   8      0                  0
ya-p5-h7S0iMMWhFRewJIA warmer                              0    1   4         0
ya-p5-h7S0iMMWhFRewJIA write                        8      0                  8
vJastktFR9C9sT4Qxvgg-g ad-threadpool                2      0                  0
vJastktFR9C9sT4Qxvgg-g analyze                      1      0                  0
vJastktFR9C9sT4Qxvgg-g fetch_shard_started                 0    1  16         1
vJastktFR9C9sT4Qxvgg-g fetch_shard_store                   0    1  16         1
vJastktFR9C9sT4Qxvgg-g flush                               0    1   4         2
vJastktFR9C9sT4Qxvgg-g force_merge                  1      0                  0
vJastktFR9C9sT4Qxvgg-g generic                             0    4 128         8
vJastktFR9C9sT4Qxvgg-g get                          8      0                  8
vJastktFR9C9sT4Qxvgg-g listener                     4      0                  0
vJastktFR9C9sT4Qxvgg-g management                          1    1   5         5
vJastktFR9C9sT4Qxvgg-g refresh                             0    1   4         4
vJastktFR9C9sT4Qxvgg-g search                      13      0                 13
vJastktFR9C9sT4Qxvgg-g search_throttled             1      0                  0
vJastktFR9C9sT4Qxvgg-g snapshot                            0    1   4         1
vJastktFR9C9sT4Qxvgg-g sql-worker                   8      0                  0
vJastktFR9C9sT4Qxvgg-g warmer                              0    1   4         4
vJastktFR9C9sT4Qxvgg-g write                        8      0                  8

When we had the 5-node cluster, it did not show any rejected search requests:

GET _cat/thread_pool/search?v&h=node_name,name,active,rejected,completed

node_name name   active rejected completed
node-0    search      0        0      2489
node-1    search      0        0      2193
node-4    search      0        0      9287
node-3    search      0        0      9374
node-2    search      0        0      3269

What would you highlight in the thread pool output that could point to a possible issue (not just in our specific output, but in general)?

After adding the 2 extra nodes the cluster works fine, but we still do not know why performance was poor before. What metrics can we look at to determine whether the problem was a lack of CPU, memory, or I/O?
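
As far as we understand, something like the following should expose those metrics per node, although we are not sure which numbers to focus on:

GET _nodes/stats/os,process,jvm,fs,thread_pool
GET _nodes/hot_threads
GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed

(os and fs cover CPU and disk I/O, jvm covers heap and garbage collection, thread_pool shows queueing and rejections, and hot_threads shows what the busiest threads are doing at that moment.)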

Best regards.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.