Elasticsearch 7.4 query_then_fetch slow log

We have deployed an Elasticsearch 7.4 cluster of about 20 servers. Three nodes run on each machine, each started with 30 GB of memory, and each server has an 88-core CPU and 256 GB of RAM.
Indices are created per day, such as access-log-2023.07.31, and each daily index holds about 1 TB of data.
The problem found so far is that when viewing index data in Kibana's Discover, there are occasional slow queries. For example, when viewing the data of the last 2 days, the query sometimes gets stuck on a shard on a particular machine and only responds once the timeout set in the request body ({"timeout":"120000ms"}) is reached.
Within a single index, some shards return in 30 seconds while others take 2 minutes. When I migrate the slow shards to other servers, the query becomes faster, and after migrating them back to the original node the query may be normal again. Have you ever encountered this abnormal behavior? What causes it?
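For reference, moving a single shard between nodes as described above can be done with the cluster reroute API; this is only a minimal sketch, and the shard number and node names are placeholders rather than our actual values:

# move one shard of the daily index off the node that answers slowly
# (shard number and node names are placeholders)
POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "access-log-2023.07.31",
        "shard": 3,
        "from_node": "node-original",
        "to_node": "node-other"
      }
    }
  ]
}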

What is the average shard size for the index you are querying?

How many indices/shards are you querying at a time?

How much data does each node hold?

What type of storage are you using? Local SSDs?

What is the average shard size for the index you are querying?
The index I queried has 12 primary shards and 1 replica. The primary shards total about 500 GB, the total data size is about 1000 GB, and the index contains about 1.3 billion documents.

How many indices/shards are you querying at a time?
For example, if I view the last two days in Kibana's Discover, the search queries access-log-2023.07.31 and access-log-2023.08.01.
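For illustration, such a Discover view corresponds roughly to a search like the one below; the timestamp field name, page size, and sort are assumptions based on the description above, not the exact request Kibana sends:

# rough equivalent of a Discover query over the last two daily indices
# (@timestamp, size and sort are illustrative)
GET /access-log-2023.07.31,access-log-2023.08.01/_search
{
  "timeout": "120000ms",
  "size": 500,
  "sort": [
    { "@timestamp": { "order": "desc" } }
  ],
  "query": {
    "range": { "@timestamp": { "gte": "now-2d", "lte": "now" } }
  }
}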

How much data does each node hold?
The entire cluster has more than 900 indices and a total of about 4500 fragments, which are evenly distributed across 80 nodes, so each node holds about 55 fragments.
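If useful, the per-node shard count and disk usage can be confirmed with the cat allocation API; the column selection here is only an example:

# per-node shard count and disk usage (columns chosen for illustration)
GET /_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail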

What type of storage are you using? Local SSDs?
We use high-performance SSDs.

The data distribution strategy is managed by Elasticsearch itself and the data is spread evenly. The cluster holds indices of different sizes for various kinds of log data. I enabled the slow query log with a threshold of 30 s. Some shards return in 30 s or 40 s, but for the slow shards the time recorded in the slow log always matches whatever timeout is set in the _search request body. With {"timeout":"300000ms"} the slow shard logs took[5m], took_millis[300577]; with {"timeout":"60000ms"} it logs took[1m]. It feels as if these shards simply do not respond until the timeout fires.
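The 30 s slow query log threshold can be enabled with dynamic index settings along these lines; the index pattern and the threshold levels shown are illustrative rather than our exact settings:

# enable the search slow log on the daily indices (threshold values are illustrative)
PUT /access-log-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "30s",
  "index.search.slowlog.threshold.fetch.warn": "30s"
}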
When querying with the timeout set to 300000ms, the Kibana index monitoring metrics "request rate", "request time (ms)" and "latency" show values during the first 2 minutes, then go quiet, and then show values again for the next 2 minutes, with about 2 minutes of idle time in the middle. I don't know what this phenomenon means.

Do you have any ideas or methods to locate the problem?

Can you please provide the full output of the cluster stats API?

By 'fragments', do you mean shards?

Are there any warnings about long or frequent GC in the Elasticsearch logs? What is the heap size of the nodes?
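Heap size and current heap usage per node can be checked quickly with the cat nodes API; the column selection below is just an example:

# per-node heap and RAM usage (columns chosen for illustration)
GET /_cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.percent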

_cluster/stats
{"_nodes":{"total":84,"successful":84,"failed":0},"cluster_name":"xxx","cluster_uuid":"xxxxxxxxxxxxxxxxxxxxxx","timestamp":1690968684830,"status":"green","indices":{"count":912,"shards":{"total":475 8,"primaries":2379,"replication":1.0,"index":{"shards":{"min":2,"max":3 0,"avg":5.217105263157895},"primaries":{"min":1,"max":15,"avg":2.6085526315789473},"replication":{"min":1.0,"max":1.0,"avg":1.0}}},"docs ":{"count":37483453682,"deleted":289116},"store":{"size_in_bytes":35377724558266},"fielddata":{"memory_size_in_bytes":302128,"evictions":0},"query_cache":{"memory_size_in_bytes":943472823,"total_count":50 57398,"hit_count":725887,"miss_count":4331511,"cache_size":22867,"ca che_count":31270,"evictions":8403},"completion":{"size_in_bytes":0},"segments":{"count":79794,"memory_in_bytes":30723842376,"terms_me mory_in_bytes":17195078109,"stored_fields_memory_in_bytes":1080636 9616,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":11 9393984,"points_memory_in_bytes":2419127699,"doc_values_memory_i n_bytes":183872968,"index_writer_memory_in_bytes":1532396371,"vers ion_map_memory_in_bytes":1062308,"fixed_bit_set_memory,in_bytes":14227960,"max_unsafe_auto_id_timestamp":1690944791212,"file_sizes":{}}},"nodes":{"count":{"total":84,"coordinating_only":0,"data":84,"inges t":84,"master":9,"ml":0,"voting_only":0},"versions":["7.4.2"],"os":{"avail able_processors":7392,"allocated_processors":7392,"names":[{"name":"L inux","count":84}],"pretty_names":[{"pretty_name":"CentOS Linux 7 (Core)","count":84}],"mem":{"total_in_bytes":22683986227200,"free_in_b ytes":577968766976,"used_in_bytes":22106017460224,"free_percent":3,"used_percent":97}},"process":{"cpu":{"percent":22},"open_file_descrip tors":{"min":4360,"max":5015,"avg":4584}},"jvm":{"max_uptime_in_millis":4390267377,"versions":[{"version":"13.0.1","vm_name":"OpenJDK 64-Bit Server VM","vm_version":"13.0.1+9","vm_vendor":"AdoptOpenJDK ","bundled_jdk":true,"using_bundled_jdk":false,"count":84}],"mem":{"heap_used_in_bytes":1141582433168,"heap_max_in_bytes":2663330611200},"threads":63498},"fs":{"total_in_bytes":81861694922752,"free_in_b ytes":45997344972800,"available_in_bytes":45997344972800},"plugins":,"network_types":{"transport_types":{"security4":84},"http_types":{"s ecurity4":84}},"discovery_types":{"zen":84},"packaging_types":[{"flavor ":"default","type":"tar","count":84}]}}

Yes, by fragments I mean shards.
There are no warnings about long or frequent GC in the logs.

Let's stick to official terminology and refer to shards as shards.

I see that you have edited the response and removed parts. Do you have any third-party plugins installed that may affect resource usage and/or the performance of the cluster?

Also note that the version you are running is very old and has been EOL for a long time. I would recommend upgrading to at least version 7.17.

Our cluster does not have any plugins installed.
The _cluster/stats output above is complete apart from the cluster name, which I redacted as sensitive information.
Have you ever run into the phenomenon I described? I don't know how to deal with it at the moment, and the cost of upgrading would be very high for us.
