ES 5.5.1 extremely slow on multiple indices

Hi there,

We're using ES 5.5.1 for archival purposes.
Lately, the search engine has become extremely slow; some queries run for hours until they finally time out, while others time out rather quickly.

In this one particular instance, the index is > 600 GiB with just under 2 million documents.
Below are the index's settings:
"settings" : {
  "index" : {
    "mapping" : {
      "total_fields" : {
        "limit" : "65535"
      }
    },
    "refresh_interval" : "1m",
    "number_of_shards" : "9",
    "auto_expand_replicas" : "2-15",
    "provided_name" : "de_gdsk",
    "creation_date" : "1492077704119",
    "store" : {
      "type" : "mmapfs"
    },
    "number_of_replicas" : "2",
    "queries" : {
      "cache" : {
        "enabled" : "true"
      }
    },
    "uuid" : "4LB_JsIdToWwly9JZzX_WQ",
    "version" : {
      "created" : "5020299",
      "upgraded" : "5050199"
    }
  }
}

This setup has worked for over a year, and now it is failing.
This index is rather old and not used very often, but it contains documents vital to a customer's ability to stay in business. After a recent file transfer (from Linux to Windows), certain paths in the documents require modification.

A query on said index takes extremely long to complete, and as of today it ends with a shard failure; most of the time the request simply times out.

This is the shard failure:
[2018-05-28T10:42:07,408][DEBUG][o.e.a.s.TransportSearchAction] [node-windows] [85998] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [node-windows-3][][indices:data/read/search[phase/fetch/id]]
Caused by: No search context found for id [85998]
at ~[elasticsearch-5.5.1.jar:5.5.1]
at ~[elasticsearch-5.5.1.jar:5.5.1]
at$12.messageReceived( ~[elasticsearch-5.5.1.jar:5.5.1]
at$12.messageReceived( ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived( ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun( ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun( ~[elasticsearch-5.5.1.jar:5.5.1]
at ~[elasticsearch-5.5.1.jar:5.5.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$ ~[?:1.8.0_144]
at [?:1.8.0_144]

Furthermore, the cluster is experiencing a lot of GC overhead, spewing logs like the following into the console on all three nodes:
[2018-05-28T09:58:44,288][INFO ][o.e.m.j.JvmGcMonitorService] [node-windows] [gc][235] overhead, spent [468ms] collecting in the last [1.2s]
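To confirm whether heap pressure is behind the GC overhead, the nodes stats API reports heap usage per node. A minimal check (the `filter_path` parameter is standard in 5.x and only trims the response):

GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent

Sustained heap usage above roughly 75% tends to keep the collector busy and would match the overhead messages above.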

Any help on the matter is very much appreciated!
While the cluster is in this state, no documents can be indexed or retrieved, and several other services fail as well.

Thanks in advance!

Your index data size is too large for this shard count. I think you should consider time-based indices. It is generally recommended to keep shard size below 50 GB; based on your data size, your average shard size is around 600 GB / 9 ≈ 66 GB.
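A sketch of splitting the data into time-based indices with the `_reindex` API (available since ES 5.0). The target index name and the `created_at` date field are illustrative; you would use whatever timestamp field your documents actually carry, and repeat per time bucket:

POST /_reindex
{
  "source": {
    "index": "de_gdsk",
    "query": {
      "range": { "created_at": { "gte": "2018-05-01", "lt": "2018-06-01" } }
    }
  },
  "dest": { "index": "de_gdsk-2018-05" }
}

Smaller per-shard segments should also reduce the heap pressure you are seeing during searches.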
