Shards getting unassigned due to heavy load

Hi,

I have a cluster that contains 3 data nodes with 31 GB each and 3 master nodes with 8 GB each. I have around 252 shards in total, and the total data size is approx. 65 GB.

I am using Elasticsearch 7.7.1.

When I run a load of approx. 600 concurrent search users and 200 indexing users, my shards go into an unassigned state within 10-15 minutes. All my nodes are up.

My nodes first get disconnected and then rejoin the cluster automatically, but after rejoining, the shards stay unassigned.
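For reference, here is roughly how I check which shards are unassigned and why. This is only a rough sketch using the Python requests library, assuming the cluster is reachable at http://localhost:9200 without authentication (adjust the host, port, and auth for your setup):

import requests

ES = "http://localhost:9200"  # assumption: adjust host/port and auth for your cluster

# List every shard together with its state and, if unassigned, the reason
shards = requests.get(
    f"{ES}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,state,unassigned.reason"},
    timeout=10,
).json()
for shard in shards:
    if shard["state"] != "STARTED":
        print(shard)

# Ask the cluster why the first unassigned shard it finds cannot be allocated
# (this call returns an error if nothing is currently unassigned)
print(requests.get(f"{ES}/_cluster/allocation/explain", timeout=10).json())

Below is the first exception I am seeing: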

Wrapped by: org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) ~[na:na]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) ~[na:na]
	at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:231) ~[na:na]
	at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:119) ~[na:na]
	at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:53) ~[na:na]
	at org.elasticsearch.action.search.CountedCollector.onFailure(CountedCollector.java:76) ~[na:na]
	at org.elasticsearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:198) ~[na:na]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[na:na]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:402) ~[na:na]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[na:na]
	at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1001) ~[na:na]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[na:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:na]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:na]
	at java.lang.Thread.run(Thread.java:748) ~[na:na]

Also, I am getting the exception below.

[2020-08-05T14:14:54,724][WARN ][o.e.i.c.IndicesClusterStateService] [test-cluster] [test-index][1] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:481) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:718) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:176) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:592) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:568) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:248) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.7.1.jar:7.7.1]
        at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_202]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.1.jar:7.7.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [test-index][1]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:771) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:686) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:401) ~[elasticsearch-7.7.1.jar:7.7.1]
        ... 18 more
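For reference, allocations that have failed too many times (like the shard lock failures above) can be retried once the nodes have rejoined. A rough sketch of that call with the Python requests library, under the same localhost/no-auth assumption:

import requests

ES = "http://localhost:9200"  # assumption: adjust host/port and auth for your cluster

# Retry shard allocations that have hit the maximum number of failed attempts
resp = requests.post(f"{ES}/_cluster/reroute", params={"retry_failed": "true"}, timeout=30)
print(resp.status_code)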

What changes need to be done at the cluster level to make the performance better?

Thanks

What type of storage are your nodes using? What is the full output of the cluster stats API?
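You can fetch it with something like the following. This is only a rough sketch using the Python requests library and assumes the cluster is reachable at http://localhost:9200 without authentication; adjust it for your setup:

import requests

# Sketch: dump the full cluster stats (assumes http://localhost:9200 and no auth)
print(requests.get("http://localhost:9200/_cluster/stats", params={"human": "true"}, timeout=10).text)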

Hi @Christian_Dahlqvist, thank you for the quick response. Please have a look at the information below.

Cluster Stats.

{
"_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
},
"cluster_name": "test-cluster",
"cluster_uuid": "aVCH8KGiQWeT-LmLaXpICA",
"timestamp": 1597079387394,
"status": "green",
"indices": {
    "count": 62,
    "shards": {
        "total": 252,
        "primaries": 126,
        "replication": 1.0,
        "index": {
            "shards": {
                "min": 4,
                "max": 8,
                "avg": 4.064516129032258
            },
            "primaries": {
                "min": 2,
                "max": 4,
                "avg": 2.032258064516129
            },
            "replication": {
                "min": 1.0,
                "max": 1.0,
                "avg": 1.0
            }
        }
    },
    "docs": {
        "count": 30567482,
        "deleted": 1052
    },
    "store": {
        "size_in_bytes": 63394460642
    },
    "fielddata": {
        "memory_size_in_bytes": 0,
        "evictions": 0
    },
    "query_cache": {
        "memory_size_in_bytes": 0,
        "total_count": 0,
        "hit_count": 0,
        "miss_count": 0,
        "cache_size": 0,
        "cache_count": 0,
        "evictions": 0
    },
    "completion": {
        "size_in_bytes": 0
    },
    "segments": {
        "count": 2301,
        "memory_in_bytes": 95544720,
        "terms_memory_in_bytes": 79381344,
        "stored_fields_memory_in_bytes": 1618360,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 10557440,
        "points_memory_in_bytes": 0,
        "doc_values_memory_in_bytes": 3987576,
        "index_writer_memory_in_bytes": 0,
        "version_map_memory_in_bytes": 0,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {}
    },
    "mappings": {
        "field_types": [
            {
                "name": "boolean",
                "count": 3,
                "index_count": 1
            },
            {
                "name": "date",
                "count": 808,
                "index_count": 62
            },
            {
                "name": "float",
                "count": 183,
                "index_count": 61
            },
            {
                "name": "geo_point",
                "count": 61,
                "index_count": 61
            },
            {
                "name": "integer",
                "count": 613,
                "index_count": 61
            },
            {
                "name": "keyword",
                "count": 11276,
                "index_count": 62
            },
            {
                "name": "text",
                "count": 10483,
                "index_count": 62
            }
        ]
    },
    "analysis": {
        "char_filter_types": [],
        "tokenizer_types": [],
        "filter_types": [],
        "analyzer_types": [
            {
                "name": "custom",
                "count": 62,
                "index_count": 62
            }
        ],
        "built_in_char_filters": [],
        "built_in_tokenizers": [
            {
                "name": "keyword",
                "count": 62,
                "index_count": 62
            }
        ],
        "built_in_filters": [
            {
                "name": "lowercase",
                "count": 62,
                "index_count": 62
            }
        ],
        "built_in_analyzers": []
    }
},
"nodes": {
    "count": {
        "total": 6,
        "coordinating_only": 0,
        "data": 3,
        "ingest": 0,
        "master": 3,
        "ml": 0,
        "remote_cluster_client": 0,
        "transform": 0,
        "voting_only": 0
    },
    "versions": [
        "7.7.1"
    ],
    "os": {
        "available_processors": 192,
        "allocated_processors": 192,
        "names": [
            {
                "name": "Linux",
                "count": 6
            }
        ],
        "pretty_names": [
            {
                "pretty_name": "CentOS release 6.6 (Final)",
                "count": 6
            }
        ],
        "mem": {
            "total_in_bytes": 1625048629248,
            "free_in_bytes": 461002936320,
            "used_in_bytes": 1164045692928,
            "free_percent": 28,
            "used_percent": 72
        }
    },
    "process": {
        "cpu": {
            "percent": 0
        },
        "open_file_descriptors": {
            "min": 767,
            "max": 1480,
            "avg": 1105
        }
    },
    "jvm": {
        "max_uptime_in_millis": 1045591,
        "versions": [
            {
                "version": "1.8.0_202",
                "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
                "vm_version": "25.202-b08",
                "vm_vendor": "Oracle Corporation",
                "bundled_jdk": true,
                "using_bundled_jdk": false,
                "count": 6
            }
        ],
        "mem": {
            "heap_used_in_bytes": 21837092232,
            "heap_max_in_bytes": 146129682432
        },
        "threads": 755
    },
    "fs": {
        "total_in_bytes": 930503200768,
        "free_in_bytes": 762936004608,
        "available_in_bytes": 715655503872
    },
    "plugins": [],
    "network_types": {
        "transport_types": {
            "security4": 6
        },
        "http_types": {
            "security4": 6
        }
    },
    "discovery_types": {
        "zen": 6
    },
    "packaging_types": [
        {
            "flavor": "default",
            "type": "tar",
            "count": 6
        }
    ],
    "ingest": {
        "number_of_pipelines": 2,
        "processor_stats": {
            "gsub": {
                "count": 0,
                "failed": 0,
                "current": 0,
                "time_in_millis": 0
            },
            "script": {
                "count": 0,
                "failed": 0,
                "current": 0,
                "time_in_millis": 0
            }
        }
    }
}

Also, I am storing data on the local machines. The filesystem is ext4.

What type of storage are you using? Local SSDs?

Yes @Christian_Dahlqvist. Local SSDs.

Do you have memory swapping to disk?
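You can check whether each node has locked its heap in memory with something like this (a rough sketch with the Python requests library, assuming http://localhost:9200 and no authentication):

import requests

# Sketch: report whether each node managed to lock its heap in memory (mlockall)
resp = requests.get(
    "http://localhost:9200/_nodes",  # assumption: adjust host/port and auth
    params={"filter_path": "**.mlockall"},
    timeout=10,
)
print(resp.json())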

Why SSDs with ext4 and not XFS? Just curious.

How many SSDs are there, and are they in a RAID configuration?
Are the SSDs also running the OS?

The problem here is the nodes leaving the cluster. When that happens all sorts of other things will go wrong.

On the elected master, look for logs containing the phrase node-left from the MasterService and share them here please.
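Something along these lines can pull them out. This is only a sketch in Python; the log path is an example, so point it at wherever your elected master actually writes its logs:

# Sketch: print node-left events from the elected master's log file.
# The path below is only an example; adjust it to your actual log location.
log_path = "/var/log/elasticsearch/test-cluster.log"

with open(log_path, encoding="utf-8", errors="replace") as f:
    for line in f:
        if "node-left" in line and "MasterService" in line:
            print(line.rstrip())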

Sorry, my bad. It is SSDs with XFS. Sorry for the confusion.

Hi @DavidTurner, Please find the below exceptions.

First, on the master node and also on two of the data nodes, I can see full GC running. I am sharing the master node logs first and then the data node logs.

Master node :

[2020-08-10T17:49:46,186][WARN ][o.e.m.j.JvmGcMonitorService] [master-node-1] [gc][young][3246][3] duration [1s], collections [1]/[1.2s], total [1s]/[1.2s], memory [4.5gb]->[1.3gb]/[15.3gb], all_pools {[young] [3.9gb]->[65.2mb]/[4gb]}{[survivor] [570.4mb]->[682.6mb]/[682.6mb]}{[old] [0b]->[622mb]/[10.6gb]}
[2020-08-10T17:49:46,188][WARN ][o.e.m.j.JvmGcMonitorService] [master-node-1] [gc][3246] overhead, spent [1s] collecting in the last [1.2s]
[2020-08-10T17:50:18,715][WARN ][o.e.t.TransportService   ] [master-node-1] Received response for a request that has timed out, sent [14408ms] ago, timed out [4403ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{<data-node-ip2>}{<data-node-ip2>:12302}{d}{rackhost=<host-1>, xpack.installed=true}], id [18418]
[2020-08-10T17:58:57,664][WARN ][o.e.t.TransportService   ] [master-node-1] Received response for a request that has timed out, sent [29014ms] ago, timed out [19009ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{<data-node-ip2>}{<data-node-ip2>:12302}{d}{rackhost=<host-1>, xpack.installed=true}], id [23574]
[2020-08-10T17:58:57,665][WARN ][o.e.t.TransportService   ] [master-node-1] Received response for a request that has timed out, sent [18008ms] ago, timed out [8004ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{<data-node-ip2>}{<data-node-ip2>:12302}{d}{rackhost=<host-1>, xpack.installed=true}], id [23626]
[2020-08-10T17:59:04,385][INFO ][o.e.c.s.MasterService    ] [master-node-1] node-left[{data-node-2}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip2>}{<data-node-ip2>:12303}{d}{rackhost=<host-1>, xpack.installed=true} reason: followers check retry count exceeded], term: 9, version: 4757, delta: removed {{data-node-2}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip2>}{<data-node-ip2>:12303}{d}{rackhost=<host-1>, xpack.installed=true}}
[2020-08-10T17:59:54,764][INFO ][o.e.c.s.ClusterApplierService] [<master-node-1>] added {{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true}}, term: 9, version: 4760, reason: Publication{term=9, version=4760}
[2020-08-10T17:59:54,778][INFO ][o.e.c.s.MasterService    ] [<master-node-1>] node-left[{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true} reason: disconnected], term: 9, version: 4761, delta: removed {{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true}}
[2020-08-10T17:59:54,820][INFO ][o.e.c.s.ClusterApplierService] [<master-node-1>] removed {{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true}}, term: 9, version: 4761, reason: Publication{term=9, version=4761}
[2020-08-10T18:00:44,832][INFO ][o.e.c.c.C.CoordinatorPublication] [master-node-1] after [10s] publication of cluster state version [4771] is still waiting for {data-node-1}{b1E-_be2SsuL0Y59OUEVEQ}{5vM-N5-5QXe2cKFyQA3daA}{<data-node-1-ip>}{<data-node-1-ip>:12301}{d}{rackhost=<hostname-1>, xpack.installed=true} [SENT_APPLY_COMMIT]

Data Node :

On the data node, I can see GC info, node-disconnected exceptions, and rejoin exceptions.

Caused by: org.elasticsearch.transport.RemoteTransportException: [master-node-1][master-node-ip:15301][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{data-node-ip}{data-node:12302}{d}{rackhost=hostname, xpack.installed=true}] has been removed from the cluster
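For reference, per-node heap usage can be watched while the load test runs by polling the nodes stats API. A rough sketch with the Python requests library, assuming http://localhost:9200 and no authentication:

import time

import requests

ES = "http://localhost:9200"  # assumption: adjust host/port and auth for your cluster

# Sketch: poll per-node heap usage every 10 seconds during the load test (Ctrl-C to stop)
while True:
    stats = requests.get(f"{ES}/_nodes/stats/jvm", timeout=10).json()
    for node in stats["nodes"].values():
        print(node["name"], str(node["jvm"]["mem"]["heap_used_percent"]) + "% heap used")
    time.sleep(10)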

Also, here is some other configuration in the elasticsearch.yml file.

xpack.ml.enabled: false
node.transform: false
node.ml: false
xpack.transform.enabled: false
node.remote_cluster_client: false
discovery.seed_providers: file

Also, I am maintaining a text file on each of the nodes that contains all the master node IPs.
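For reference, cluster membership and the elected master can be confirmed through the _cat APIs. A rough sketch with the Python requests library, under the same localhost/no-auth assumption:

import requests

ES = "http://localhost:9200"  # assumption: adjust host/port and auth for your cluster

# Sketch: confirm which nodes have joined the cluster and which one is the elected master
print(requests.get(f"{ES}/_cat/nodes", params={"v": "true", "h": "name,node.role,master,heap.percent"}, timeout=10).text)
print(requests.get(f"{ES}/_cat/master", params={"v": "true"}, timeout=10).text)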

Thanks

node-left[{data-node-2}{...} reason: followers check retry count exceeded]
node-left[{data-node-3}{...} reason: disconnected]

This looks to be a connectivity issue: disconnected means a connection was broken by something outside of Elasticsearch, and followers check retry count exceeded could be explained by poor connectivity too.

See this section of the manual for more details.
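For reference, the fault-detection settings that control these checks can be inspected like this (a rough sketch with the Python requests library, assuming http://localhost:9200 and no authentication):

import requests

# Sketch: show the leader/follower fault-detection settings currently in effect
resp = requests.get(
    "http://localhost:9200/_cluster/settings",  # assumption: adjust host/port and auth
    params={"include_defaults": "true", "filter_path": "**.fault_detection"},
    timeout=10,
)
print(resp.json())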

@DavidTurner, thanks for the reply. I checked the connectivity between the nodes, and it looks good.

If connectivity were the issue, I would expect problems under normal load too, but with a normal load everything works fine.

I have some questions.

  • I am using aggregations and fetching 500 records in a single request. Could that put too much strain on the Elasticsearch data nodes?
  • Also, some of the fields I am querying support full-text search. Will that cause any issue?

Do we need to change any parameters, such as cache sizes, to make search faster?
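For reference, whether the search and write thread pools are queueing or rejecting work under this load can be checked with the _cat/thread_pool API. A rough sketch with the Python requests library, assuming http://localhost:9200 and no authentication:

import requests

ES = "http://localhost:9200"  # assumption: adjust host/port and auth for your cluster

# Sketch: check whether the search and write thread pools are queueing or rejecting work
print(requests.get(
    f"{ES}/_cat/thread_pool/search,write",
    params={"v": "true", "h": "node_name,name,active,queue,rejected,completed"},
    timeout=10,
).text)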

When you say connectivity "looks good" what exactly do you mean? Elasticsearch is reporting disconnected indicating that there is a connectivity issue.

Is there any way I can check for connectivity issues?

Under normal user load, none of my nodes go down and none of my shards become unassigned.

I can't really help with that, sorry, I can only tell you about the Elasticsearch side of things. Elasticsearch only ever disconnects from the master when shutting down so these disconnections are being caused by something outside of Elasticsearch. It's certainly possible that a poorly-configured or overloaded network can result in disconnections under load, but digging into the details is outside of my area I'm afraid.

Thanks David. I will check that. :slight_smile:
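For anyone finding this later, a very basic starting point is a TCP connect test between the node transport ports. A rough sketch in Python; the hosts and ports below are placeholders, so substitute the real node addresses and transport ports:

import socket

# Sketch: basic TCP reachability check between nodes.
# The hosts and ports below are placeholders; use the real node addresses and transport ports.
targets = [
    ("data-node-1", 12301),
    ("data-node-2", 12302),
    ("data-node-3", 12303),
    ("master-node-1", 15301),
]

for host, port in targets:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(host + ":" + str(port) + " reachable")
    except OSError as exc:
        print(host + ":" + str(port) + " NOT reachable: " + str(exc))

Note that this only shows whether a connection can be opened at all; it will not catch established connections being dropped under load.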

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.