Due to heavy load, Shards getting unassigned

het · August 10, 2020, 3:32pm

Hi,

I have a cluster contains 3 data nodes 31 GB of each, and 3 master nodes 8 GB of each. I have around a total of 252 shards. The total data size is approx. 65 GB.

I am using 7.7.1 ElasticSearch

When I run a load of approx. 600 Users, concurrent search, and 200 users indexing, My shards are going into an unassigned state within 10-15 minutes. My all nodes are up.

My nodes first getting disconnected and rejoined the cluster automatically, but after rejoining, shards get unassigned.

Wrapped by: org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) ~[na:na]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) ~[na:na]
	at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:231) ~[na:na]
	at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:119) ~[na:na]
	at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:53) ~[na:na]
	at org.elasticsearch.action.search.CountedCollector.onFailure(CountedCollector.java:76) ~[na:na]
	at org.elasticsearch.action.search.FetchSearchPhase$2.onFailure(FetchSearchPhase.java:198) ~[na:na]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[na:na]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:402) ~[na:na]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[na:na]
	at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1001) ~[na:na]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[na:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:na]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:na]
	at java.lang.Thread.run(Thread.java:748) ~[na:na]

Also, I am getting below exception.

[2020-08-05T14:14:54,724][WARN ][o.e.i.c.IndicesClusterStateService] [test-cluster] [test-index][1] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:481) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:718) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:176) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:592) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:568) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:248) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.7.1.jar:7.7.1]
        at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_202]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.1.jar:7.7.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [test-index][1]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]
        at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:771) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:686) ~[elasticsearch-7.7.1.jar:7.7.1]
        at org.elasticsearch.index.IndexService.createShard(IndexService.java:401) ~[elasticsearch-7.7.1.jar:7.7.1]
        ... 18 more

What changes need to be done at the cluster level to make the performance better?

Thanks

Christian_Dahlqvist · August 10, 2020, 4:46pm

What type of storage are your nodes using? What is the full output of the cluster stats API?

het · August 10, 2020, 5:21pm

Hi @Christian_Dahlqvist, Thank you for the quick response. Please have a look of the below information.

Cluster Stats.

{
"_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
},
"cluster_name": "test-cluster",
"cluster_uuid": "aVCH8KGiQWeT-LmLaXpICA",
"timestamp": 1597079387394,
"status": "green",
"indices": {
    "count": 62,
    "shards": {
        "total": 252,
        "primaries": 126,
        "replication": 1.0,
        "index": {
            "shards": {
                "min": 4,
                "max": 8,
                "avg": 4.064516129032258
            },
            "primaries": {
                "min": 2,
                "max": 4,
                "avg": 2.032258064516129
            },
            "replication": {
                "min": 1.0,
                "max": 1.0,
                "avg": 1.0
            }
        }
    },
    "docs": {
        "count": 30567482,
        "deleted": 1052
    },
    "store": {
        "size_in_bytes": 63394460642
    },
    "fielddata": {
        "memory_size_in_bytes": 0,
        "evictions": 0
    },
    "query_cache": {
        "memory_size_in_bytes": 0,
        "total_count": 0,
        "hit_count": 0,
        "miss_count": 0,
        "cache_size": 0,
        "cache_count": 0,
        "evictions": 0
    },
    "completion": {
        "size_in_bytes": 0
    },
    "segments": {
        "count": 2301,
        "memory_in_bytes": 95544720,
        "terms_memory_in_bytes": 79381344,
        "stored_fields_memory_in_bytes": 1618360,
        "term_vectors_memory_in_bytes": 0,
        "norms_memory_in_bytes": 10557440,
        "points_memory_in_bytes": 0,
        "doc_values_memory_in_bytes": 3987576,
        "index_writer_memory_in_bytes": 0,
        "version_map_memory_in_bytes": 0,
        "fixed_bit_set_memory_in_bytes": 0,
        "max_unsafe_auto_id_timestamp": -1,
        "file_sizes": {}
    },
    "mappings": {
        "field_types": [
            {
                "name": "boolean",
                "count": 3,
                "index_count": 1
            },
            {
                "name": "date",
                "count": 808,
                "index_count": 62
            },
            {
                "name": "float",
                "count": 183,
                "index_count": 61
            },
            {
                "name": "geo_point",
                "count": 61,
                "index_count": 61
            },
            {
                "name": "integer",
                "count": 613,
                "index_count": 61
            },
            {
                "name": "keyword",
                "count": 11276,
                "index_count": 62
            },
            {
                "name": "text",
                "count": 10483,
                "index_count": 62
            }
        ]
    },
    "analysis": {
        "char_filter_types": [],
        "tokenizer_types": [],
        "filter_types": [],
        "analyzer_types": [
            {
                "name": "custom",
                "count": 62,
                "index_count": 62
            }
        ],
        "built_in_char_filters": [],
        "built_in_tokenizers": [
            {
                "name": "keyword",
                "count": 62,
                "index_count": 62
            }
        ],
        "built_in_filters": [
            {
                "name": "lowercase",
                "count": 62,
                "index_count": 62
            }
        ],
        "built_in_analyzers": []
    }
},
"nodes": {
    "count": {
        "total": 6,
        "coordinating_only": 0,
        "data": 3,
        "ingest": 0,
        "master": 3,
        "ml": 0,
        "remote_cluster_client": 0,
        "transform": 0,
        "voting_only": 0
    },
    "versions": [
        "7.7.1"
    ],
    "os": {
        "available_processors": 192,
        "allocated_processors": 192,
        "names": [
            {
                "name": "Linux",
                "count": 6
            }
        ],
        "pretty_names": [
            {
                "pretty_name": "CentOS release 6.6 (Final)",
                "count": 6
            }
        ],
        "mem": {
            "total_in_bytes": 1625048629248,
            "free_in_bytes": 461002936320,
            "used_in_bytes": 1164045692928,
            "free_percent": 28,
            "used_percent": 72
        }
    },
    "process": {
        "cpu": {
            "percent": 0
        },
        "open_file_descriptors": {
            "min": 767,
            "max": 1480,
            "avg": 1105
        }
    },
    "jvm": {
        "max_uptime_in_millis": 1045591,
        "versions": [
            {
                "version": "1.8.0_202",
                "vm_name": "Java HotSpot(TM) 64-Bit Server VM",
                "vm_version": "25.202-b08",
                "vm_vendor": "Oracle Corporation",
                "bundled_jdk": true,
                "using_bundled_jdk": false,
                "count": 6
            }
        ],
        "mem": {
            "heap_used_in_bytes": 21837092232,
            "heap_max_in_bytes": 146129682432
        },
        "threads": 755
    },
    "fs": {
        "total_in_bytes": 930503200768,
        "free_in_bytes": 762936004608,
        "available_in_bytes": 715655503872
    },
    "plugins": [],
    "network_types": {
        "transport_types": {
            "security4": 6
        },
        "http_types": {
            "security4": 6
        }
    },
    "discovery_types": {
        "zen": 6
    },
    "packaging_types": [
        {
            "flavor": "default",
            "type": "tar",
            "count": 6
        }
    ],
    "ingest": {
        "number_of_pipelines": 2,
        "processor_stats": {
            "gsub": {
                "count": 0,
                "failed": 0,
                "current": 0,
                "time_in_millis": 0
            },
            "script": {
                "count": 0,
                "failed": 0,
                "current": 0,
                "time_in_millis": 0
            }
        }
    }
}

Also, I am storing data on the local machines. The storage type is EXT4.

Christian_Dahlqvist · August 10, 2020, 5:31pm

What type of storage are you using? Local SSDs?

het · August 10, 2020, 7:42pm

Yes @Christian_Dahlqvist. Local SSDs.

Jugsofbeer · August 11, 2020, 10:26am

Do you have memory swapping to disk?

Why ssd with ext4 and not xfs? just curious.

how many ssds, are they raided ?
are the ssds also running the os?

DavidTurner · August 11, 2020, 11:31am

The problem here is the nodes leaving the cluster. When that happens all sorts of other things will go wrong.

On the elected master, look for logs containing the phrase node-left from the MasterService and share them here please.

het · August 12, 2020, 5:44pm

Sorry my bad. It is SSDs with xfs. Sorry for the confusion.

het · August 12, 2020, 6:08pm

Hi @DavidTurner, Please find the below exceptions.

First On master node and also two of the data node, I can see full GC is running. I am sharing master node logs first and then data nodes.

Master node :

[2020-08-10T17:49:46,186][WARN ][o.e.m.j.JvmGcMonitorService] [master-node-1] [gc][young][3246][3] duration [1s], collections [1]/[1.2s], total [1s]/[1.2s], memory [4.5gb]->[1.3gb]/[15.3gb], all_pools {[young] [3.9gb]->[65.2mb]/[4gb]}{[survivor] [570.4mb]->[682.6mb]/[682.6mb]}{[old] [0b]->[622mb]/[10.6gb]}
[2020-08-10T17:49:46,188][WARN ][o.e.m.j.JvmGcMonitorService] [master-node-1] [gc][3246] overhead, spent [1s] collecting in the last [1.2s]
[2020-08-10T17:50:18,715][WARN ][o.e.t.TransportService   ] [master-node-1] Received response for a request that has timed out, sent [14408ms] ago, timed out [4403ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{<data-node-ip2>}{<data-node-ip2>:12302}{d}{rackhost=<host-1>, xpack.installed=true}], id [18418]
[2020-08-10T17:58:57,664][WARN ][o.e.t.TransportService   ] [master-node-1] Received response for a request that has timed out, sent [29014ms] ago, timed out [19009ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{<data-node-ip2>}{<data-node-ip2>:12302}{d}{rackhost=<host-1>, xpack.installed=true}], id [23574]
[2020-08-10T17:58:57,665][WARN ][o.e.t.TransportService   ] [master-node-1] Received response for a request that has timed out, sent [18008ms] ago, timed out [8004ms] ago, action [internal:coordination/fault_detection/follower_check], node [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{<data-node-ip2>}{<data-node-ip2>:12302}{d}{rackhost=<host-1>, xpack.installed=true}], id [23626]
[2020-08-10T17:59:04,385][INFO ][o.e.c.s.MasterService    ] [master-node-1] node-left[{data-node-2}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip2>}{<data-node-ip2>:12303}{d}{rackhost=<host-1>, xpack.installed=true} reason: followers check retry count exceeded], term: 9, version: 4757, delta: removed {{data-node-2}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip2>}{<data-node-ip2>:12303}{d}{rackhost=<host-1>, xpack.installed=true}}
[2020-08-10T17:59:54,764][INFO ][o.e.c.s.ClusterApplierService] [<master-node-1>] added {{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true}}, term: 9, version: 4760, reason: Publication{term=9, version=4760}
[2020-08-10T17:59:54,778][INFO ][o.e.c.s.MasterService    ] [<master-node-1>] node-left[{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true} reason: disconnected], term: 9, version: 4761, delta: removed {{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true}}
[2020-08-10T17:59:54,820][INFO ][o.e.c.s.ClusterApplierService] [<master-node-1>] removed {{data-node-3}{Nkr81e3DQI-sxIRYlu-4FQ}{6aUvApY1RsSwJ_udnX2zxA}{<data-node-ip3}{<data-node-ip3:12303}{d}{rackhost=hostname, xpack.installed=true}}, term: 9, version: 4761, reason: Publication{term=9, version=4761}
[2020-08-10T18:00:44,832][INFO ][o.e.c.c.C.CoordinatorPublication] [master-node-1] after [10s] publication of cluster state version [4771] is still waiting for {data-node-1}{b1E-_be2SsuL0Y59OUEVEQ}{5vM-N5-5QXe2cKFyQA3daA}{<data-node-1-ip>}{<data-node-1-ip>:12301}{d}{rackhost=<hostname-1>, xpack.installed=true} [SENT_APPLY_COMMIT]

Data Node :

On data node, I can see GC info. and node disconnected exception and rejoin exception.

Caused by: org.elasticsearch.transport.RemoteTransportException: [master-node-1][master-node-ip:15301][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{data-node-2}{YKaOz7KJRbiYEVw61Qk4tg}{oSTRQmJ4TMCqDer1jfhw1Q}{data-node-ip}{data-node:12302}{d}{rackhost=hostname, xpack.installed=true}] has been removed from the cluster

Also, here is some other configuration in the elasticsearch.yml file.

xpack.ml.enabled: false
node.transform: false
node.ml: false
xpack.transform.enabled: false
node.remote_cluster_client: false
discovery.seed_providers: file

Also, I am maintaining a text file on each of the nodes and it contains all master node IP.

Thanks

DavidTurner · August 12, 2020, 6:29pm

node-left[{data-node-2}{...} reason: followers check retry count exceeded]
node-left[{data-node-3}{...} reason: disconnected]

This looks to be a connectivity issue, disconnected means a connection was broken by something outside of Elasticsearch, and followers check retry count exceeded could be explained by poor connectivity too.

See this section of the manual for more details.

het · August 13, 2020, 5:15pm

@DavidTurner -- Thanks for the reply. I checked the connectivity between nodes, it looks good.

If connectivity will be the issue, with normal load, all the things are working fine.

Have some questions.

I am using aggregation and fetching 500 records in a single shot. Does it make a short-circuit for elastic data nodes?
Also. some of my fields which I am querying are supporting full text search. will it cause any issue?

Do we need to change any parameter like caching size or any other to make search more faster?

DavidTurner · August 13, 2020, 5:45pm

When you say connectivity "looks good" what exactly do you mean? Elasticsearch is reporting disconnected indicating that there is a connectivity issue.

het · August 13, 2020, 5:48pm

Is there any way I can check connectivity issue?

With the normal user's load, none of my nodes get down or none of my shards are getting unassigned.

DavidTurner · August 13, 2020, 6:30pm

I can't really help with that, sorry, I can only tell you about the Elasticsearch side of things. Elasticsearch only ever disconnects from the master when shutting down so these disconnections are being caused by something outside of Elasticsearch. It's certainly possible that a poorly-configured or overloaded network can result in disconnections under load, but digging into the details is outside of my area I'm afraid.

het · August 14, 2020, 11:07am

Thanks David. I will check that.

system · September 11, 2020, 11:07am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Disappearing Data and Unassigned Shards Elasticsearch	5	843	July 6, 2017
Shards assigning is taking lot of time after a node is disconnected and added to the cluster Elasticsearch	1	365	February 28, 2017
Data loss after network disconnect Elasticsearch	2	606	July 6, 2017
Nodes Disagree About Cluster Membership, Stalled Unassigned Shards Elasticsearch	5	376	July 6, 2017
How to handle system failures in Elasticsearch cluster Elasticsearch	5	405	July 6, 2017

Due to heavy load, Shards getting unassigned

Related topics