I am using the AWS Elasticsearch Service. The problem I am facing is that while ingestion is running against one index, CPU utilization becomes very high (sometimes 1 or 2 master/data nodes go down, and sometimes ingestion fails). As a result, the cluster logs warnings/errors such as:
[WARN ][o.e.c.NodeConnectionsService] [7d6f3c47d7f582ced2c090fbf6a3afe5] failed to connect to node {a80c4027a8ff7917bd4f7h8j9k8g5f4d}{PsiWPcLZQEarONo143BLzQ}{ER6LljwCQfGnD4nm42n2TQ}{__IP__}{__IP__}{distributed_snapshot_deletion_enabled=true, __AMAZON_INTERNAL__, __AMAZON_INTERNAL__, cross_cluster_transport_address=__IP__} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [a80c4027a8ff7917bc0c3767dde0f72e][__IP__] connect_exception
    at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1299) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-7.1.1.jar:7.1.1]
[WARN ][o.e.t.TransportService ] [0b1fd2f1876cdcee4abd7a1dcee545454f] Received response for a request that has timed out, sent [18406ms] ago, timed out [3401ms] ago, action [__PATH__[n]], node [{784e7ea9800931208c1a36c04db940e3}{SITLwWDBTpKr2kKrHvkIRQ}{Gq22BKc7SYSuorfCb5lrKA}{__IP__}{__IP__}{distributed_snapshot_deletion_enabled=true, __AMAZON_INTERNAL__, __AMAZON_INTERNAL__, cross_cluster_transport_address=__IP__}], id [3297456]
My ES cluster has:
Data Nodes : 3 (i3.2xlarge.elasticsearch)
Master Nodes: 3 (c5.xlarge.elasticsearch)
Number of Indexes : 18 (Average size of each Index is 15 GB)
Of these, 9 indexes have 1 primary and 2 replica shards, and the rest have 2 primary and 2 replica shards.
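
To make the shard layout concrete, here is a rough sketch of the arithmetic. It assumes that "2 replica shards" means two replica copies per primary (number_of_replicas = 2) and that shards are allocated evenly across the data nodes; the variable and function names are only for illustration.

```python
# Shard arithmetic for the cluster described above.
# Assumption: "2 replica shards" means 2 replica copies per primary shard.
DATA_NODES = 3

# Each tuple: (number of indexes, primaries per index, replicas per primary)
index_groups = [
    (9, 1, 2),  # 9 indexes with 1 primary and 2 replicas
    (9, 2, 2),  # remaining 9 indexes with 2 primaries and 2 replicas
]


def total_shard_copies(groups):
    """Count every shard copy (primaries plus replicas) in the cluster."""
    return sum(n * primaries * (1 + replicas) for n, primaries, replicas in groups)


total = total_shard_copies(index_groups)
per_node = total / DATA_NODES  # assumes even allocation across data nodes

print(f"total shard copies: {total}, per data node: {per_node:.0f}")
```

Under that assumption, with three copies of each shard on three data nodes, every data node ends up holding one copy of every shard, so a write to any index has to be applied on all three data nodes.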