Data node CPU utilization goes very high

I am using the AWS Elasticsearch service. While ingestion is running against one index, CPU utilization on the data nodes becomes very high (sometimes 1 or 2 master/data nodes go down, and sometimes ingestion fails). As a result, it logs warnings/errors such as:

[WARN ][o.e.c.NodeConnectionsService] [7d6f3c47d7f582ced2c090fbf6a3afe5] failed to connect to node {a80c4027a8ff7917bd4f7h8j9k8g5f4d}{PsiWPcLZQEarONo143BLzQ}{ER6LljwCQfGnD4nm42n2TQ}{__IP__}{__IP__}{distributed_snapshot_deletion_enabled=true, __AMAZON_INTERNAL__, __AMAZON_INTERNAL__, cross_cluster_transport_address=__IP__} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [a80c4027a8ff7917bc0c3767dde0f72e][__IP__] connect_exception
    at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1299) ~[elasticsearch-7.1.1.jar:7.1.1]
    at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:99) ~[elasticsearch-7.1.1.jar:7.1.1]

[WARN ][o.e.t.TransportService ] [0b1fd2f1876cdcee4abd7a1dcee545454f] Received response for a request that has timed out, sent [18406ms] ago, timed out [3401ms] ago, action [__PATH__[n]], node [{784e7ea9800931208c1a36c04db940e3}{SITLwWDBTpKr2kKrHvkIRQ}{Gq22BKc7SYSuorfCb5lrKA}{__IP__}{__IP__}{distributed_snapshot_deletion_enabled=true, __AMAZON_INTERNAL__, __AMAZON_INTERNAL__, cross_cluster_transport_address=__IP__}], id [3297456]

My ES cluster has:
Data Nodes : 3 (i3.2xlarge.elasticsearch)
Master Nodes: 3 (c5.xlarge.elasticsearch)
Number of Indexes : 18 (Average size of each Index is 15 GB)

Nine of the indexes have 1 primary and 2 replica shards; the rest have 2 primary and 2 replica shards.
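For context, the shard layout above can be turned into a rough per-node count. This is only a back-of-the-envelope sketch: it assumes "2 replica shards" means `number_of_replicas = 2` (two replica copies of every primary), that "the rest" is the remaining 9 of the 18 indexes, and that shards are spread evenly over the 3 data nodes. The helper name `shard_copies` is made up for illustration.

```python
# Rough shard arithmetic for the cluster described above.
# Assumption: "2 replica shards" means number_of_replicas = 2,
# i.e. every primary has two replica copies.

def shard_copies(indices: int, primaries: int, replicas: int) -> int:
    """Total shard copies (primaries plus replicas) for a group of indices."""
    return indices * primaries * (1 + replicas)

DATA_NODES = 3

small_group = shard_copies(indices=9, primaries=1, replicas=2)  # 9 * 1 * 3
large_group = shard_copies(indices=9, primaries=2, replicas=2)  # 9 * 2 * 3

total_shards = small_group + large_group
shards_per_node = total_shards / DATA_NODES

print(f"total shard copies: {total_shards}")
print(f"average per data node: {shards_per_node:.0f}")
```

Under those assumptions, each shard has three copies on three data nodes, so every data node holds a copy of every shard and every indexed document is written on all three nodes. That may be part of why ingestion into a single index loads the whole cluster, though the thread does not confirm this.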

Welcome to our community! :smiley: Please note that AWS ES is a fork of what we offer, so we may not be able to provide support for this problem.

What sort of monitoring functionality do you have in place?

Thank you, Mark. Yes, you are right, but I wanted a bit more insight into what could be happening on a data node when it starts consuming more CPU.

Ok, so;

Have you looked at hot threads?
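In case it helps, here is a minimal sketch of pulling hot threads with plain Python while ingestion is running. It uses the nodes hot threads API (`GET /_nodes/hot_threads`) with its `threads`, `interval`, and `type` parameters. The endpoint URL is a placeholder, and the sketch assumes the domain accepts unsigned HTTP requests from where the script runs; an AWS domain with IAM-based access control would need signed requests instead, which is not shown here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(endpoint: str, threads: int = 3,
                    interval: str = "500ms", kind: str = "cpu") -> str:
    """Build the URL for the nodes hot threads API."""
    query = urlencode({"threads": threads, "interval": interval, "type": kind})
    return f"{endpoint.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(endpoint: str, timeout: float = 30.0) -> str:
    """Return the plain-text hot threads report for all nodes.

    Assumes the endpoint allows unsigned requests (hypothetical setup).
    """
    with urlopen(hot_threads_url(endpoint), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Example usage (placeholder endpoint, not a real domain):
# print(fetch_hot_threads("https://my-domain.example.com"))
```

The report is plain text, one section per node, listing the busiest threads and their stack traces. Capturing it a few times during a CPU spike should show whether the time is going to indexing (write threads), merges, searches, or something else.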
