Understanding ES performance metrics

Juris_Rats · March 9, 2018, 9:04am

Hi,
I am asking for help again.

I have created an ES cluster to evaluate its performance for our use cases. The cluster lives in the MS Azure cloud and has 6 nodes: 1 dedicated master-only node, 2 hot data nodes and 3 warm data nodes. The hot nodes hold a single index with the current year's data; the warm nodes hold 5 indexes, one for each of the past 5 years. Each index has about 200GB of data (190 million documents). Each index has 5 shards, so each shard is about 40GB with about 40M docs. The hardware used is:
Master node: 1 vCPU 1.75GB RAM
Hot nodes: 4 vCPU 7GB RAM with 1TB HDD
Warm nodes: 4 vCPU 7GB RAM with 2TB HDD
Perhaps the RAM is too small for the data volume, but I would still like to understand the limits and what can be done with which configurations.
My problem is that I do not understand how to use the ES metrics to see what is going on. I collect statistics every 10 seconds and build visualisations in Kibana. I can see the charts, but as far as I can tell they do not show anything particularly bad. Still, I have had problems with reindexing (the scroll data was lost), and recently (2018-03-05T15:14:45) one of the nodes restarted. Nothing in the ES logs or the performance metrics tells me why. I looked at the CentOS logs as well.
I am attaching some visualizations covering the time when the W1 node restarted yesterday. The master node log messages from the time when W1 became unavailable are at the bottom.
I will be grateful for any suggestions.
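To put the sizing question in numbers, here is a rough back-of-the-envelope sketch in Python. The index size, document count, shard count, node counts and RAM come from the figures above; the replica count (0) and the heap share (half of RAM) are assumptions, since neither is stated here.

```python
# Figures from the post; REPLICAS and HEAP_FRACTION are assumptions.
INDEX_SIZE_GB = 200
DOCS_PER_INDEX = 190_000_000
SHARDS_PER_INDEX = 5
HOT_NODES, WARM_NODES = 2, 3
HOT_INDEXES, WARM_INDEXES = 1, 5
NODE_RAM_GB = 7
REPLICAS = 0          # assumption: replica count is not stated
HEAP_FRACTION = 0.5   # assumption: JVM heap set to half of RAM


def per_node_data_gb(indexes, nodes, replicas=REPLICAS):
    """Average data volume per node if shards are spread evenly."""
    return indexes * INDEX_SIZE_GB * (1 + replicas) / nodes


shard_size_gb = INDEX_SIZE_GB / SHARDS_PER_INDEX      # 40 GB per shard
docs_per_shard = DOCS_PER_INDEX // SHARDS_PER_INDEX   # 38M, i.e. "about 40M"
hot_gb = per_node_data_gb(HOT_INDEXES, HOT_NODES)
warm_gb = per_node_data_gb(WARM_INDEXES, WARM_NODES)
heap_gb = NODE_RAM_GB * HEAP_FRACTION

print(f"shard: {shard_size_gb:.0f} GB, {docs_per_shard:,} docs")
print(f"hot node:  {hot_gb:.0f} GB data, {hot_gb / heap_gb:.0f} GB data per GB heap")
print(f"warm node: {warm_gb:.0f} GB data, {warm_gb / heap_gb:.0f} GB data per GB heap")
```

Under these assumptions each warm node carries roughly three times as much data per GB of heap as a hot node, which may be worth keeping in mind when reading the warm-node metrics.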
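For reference, the kind of collection described above could be sketched as follows. This is only an illustration, not the actual collector used here: the URL (`localhost:9200`) is an assumption, and `summarize` is a hypothetical helper that picks a few fields from the `_nodes/stats` response (heap usage, old-generation GC, CPU, thread pool rejections) that are often looked at first when a node becomes unresponsive.

```python
import json
from urllib.request import urlopen

# Assumption: ES is reachable on localhost:9200 without authentication.
STATS_URL = "http://localhost:9200/_nodes/stats/jvm,os,thread_pool"


def fetch_node_stats(url=STATS_URL, timeout=15):
    """Fetch node stats as a dict (performs an HTTP request)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(stats):
    """Reduce a _nodes/stats response to a few per-node pressure indicators."""
    rows = []
    for node_id, node in stats.get("nodes", {}).items():
        jvm = node.get("jvm", {})
        old_gc = jvm.get("gc", {}).get("collectors", {}).get("old", {})
        pools = node.get("thread_pool", {})
        rows.append({
            "node": node.get("name", node_id),
            "heap_used_percent": jvm.get("mem", {}).get("heap_used_percent"),
            "old_gc_count": old_gc.get("collection_count"),
            "old_gc_time_ms": old_gc.get("collection_time_in_millis"),
            "cpu_percent": node.get("os", {}).get("cpu", {}).get("percent"),
            # Keep only thread pools that have rejected work.
            "rejected": {
                name: pool["rejected"]
                for name, pool in pools.items()
                if pool.get("rejected", 0) > 0
            },
        })
    return rows
```

Calling `summarize(fetch_node_stats())` every 10 seconds and comparing consecutive samples (for example, the growth of `old_gc_time_ms` between two polls) gives a rate rather than a cumulative counter, which tends to be easier to read on a chart.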

[2018-03-05T14:21:26,048][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [M1] failed to execute on node [XzzBKysnS8-waG9OmIZhlQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [W1][10.0.0.5:9300][cluster:monitor/nodes/stats[n]] request_id [15161728] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:940) [elasticsearch-6.1.3.jar:6.1.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.3.jar:6.1.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
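If the master log contains many entries like the one above, a small script can show which nodes time out and how often. The sketch below is a hypothetical helper: the regular expression is written against the single `ReceiveTimeoutTransportException` line shown here and may need adjusting for other message shapes.

```python
import re
from collections import Counter

# Matches lines such as:
# ...ReceiveTimeoutTransportException: [W1][10.0.0.5:9300][cluster:monitor/nodes/stats[n]]
#    request_id [15161728] timed out after [15000ms]
TIMEOUT_RE = re.compile(
    r"ReceiveTimeoutTransportException: "
    r"\[(?P<node>[^\]]+)\]\[(?P<address>[^\]]+)\]\[(?P<action>.+)\] "
    r"request_id \[(?P<request_id>\d+)\] timed out after \[(?P<timeout_ms>\d+)ms\]"
)


def parse_timeout(line):
    """Return the fields of a transport timeout line, or None if it does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timeout_ms"] = int(fields["timeout_ms"])
    return fields


def count_timeouts(lines):
    """Count transport timeouts per target node name."""
    counts = Counter()
    for line in lines:
        fields = parse_timeout(line)
        if fields is not None:
            counts[fields["node"]] += 1
    return counts
```

Feeding it the lines of the master log (for example `count_timeouts(open(path))`, where `path` is wherever the log lives) gives a per-node tally that can be lined up against the heap and GC charts for the same time window.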

system · April 6, 2018, 9:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
