I am asking for help again.
I have created an ES cluster to evaluate its performance for our use cases. The cluster runs in the MS Azure cloud and has 6 nodes: 1 dedicated master-only node, 2 hot data nodes and 3 warm data nodes. The hot nodes hold a single index with the current year's data; the warm nodes hold 5 indexes covering the past 5 years. Each index holds about 200GB of data (190 million documents). Each index has 5 shards, so each shard is about 40GB with roughly 40M docs. The hardware used is:
Master node: 1 vCPU 1.75GB RAM
Hot nodes: 4 vCPU 7GB RAM with 1TB HDD
Warm nodes: 4 vCPU 7GB RAM with 2TB HDD
Perhaps the RAM is too small for the data volume, but I would still like to understand the limits and what can be done with which configurations.
My problem is that I do not understand how to use the ES metrics to work out what is going on. I collect statistics every 10 seconds and build visualisations in Kibana. I can see the charts, but as far as I can tell they do not show anything particularly bad. Still, I have had problems with reindexing (scroll data lost) and, recently (2018-03-05T15:14:45), a case where one of the nodes restarted. Nothing in the ES logs or performance metrics tells me why. I looked at the CentOS logs as well.
I am attaching some visualizations covering the time when the W1 node restarted yesterday. The master node's log messages from the time when W1 became unavailable are at the bottom.
I will be grateful for any suggestions.
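To make the sizing question concrete, here is a rough back-of-the-envelope sketch of the figures above. It assumes no replica shards and an even spread of shards across the nodes of each tier; neither is stated above, so treat the `replicas` default and the `Tier` class itself as illustrative assumptions rather than a description of the real cluster.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """One group of data nodes, using the figures quoted in the post."""
    name: str
    nodes: int
    indexes: int
    ram_gb: float
    index_size_gb: float = 200.0
    docs_per_index: int = 190_000_000
    shards_per_index: int = 5
    replicas: int = 0  # ASSUMPTION: the replica count is not given in the post

    def total_shards(self) -> int:
        return self.indexes * self.shards_per_index * (1 + self.replicas)

    def shard_size_gb(self) -> float:
        return self.index_size_gb / self.shards_per_index

    def docs_per_shard(self) -> int:
        return self.docs_per_index // self.shards_per_index

    def data_per_node_gb(self) -> float:
        # ASSUMPTION: shards are balanced evenly across the tier's nodes.
        total = self.indexes * self.index_size_gb * (1 + self.replicas)
        return total / self.nodes

    def data_to_ram_ratio(self) -> float:
        # On-disk data per node divided by the node's total RAM.
        return self.data_per_node_gb() / self.ram_gb


hot = Tier("hot", nodes=2, indexes=1, ram_gb=7)
warm = Tier("warm", nodes=3, indexes=5, ram_gb=7)

for tier in (hot, warm):
    print(
        f"{tier.name}: {tier.total_shards()} shards, "
        f"{tier.shard_size_gb():.0f} GB and {tier.docs_per_shard():,} docs per shard, "
        f"{tier.data_per_node_gb():.0f} GB per node, "
        f"{tier.data_to_ram_ratio():.1f}x data-to-RAM"
    )
```

Under these assumptions a warm node holds roughly three times as much data per GB of RAM as a hot node, which may be one reason to look at the warm tier first. Note that 190 million documents over 5 shards is 38M per shard, which the post rounds to 40M.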
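One way to get more out of the 10-second samples is to compare consecutive readings per node instead of only charting them. The sketch below assumes each sample is one node's entry from the `GET _nodes/stats/jvm` response and uses the `jvm.uptime_in_millis`, `jvm.mem.heap_used_percent` and `jvm.gc.collectors.old` fields from that response. The function names and the thresholds (85% heap, half of the interval spent in old-generation GC) are illustrative assumptions, not recommended values.

```python
def summarize_node(stats: dict) -> dict:
    """Pick a few JVM fields out of one node's entry in a nodes-stats response."""
    jvm = stats["jvm"]
    old_gc = jvm["gc"]["collectors"]["old"]
    return {
        "name": stats["name"],
        "uptime_ms": jvm["uptime_in_millis"],
        "heap_used_percent": jvm["mem"]["heap_used_percent"],
        "old_gc_count": old_gc["collection_count"],
        "old_gc_time_ms": old_gc["collection_time_in_millis"],
    }


def compare_samples(prev: dict, curr: dict, interval_ms: int = 10_000,
                    heap_limit: int = 85, gc_fraction: float = 0.5) -> list:
    """Return a list of findings for two consecutive summaries of the same node.

    heap_limit and gc_fraction are hypothetical thresholds for illustration.
    """
    findings = []
    if curr["uptime_ms"] < prev["uptime_ms"]:
        # JVM uptime went backwards, so the process restarted between samples.
        # Cumulative counters were reset, so the deltas below are meaningless.
        findings.append("restart")
        return findings
    gc_ms = curr["old_gc_time_ms"] - prev["old_gc_time_ms"]
    if gc_ms > gc_fraction * interval_ms:
        findings.append(f"old-gen GC took {gc_ms} ms of the last {interval_ms} ms")
    if curr["heap_used_percent"] >= heap_limit:
        findings.append(f"heap at {curr['heap_used_percent']}%")
    return findings
```

Running `compare_samples` over each pair of adjacent samples for W1 around the restart time would show whether heap usage or long old-generation collections were building up before the uptime counter reset.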
[2018-03-05T14:21:26,048][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [M1] failed to execute on node [XzzBKysnS8-waG9OmIZhlQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [W1][10.0.0.5:9300][cluster:monitor/nodes/stats[n]] request_id  timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:940) [elasticsearch-6.1.3.jar:6.1.3]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:568) [elasticsearch-6.1.3.jar:6.1.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
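To line these master-side timeouts up against the metric charts, the log can be reduced to a list of (time, node, action, timeout) events. This is a minimal sketch that assumes the log layout shown above, where a bracketed header line is followed by the exception line; the regular expressions and the `parse_timeouts` helper are written for this excerpt only and may need adjusting for other log formats.

```python
import re
from datetime import datetime

# Header line: [timestamp][LEVEL][logger] [reporting node] message
HEADER = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T[\d:,]+)\]\[(?P<level>\w+)\s*\]"
    r"\[(?P<logger>[^\]]+)\]\s*\[(?P<reporter>[^\]]+)\]"
)
# Exception line: [node][address][action] request_id ... timed out after [Nms]
TIMEOUT = re.compile(
    r"\[(?P<node>[^\]]+)\]\[(?P<addr>[^\]]+)\]\[(?P<action>\S+)\] request_id"
    r".*?timed out after \[(?P<ms>\d+)ms\]"
)


def parse_timeouts(lines):
    """Collect transport timeouts, stamped with the preceding header's time."""
    events = []
    last_ts = None
    reporter = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            last_ts = datetime.strptime(header["ts"], "%Y-%m-%dT%H:%M:%S,%f")
            reporter = header["reporter"]
            continue
        hit = TIMEOUT.search(line)
        if hit and "ReceiveTimeoutTransportException" in line:
            events.append({
                "time": last_ts,
                "reported_by": reporter,
                "node": hit["node"],
                "address": hit["addr"],
                "action": hit["action"],
                "timeout_ms": int(hit["ms"]),
            })
    return events


# The first two lines of the excerpt above, copied verbatim.
SAMPLE = [
    "[2018-03-05T14:21:26,048][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] "
    "[M1] failed to execute on node [XzzBKysnS8-waG9OmIZhlQ]",
    "org.elasticsearch.transport.ReceiveTimeoutTransportException: "
    "[W1][10.0.0.5:9300][cluster:monitor/nodes/stats[n]] request_id  "
    "timed out after [15000ms]",
]

for event in parse_timeouts(SAMPLE):
    print(event["time"], event["node"], event["action"], event["timeout_ms"])
```

Plotting how often these events occur per node over time would show whether W1 had already stopped answering stats requests well before the restart, or only at that moment.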