Hello everyone!
I am running an Elasticsearch cluster with 3 master nodes and 5 data nodes. I also have one coordinating node running on localhost that doubles as the Kibana query node (that is, it fans searches out across the cluster).
My use case is that I use Elasticsearch as a logging mechanism: whatever queries my server receives, I bundle them up and index them into the cluster.
However, one mistake I made was that I pointed the connector at only one data node, es.hosts = xxx.xxx.xxx (see the configuration sketch after the snippet below).
Worth mentioning: I am using Spark to push everything into Elasticsearch:
// es.mapping.id tells the connector which JSON field to use as the document _id
JavaEsSpark.saveJsonToEs(
        dataset.toJavaRDD(),
        getESIndex(API_TYPE.PREDICT),
        ImmutableMap.of("es.mapping.id", _ESID));
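For context, this is roughly how the connector is configured today. The host value is a placeholder, and in the elasticsearch-hadoop settings the property is actually called es.nodes, which is what I meant by es.hosts above:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // only one data node is listed here -- this is the single point of failure I mention above
        .set("es.nodes", "xxx.xxx.xxx")   // placeholder address of that one data node
        .set("es.port", "9200");          // default Elasticsearch HTTP port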
Then, over the weekend, the load on my system increased a lot and the cluster was in a hung state for over 12 hours. Once I realized this, I checked the logs and found:
[WARN ][o.e.m.j.JvmGcMonitorService] [NVMBD2BFM70V03] [gc][4109061] overhead, spent [956ms] collecting in the last [1.6s]
Apart from increasing the JVM heap on this node, how do I prevent such occurrences? What principles should be followed here? Also, I think the connector should have discovered my whole cluster instead of depending on a single point of failure; what is the configuration for that? My guess is sketched below.
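From reading the elasticsearch-hadoop configuration docs, my guess is that the fix is to list several nodes up front and leave the connector's node discovery enabled, roughly like the sketch below (the host names are placeholders), but I would like to know if this is the right principle:

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // list more than one node so the initial connection does not depend on a single host
        .set("es.nodes", "datanode1:9200,datanode2:9200,datanode3:9200")
        // es.nodes.discovery (default true) lets the connector find the remaining data nodes itself
        .set("es.nodes.discovery", "true");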
Also, listed below is my cluster health:
{"cluster_name":"MACHINELEARNING","status":"green","timed_out":false,"number_of_nodes":9,"number_of_data_nodes":5,"active_primary_shards":25,"active_shards":50,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}