Hi Vadim,
Please do post your findings. I'd be very interested. We're having similar issues with cluster crashes, though we have yet to find the root cause. Our setup is similar to Norberto's suggestion. A while back, client apps unicast to all nodes. Due to cluster stability issues, we changed this so client apps only talk to the masters. It really helped, and the cluster was stable for about a month. Then it crashed recently. After a complete restart, the cluster can't seem to stay up for more than 1-2 hrs. There is no indexing or search activity at this time. Yet we're seeing nodes go in and out of the cluster, including masters, which drives the elected master crazy, to the point where making a cluster health REST request to the master just hangs for a long time.
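(For reference, the health request in question is just the standard cluster health endpoint; a minimal example, assuming the elected master listens on the default HTTP port 9200:)

  curl -s 'http://<master-host>:9200/_cluster/health?pretty'

Normally that returns almost immediately, so the hang itself is a symptom.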
Thanks,
-Vinh
On Jul 15, 2013, at 2:35 AM, Norberto Meijome numard@gmail.com wrote:
Hi Vadim,
Do you have any disk activity?
We had similar cases in AWS where nodes would peg the CPU and usually become I/O bound too.
I changed the cluster so that we have 3 dedicated master nodes, with no data stored on them, on smaller instances. Then the data nodes, all of them configured with master=false. The app servers speak to the masters only, via load balancers. This smoothed out the crazy spikes, all nodes are loaded pretty evenly, and we haven't seen a case since where a node gets locked up or, worse, a split brain. (A rough config sketch follows below.)
As always, YMMV.
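Roughly, the elasticsearch.yml layout looks like this (an illustration from memory, not our exact config; the minimum_master_nodes line is the usual companion setting for a 3-master layout rather than something I described above):

  # on the 3 dedicated master nodes (smaller instances)
  node.master: true
  node.data: false

  # on every data node
  node.master: false
  node.data: true

  # usual companion setting with 3 master-eligible nodes, to avoid split brain
  discovery.zen.minimum_master_nodes: 2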
On 15/07/2013 7:09 PM, "Vadim Kisselmann" v.kisselmann@gmail.com wrote:
Hi Boaz,
we had a bigger crash last weekend, and now we have problems rebalancing our cluster. Suspiciously, again at 3am.
The master is fully loaded at 100% CPU, and this seems to block disk and network on AWS, because the other nodes don't replicate anything. You can
see with atop that disk reads/writes on the master are at 0, and MBr/s is between 0 and 1 MB.
Hot threads on the master are busy with over 100% CPU load. It's weird.
elasticsearch hot_threads · GitHub
Cheers,
Vadim
On Thursday, July 11, 2013 at 21:37:57 UTC+2, Boaz Leskes wrote:
Hi Vadim,
I don't know of any bug that causes such symptoms, but you never know. It may also be other stuff like scripts etc. Next time it happens (if it does; I understand it's rare), calling the hot threads API would really help in diagnosing it (see the nodes hot threads API in the Elasticsearch reference).
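A minimal example of calling it, assuming the default HTTP port and running it against the busy node:

  curl -s 'http://localhost:9200/_nodes/hot_threads'

You can also restrict it to a single node with /_nodes/<node-id-or-name>/hot_threads.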
Cheers,
Boaz
On Wed, Jul 10, 2013 at 11:50 AM, Vadim Kisselmann v.kiss...@gmail.com wrote:
Hi Boaz,
thanks for your reply.
- It's the default setting, 3 nodes: 5 shards x 1 replica per index.
- It was the master (high CPU load, and only the CPU; RAM, HDD I/O, network, everything else was fine).
After investigating everything, like the Tomcat logs from my services (connections, errors), settings, and so on, I found nothing suspicious. Everything
is like it was in the past months.
I have only one idea: ES has a bug in this old version (0.19.11) and something caused an endless loop, because only the CPU load was at 100% on all 8 cores, and nothing else on this machine was stressed.
Cheers,
Vadim
On Tuesday, July 9, 2013 at 15:35:54 UTC+2, Boaz Leskes wrote:
Hi Vadim,
Can you say a bit more about your cluster setup?
- How many primary shards did you have per index? How many replicas?
- Was the node that experienced the high CPU load also the cluster master at the time? (You can see in the logs which node was elected master.)
Cheers,
Boaz
On Tuesday, July 9, 2013 9:44:07 AM UTC+2, Vadim Kisselmann wrote:
Hi folks,
our cluster "crashed" last night.
We have a couple of symptoms and are trying to narrow down the problem.
Our setup: 3 nodes in AWS, ES version 0.19.11, 4 indices for different services.
The master was node1. The CPU load of this node suddenly rose to 100% from 3:00 to 3:30.
The other nodes' CPU load was small. The logs are empty. It was only the CPU load; memory consumption, network, etc. everything was normal.
Services which wanted to connect to their indices timed out after one minute with no response.
What happened here? Could a "slow" query from only one service be a trigger for this? What about the other nodes in the cluster, why did they not provide
any results for other services from indices which are still working (on nodes 2 & 3)?
A full cluster restart was the only solution.
But how can we prevent this case (one node down, the whole cluster does not answer)?
Cheers,
Vadim