Cluster unstable and running slowly

mvleandro · January 2, 2017, 9:56pm

Hello everyone!

I am running a cluster with 7 nodes under version 5.1.1.

I upgraded this cluster from an old 2.x version 20 days ago, and it was running fine until today.

Since 2pm today, all requests made to any node of the cluster have been very slow.

I checked for I/O and network throughput problems, but that is not the cause.
I also disabled some monitoring agents to isolate the problem.

I collected some logs from my current master node and uploaded them to my Google Drive.
My cluster log: https://drive.google.com/open?id=0B2uG9PA8RQJGX0xqNmJCUEM1Snc
My cluster deprecation log: https://drive.google.com/open?id=0B2uG9PA8RQJGcEh0R1JDTG9vaU0

I'm not sure whether the cluster upgrade caused the problem or some other factor that I can't see right now.

Could someone shed some light on what might be happening?

Mark_Harwood · January 3, 2017, 12:10pm

I'm seeing a lot of date/time parsing errors in the first of those log files.

mvleandro · January 3, 2017, 1:30pm

Update:

I stopped sending data to the cluster and restarted all nodes. Then I waited for the cluster to go green and resumed sending data.

After this, my cluster is back to running fine, and I have no idea what could have happened.

Maybe a bug in 5.x?

mvleandro · January 3, 2017, 1:35pm

Hello Mark!

Thanks for your answer. The DateTime parsing failures are a known issue that will be fixed soon. But I have a question: can these failures bring the whole cluster to a stop?

Mark_Harwood · January 3, 2017, 1:42pm

Throwing an exception is one of the more expensive things you can do in Java. If you're doing this on every doc in a firehose of log records I imagine that could get costly.

mvleandro · January 3, 2017, 1:45pm

Thanks again. I'll enforce this with the teams that are sending these failing requests to the cluster.
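
One way the sending teams could catch bad dates before they reach the cluster is a client-side check. This is only a rough sketch: the field name `@timestamp`, the format string, and both helper functions are hypothetical, and the real format has to match whatever date format the index mapping expects.

```python
from datetime import datetime

# Hypothetical format; replace it with the date format your index mapping expects.
EXPECTED_FORMAT = "%Y-%m-%dT%H:%M:%S"


def has_valid_timestamp(doc, field="@timestamp"):
    """Return True if doc[field] is a string that parses with EXPECTED_FORMAT."""
    value = doc.get(field)
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, EXPECTED_FORMAT)
    except ValueError:
        return False
    return True


def split_by_validity(docs, field="@timestamp"):
    """Separate documents into (valid, rejected) lists before indexing."""
    valid, rejected = [], []
    for doc in docs:
        if has_valid_timestamp(doc, field):
            valid.append(doc)
        else:
            rejected.append(doc)
    return valid, rejected
```

Only the `valid` list would then be sent to the cluster, while `rejected` can be logged or returned to the producing team, so the parsing failure is paid once on the client instead of on every indexing request.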

Mark_Harwood · January 3, 2017, 1:47pm

with extreme prejudice...

thiago · January 3, 2017, 3:22pm

Also, it seems that you are sending these requests to a master node. You should not send requests to master nodes if you want better cluster stability (it is even worse if the requests generate parsing exceptions).

Consider using dedicated masters (an odd number, like 1, 3, 5 - 3 dedicated masters is recommended for high availability). They don't need to be as beefy as your data nodes (2GB and 4 cores should be more than enough for your cluster) but they need to fail independently. Also, remember to set discovery.zen.minimum_master_nodes to "(total_masters / 2) + 1".
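
To make the arithmetic concrete, here is a small sketch of that formula. It assumes the division is integer division (rounded down) and that `total_masters` counts master-eligible nodes; the function name is just for illustration.

```python
def minimum_master_nodes(total_masters):
    """Quorum size for a given number of master-eligible nodes.

    Implements (total_masters / 2) + 1 with integer division.
    """
    if total_masters < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return total_masters // 2 + 1


# With 3 dedicated masters the setting would be 2.
print(minimum_master_nodes(3))  # 2
```

The resulting value is what would go into discovery.zen.minimum_master_nodes on each node.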

mvleandro · January 4, 2017, 1:25pm

Thank you Thiago!

I'll add 3 dedicated masters to my cluster, and after that I'll update this thread with the results.
Also, I would like to know whether there is a formula for calculating the ideal number of master nodes in a cluster.

thiago · January 4, 2017, 1:42pm

There is no formula besides that it should be an odd number and that they fail independently. 3 dedicated masters will give you high availability (2 must fail at the same time for the cluster to become unavailable).
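
The reason for preferring an odd number can be seen with a quick calculation. This sketch assumes a majority quorum of (masters / 2) + 1 with integer division; the function name is only illustrative.

```python
def tolerated_failures(masters):
    """How many masters can fail while a majority quorum is still reachable."""
    quorum = masters // 2 + 1
    return masters - quorum


for masters in range(1, 6):
    print(masters, "masters ->", tolerated_failures(masters), "failure(s) tolerated")
# 1 masters -> 0 failure(s) tolerated
# 2 masters -> 0 failure(s) tolerated
# 3 masters -> 1 failure(s) tolerated
# 4 masters -> 1 failure(s) tolerated
# 5 masters -> 2 failure(s) tolerated
```

Going from 3 to 4 masters does not buy any extra fault tolerance, which is why an even count is usually not worth it.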

mvleandro · January 4, 2017, 1:55pm

Ok. Thank you again!
