Cluster unstable and running slowly


(Marcus Vinicius) #1

Hello everyone!

I am running a cluster with 7 nodes under version 5.1.1.

I upgraded this cluster from an old 2.x version 20 days ago, and it was running fine until today.

Since 2pm today, all requests made to any node of the cluster have been very slow.

I checked whether I have a problem with I/O or network throughput, but that's not the case.
I also disabled some monitoring agents to isolate the problem.

I collected some logs from my current master node and uploaded them to my Google Drive.
My cluster log: https://drive.google.com/open?id=0B2uG9PA8RQJGX0xqNmJCUEM1Snc
My cluster deprecation log: https://drive.google.com/open?id=0B2uG9PA8RQJGcEh0R1JDTG9vaU0

I'm not sure whether the cluster upgrade caused the problem or some other factor that I can't see right now.

Could someone shed some light on what could be happening?


(Mark Harwood) #2

I'm seeing a lot of date/time parsing errors in the first of those log files.


(Marcus Vinicius) #3

Update:

I stopped sending data to the cluster and restarted all nodes. Then I waited for the cluster to go green and started sending data to it again.

After this, my cluster is back to running fine, and I have no idea what could have happened.

Maybe a bug in version 5.x?


(Marcus Vinicius) #4

Hello Mark!

Thanks for your answer. So, the DateTime parsing failures are known failures that will be fixed soon. But I have a question: can these failures bring the whole cluster to a stop?


(Mark Harwood) #5

Throwing an exception is one of the more expensive things you can do in Java. If you're doing this on every doc in a firehose of log records I imagine that could get costly.


(Marcus Vinicius) #6

Thanks again. I'll enforce this with the teams that are sending these failing requests to the cluster.
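
A minimal sketch of a client-side check that the sending teams could run before indexing, so that malformed dates never reach the cluster. The field name `timestamp`, the expected format, and the helper names are all assumptions for illustration; the real mapping may expect something different.

```python
from datetime import datetime

# Assumption: the index expects timestamps such as "2017-01-25T14:00:00".
# Adjust this to whatever date format the mapping actually uses.
EXPECTED_FORMAT = "%Y-%m-%dT%H:%M:%S"


def has_valid_timestamp(doc, field="timestamp"):
    """Return True if doc[field] is a string that parses with EXPECTED_FORMAT."""
    value = doc.get(field)
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, EXPECTED_FORMAT)
    except ValueError:
        return False
    return True


def split_valid(docs, field="timestamp"):
    """Split docs into (valid, rejected) so only valid ones are sent on."""
    valid, rejected = [], []
    for doc in docs:
        if has_valid_timestamp(doc, field):
            valid.append(doc)
        else:
            rejected.append(doc)
    return valid, rejected


if __name__ == "__main__":
    docs = [
        {"timestamp": "2017-01-25T14:00:00"},  # parses with EXPECTED_FORMAT
        {"timestamp": "25/01/2017"},           # wrong format, rejected
        {},                                    # field missing, rejected
    ]
    valid, rejected = split_valid(docs)
    print(len(valid), "valid,", len(rejected), "rejected")
```

Rejected documents can be logged or fixed on the sender's side instead of making the cluster throw a parsing exception for each one.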


(Mark Harwood) #7

with extreme prejudice...


(Thiago Souza) #8

Also, it seems that you are sending these requests to a master node. You should not send requests to master nodes if you want better cluster stability (it is even worse if the requests generate parsing exceptions).

Consider using dedicated masters (an odd number such as 1, 3, or 5; 3 dedicated masters are recommended for high availability). They don't need to be as beefy as your data nodes (2GB and 4 cores should be more than enough for your cluster), but they need to fail independently. Also, remember to set discovery.zen.minimum_master_nodes to "(total_masters / 2) + 1", rounding the division down (so 2 for 3 masters).
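
As a minimal sketch of that quorum rule, the value can be computed with integer division. The function name here is just an illustration, not part of any Elasticsearch API.

```python
def minimum_master_nodes(total_masters):
    """Quorum of master-eligible nodes: (total_masters / 2) + 1, rounded down."""
    if total_masters < 1:
        raise ValueError("need at least one master-eligible node")
    return total_masters // 2 + 1


if __name__ == "__main__":
    for total in (1, 3, 5):
        print(total, "masters -> minimum_master_nodes =", minimum_master_nodes(total))
```

With 3 dedicated masters this gives 2, which is the value to use for discovery.zen.minimum_master_nodes.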


(Marcus Vinicius) #9

Thank you Thiago!

I'll put 3 dedicated masters in my cluster, and after that I'll update this thread with the results.
Also, I would like to know whether there is a formula for calculating the ideal number of master nodes in a cluster.


(Thiago Souza) #10

There is no formula beyond this: it should be an odd number, and the masters should fail independently. 3 dedicated masters will give you high availability (2 must fail at the same time for the cluster to become unavailable).
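
To illustrate why an odd number is preferred, here is a small sketch (the function name is made up for this example) that counts how many masters can fail while a quorum of (total / 2) + 1 is still reachable:

```python
def tolerated_master_failures(total_masters):
    """Masters that can fail while a quorum of (total // 2) + 1 remains."""
    quorum = total_masters // 2 + 1
    return total_masters - quorum


if __name__ == "__main__":
    for total in range(1, 6):
        print(total, "masters tolerate", tolerated_master_failures(total), "failure(s)")
```

Going from 3 masters to 4 does not buy any extra tolerance (both survive only 1 failure), while 5 masters survive 2.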


(Marcus Vinicius) #11

Ok. Thank you again! :)

