Adding more data nodes decreased primary indexing rate

Hi There,

I have been running various capacity and performance test scenarios on our ELK cluster, and today I noticed something that doesn't make sense to me.

I had 4 data nodes and 2 coordinating nodes that I send my LS output to. Each index has 4 primary shards and 1 replica.
I was seeing a constant primary indexing rate of 3k/s coming from 3 different LS nodes.
The data is Metricbeat data, nothing beyond that.

I wanted to increase my indexing rate, so I decided to add two more data nodes (so now I have 6) and increased the number of primary shards to 6.

Any idea why, once the 2 extra data nodes were added with the exact same settings as the other 4, my indexing rate dropped from 3k/s to around 800/s?
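
For illustration, the per-index shard count is set with something along these lines; the template name and index pattern are placeholders, not necessarily our exact setup:

```
PUT _template/metricbeat_template
{
  "template": "metricbeat-*",
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}
```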

What version?
What hardware?
What JVM?
What settings?

> What version?

5.4

> What hardware?

We are running the cluster on VMs, each node on its own VM.
Each data node has 6 CPUs, 24 GB RAM, and a 12 GB heap.
Data is stored locally.

> What JVM?

java version 1.8.0_111

> What settings?

Nothing special, mostly defaults on ES. We recently increased the thread_pool bulk queue size, both with 4 data nodes and with 6 (see the snippet below).
All our LS->ES outputs are directed to the ES cluster through the coordinating nodes.
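
For reference, the queue size change is just this one line in elasticsearch.yml on each data node; the value matches what shows up as queue_size in the cat output further down, and is an example rather than a recommendation:

```
# elasticsearch.yml (ES 5.x setting name)
thread_pool.bulk.queue_size: 1000
```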

Unfortunately, with 6 nodes we see data barely streaming across, with very low resource utilization.

How are you measuring this?

What's sending data to ES?

> What's sending data to ES?

As I mentioned above, we have only Metricbeat data at this point, sending metrics from many different servers to LS (for future purposes); no filtering is done for now. All of the LS nodes then send to the ES cluster.

We have tested two scenarios, one with Kafka between the source and LS and one without Kafka, and got almost the same results.
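
The no-Kafka path is essentially a plain Beats-in, Elasticsearch-out pipeline; a minimal sketch, with placeholder hostnames for the coordinating nodes:

```
input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    # placeholder hostnames for the two coordinating nodes
    hosts => ["coord-node-1:9200", "coord-node-2:9200"]
  }
}
```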

> How are you measuring this?

We use the monitoring feature in Kibana.

Can someone help me understand why, when I run the thread_pool cat API for bulk, I see the queue on almost all my data nodes empty? Is this expected behavior?

I increased the queue_size hoping to get more data through, but after running the API I realized the queue is not even reaching the default size of 200.
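
This is roughly the request I use; the h= column list simply matches the columns shown below:

```
GET _cat/thread_pool/bulk?v&h=node_name,name,active,rejected,completed,queue_size,queue,max,min,type
```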

> node_name  name active rejected completed queue_size queue max min type
> data-1   bulk      1        0   1378783       1000     0   7   7 fixed
> data-3   bulk      0        0   1278412       1000     0   7   7 fixed
> data-2   bulk      2        0   1428337       1000     0   7   7 fixed
> data-4   bulk      7        0   1403869       1000   187   7   7 fixed

That's good; it means Elasticsearch is coping with the traffic.
Try increasing the size of the bulk requests you are sending.

Thanks!

So in my case, since my data is coming through Logstash, would I be able to increase the bulk request size through the batch size and workers, or would it be something else?

I have previously looked at the Logstash elasticsearch output plugin for such a setting, to try and increase my bulk request size, but didn't see any option for that.
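
My current understanding, which may be off, is that the bulk request size is driven by the pipeline batch settings rather than by an elasticsearch output option, i.e. something like this in logstash.yml (values are only illustrative):

```
# logstash.yml - each pipeline worker flushes one batch to the output,
# so a larger batch size should mean larger bulk requests
pipeline.workers: 6
pipeline.batch.size: 1000
pipeline.batch.delay: 50
```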
