Elasticsearch Performance Issue After Adding Second Node

Hello everyone,

I’m currently experiencing a performance issue with my Elasticsearch cluster.

Previously, I was using a single-node setup, and all queries—especially _count—were working quickly and reliably. Recently, I added a second node on a separate server, and while the node successfully joined the cluster, the performance has significantly degraded.

For example, running a simple GET /_count now times out and fails (the client reports a Bad Request), whereas it used to return instantly on the single node.
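
For reference, this is roughly how I run the count; the host and port are placeholders for my setup, and the 30-second client-side limit is only there so the call fails instead of hanging:

# Placeholder host/port; -m 30 caps the client-side wait at 30 seconds
curl -s -m 30 "http://localhost:9200/_count?pretty"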

Cluster Setup:

  • Master Node Specs:
    • RAM: 128 GB
    • Storage: 1 TB
  • Second Node Specs:
    • RAM: 64 GB
    • Storage: 1 TB
  • Data Volume:
    • Approx. 1 billion documents
    • Approx. 9,000 indices

I would really appreciate your guidance on how to investigate and resolve this performance issue.

Thank you in advance.

Best regards,

Out of interest, why did you add a second node? What was the driver: more disk space, something related to redundancy, hope for better performance, ...?

Are the shards from 9000 indices now spread across the 2 nodes? Was replica count 0 before, or was it 1?

Please share the output from these 2 calls:

/_cluster/health

/_cat/nodes?v&h=name,version,node.role,cpu,ram.percent,heap.percent,disk.used_percent,load_1m,load_5m,load_15m
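
If it helps, something like this fetches both (host and port are placeholders for wherever your cluster is listening):

# Overall cluster state and shard counts
curl -s "http://localhost:9200/_cluster/health?pretty"

# Per-node resource usage
curl -s "http://localhost:9200/_cat/nodes?v&h=name,version,node.role,cpu,ram.percent,heap.percent,disk.used_percent,load_1m,load_5m,load_15m"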

Why I added a second node:

The main reason was to improve performance and scalability. I’m working with a large dataset (~1 billion documents across ~9,000 indices), so I assumed distributing the load over two nodes would speed things up. I also aimed to improve redundancy and fault tolerance.

Before adding the second node:

  • Replica count was set to 0.
  • All shards were stored on the single node.

After adding the second node, I updated the replica count to 1, so shards are now distributed across both nodes.
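
For context, this is the kind of call I mean by updating the replica count, via the index settings API (the index name and host here are placeholders, not my real values):

# Placeholder index name and host; sets number_of_replicas to 1 for that index
curl -s -X PUT "http://localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'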

Requested Output:

1. /_cluster/health:

{
  "cluster_name": "gig-at",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 8456,
  "active_shards": 8494,
  "relocating_shards": 8,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

2. /_cat/nodes?v&h=name,version,node.role,cpu,ram.percent,heap.percent,disk.used_percent,load_1m,load_5m,load_15m:

name   version node.role cpu ram.percent heap.percent disk.used_percent load_1m load_5m load_15m
node-1 8.8.2   dim        5     70           68             60               1.25    1.12     0.98
node-2 8.8.2   di         3     55           40             58               0.80    0.76     0.65

Thank you for answering with the requested output; not everyone does, and it is appreciated.

It's not the topic you asked about, but you have not improved your fault tolerance. Before, when you had one node, you had no fault tolerance - if that node was unavailable, you had a problem. That is still the case: that node remains a SPOF, as it is the only master-eligible node. A 2-node cluster, where you set both nodes as master-eligible, is not actually any better either, at least not in that sense. And I'm not clear what data has been replicated either ...

You said you set replicas to 1, but did you do so for all 8000+ indices, or just a few?

Because, on the assumption that each index has 1 primary shard, your output suggests very few indices have any replica shards:

  "active_primary_shards": 8456,
  "active_shards": 8494 <-- this should be roughly be 2x the number above if all shards are replicated

In the cluster health output, I notice 8 relocating_shards. Has this ever reached zero?

I'm wondering what is really going on here: has your cluster just "balanced" your 8000+ indices across the 2 nodes, is that balancing maybe still ongoing, and is it thus consuming a lot of bandwidth between the 2 nodes? What's the connectivity between the 2 nodes, in terms of available bandwidth and latency?
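
You can also see what is actually being moved, and how far along it is, with the recovery cat API; a sketch, same placeholder-host caveat:

# Show only shard recoveries/relocations currently in flight,
# including source/target node and how much data has been copied
curl -s "http://localhost:9200/_cat/recovery?v&active_only=true&h=index,shard,time,type,stage,source_node,target_node,bytes_percent"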

Do a GET on

/_cat/shards?h=p

and pipe the output into sort | uniq -c

In my tiny 3-node cluster I get:

$ escurl "/_cat/shards?h=p" | sort | uniq -c
      6 p
      6 r

i.e. I have 6 primary shards and 6 replica shards.

Also

$ escurl "/_cat/shards?h=node" | sort | uniq -c
      4 node1
      4 node2
      4 node3

so those 12 total shards are distributed 4/4/4 across my 3 nodes.

I'd be curious to see your outputs for the same.

(escurl is just a shell alias to the curl command with relevant arguments)
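
In case it's useful, mine is nothing fancy; roughly the following, sketched here as a small shell function rather than a literal alias so the API path can simply be appended (host, port and credentials are placeholders):

# Placeholder host/port/credentials; adjust for your cluster
escurl() {
  curl -s -u elastic:changeme "https://localhost:9200$1"
}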
