ES cluster performance is lower than single-node ES

Hi

I am using ES version 6.6.0.

I use ES for a heavy indexing load: almost 20 MBps, ~80K rps.
Hardware details: each node has 88 CPUs (44 cores), 32 GB (ES) + 32 GB (Lucene) RAM, and 5 SSDs in RAID 10.
Each index is 60 GB primary, 120 GB total.

Traffic is continuous. We keep the last 13 indices and delete older indices after 12 hours.

When we run this test with a single ES node (one data node out of the 3 nodes, 0 replicas), everything goes well and it performs great. But when I add two more data nodes and set replicas to 1, everything falls apart and I cannot get much performance. I also tried reducing replicas to 0 on the 3-node setup, and that did not help.

Here are the index settings:

{
  "logs-2020.08.01.12": {
    "settings": {
      "index": {
        "mapping": {
          "total_fields": {
            "limit": "3000"
          }
        },
        "refresh_interval": "30s",
        "number_of_shards": "1",
        "translog": {
          "flush_threshold_size": "1g",
          "sync_interval": "10s",
          "durability": "async"
        },
        "provided_name": "logs-2020.08.01.12",
        "merge": {
          "scheduler": {
            "max_thread_count": "16"
          }
        },
        "creation_date": "1596285442601",
        "unassigned": {
          "node_left": {
            "delayed_timeout": "10m"
          }
        },
        "number_of_replicas": "0",
        "uuid": "7avWG6PLQHmTapyxl7nJEg",
        "version": {
          "created": "6060099"
        }
      }
    }
  }
}

Can anyone help me figure out what could be wrong? The 3-node setup sustains only 50K rps. Beyond that, old gen usage goes to 100%, and it recovers only when I reduce the traffic. What could be the reason it works on one node and not on 3 nodes?
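To watch the old gen behaviour described above on each node while the load ramps up, here is a minimal sketch. It assumes the response layout of the `_nodes/stats/jvm` API (`jvm.mem.pools.old` with `used_in_bytes` and `max_in_bytes`); the function names and the base URL are placeholders of mine, not something from this setup.

```python
import json
import urllib.request


def old_gen_usage(stats: dict) -> dict:
    """Map node name -> fraction of the old gen pool in use.

    `stats` is the parsed JSON body of GET /_nodes/stats/jvm.
    """
    usage = {}
    for node in stats.get("nodes", {}).values():
        old = node["jvm"]["mem"]["pools"]["old"]
        max_bytes = old["max_in_bytes"]
        if max_bytes > 0:  # skip pools that report no maximum
            usage[node["name"]] = old["used_in_bytes"] / max_bytes
    return usage


def fetch_jvm_stats(base_url: str) -> dict:
    """Fetch JVM stats; base_url (e.g. http://localhost:9200) is a placeholder."""
    with urllib.request.urlopen(f"{base_url}/_nodes/stats/jvm") as response:
        return json.load(response)
```

Calling `old_gen_usage(fetch_jvm_stats(...))` every few seconds during the test would show whether all three nodes fill up together or only one of them does.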

Try setting the number of primary shards for the index to 3 and leave the number of replicas at 0. Replicas do the same work as the primary, so adding a replica shard will increase resiliency but likely reduce indexing throughput. One factor that can impact performance as soon as you start clustering is network performance. What kind of networking do you have in place?
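As a sketch of the suggestion above, this builds a create-index body with 3 primary shards and 0 replicas. The helper names, the base URL and the index name are placeholders; only the two settings come from this thread.

```python
import json
import urllib.request


def build_index_body(primary_shards: int = 3, replicas: int = 0) -> dict:
    """Return a create-index body with explicit shard and replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


def create_index(base_url: str, name: str, body: dict) -> int:
    """PUT the body to <base_url>/<name> and return the HTTP status code.

    base_url and name are placeholders, e.g. http://localhost:9200 and
    an index name of your choice.
    """
    request = urllib.request.Request(
        url=f"{base_url}/{name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Since the indices here are created on a schedule, the same two settings would more likely go into the index template; the body above only shows the values.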

Hi

I already had primary shards at 3 (our default) and replicas at 0. I also tried with 6 shards.

What kind of interface do you mean?

We run Elasticsearch in Kubernetes pods. Using iftop -B, I see heavy traffic of ~18 MB between the pods.

So I tried with only one ES node, and performance improved a lot. But when I have 3 containers, the issues start.

What kind of network do you mean? These are standard Dell servers with a ~1 Gbps interface.

The example above shows 1 shard because I had reduced the number of nodes to 1. Sorry about that.

Here are some of our configuration settings:

indices.fielddata.cache.size: 10%

indices.memory.index_buffer_size: 30%

thread_pool.write.queue_size: 2000

"index.number_of_replicas": 1,
"index.number_of_shards": 3,
"index.merge.scheduler.max_thread_count": 16,
"index.refresh_interval": "30s",
"index.translog.durability": "async",
"index.translog.flush_threshold_size": "1g",
"index.translog.sync_interval": "10s",
"index.unassigned.node_left.delayed_timeout": "10m",
"index.mapping.total_fields.limit": 3000

If you only have ~1 Gbps networking, that could very well be what is limiting indexing throughput. I have seen nodes being limited by the network at even lower throughput levels than you are seeing. There will be a lot more data transferred when you have a cluster compared to a standalone node.
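To illustrate why a cluster moves more data than a standalone node, here is a rough payload estimate. The model is an assumption on my part: documents are routed uniformly, shards are spread evenly across the nodes, and bulk requests arrive at any node. It ignores transport overhead, shard relocation and recovery, and any Kubernetes overlay cost, so treat the result as a lower bound rather than a measurement.

```python
def inter_node_mb_per_s(ingest_mb_per_s: float, nodes: int, replicas: int) -> float:
    """Rough estimate of indexing payload sent between nodes, in MB/s."""
    if nodes < 1 or replicas < 0:
        raise ValueError("nodes must be >= 1 and replicas >= 0")
    if nodes == 1:
        return 0.0  # a standalone node forwards nothing
    # The node that receives a bulk request holds the target primary for
    # roughly 1/nodes of the documents; the rest go to another node.
    to_primaries = ingest_mb_per_s * (nodes - 1) / nodes
    # Each primary then sends every operation to its replica copies, which
    # always live on other nodes. Replicas beyond nodes - 1 stay unassigned.
    to_replicas = ingest_mb_per_s * min(replicas, nodes - 1)
    return to_primaries + to_replicas


if __name__ == "__main__":
    for replica_count in (0, 1):
        estimate = inter_node_mb_per_s(20.0, nodes=3, replicas=replica_count)
        print(f"3 nodes, {replica_count} replica(s): ~{estimate:.1f} MB/s between nodes")
```

With the ~20 MBps ingest rate mentioned earlier, this gives about 13.3 MB/s for 3 nodes with no replicas and about 33.3 MB/s with one replica, compared with zero for a single node.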

How did you arrive at the non-default settings you have here?

I am not sure about the network interface bandwidth, but I will get the details and share them here.

We have been using ES for almost 3 years. I have asked many questions in this forum and YOU have always helped me with a lot of answers. Thanks for that.

I have tuned these settings over time. Until now we were using HDDs and were able to get 20K rps at most, but now we have SSDs and are trying to get the maximum rps.

Here is the network info

driver: tg3

version: 3.137

firmware-version: FFV21.60.2 bc 5720-v1.39

expansion-rom-version: 

bus-info: 0000:04:00.0

supports-statistics: yes

supports-test: yes

supports-eeprom-access: yes

supports-register-dump: yes

supports-priv-flags: no

It's 1 Gbps.

If the indexing rate goes down with three nodes compared to a single node even if you configure no replicas I usually suspect that network performance is the limiting factor, especially as you have lots of RAM, CPU and fast disks. If that is the case there is not a lot of room for further tuning.
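One way to check the network suspicion is to sample the interface byte counters while the test is running. This is a minimal sketch that assumes a Linux host with `/proc/net/dev`; the interface name passed to `sample` and the 1 Gbps default are assumptions, so adjust them to the actual host.

```python
import time


def parse_proc_net_dev(text: str) -> dict:
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 9:
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def link_utilization(before, after, seconds, link_bits_per_s=1_000_000_000):
    """Return (rx, tx) as fractions of the link speed between two samples."""
    rx = (after[0] - before[0]) * 8 / seconds / link_bits_per_s
    tx = (after[1] - before[1]) * 8 / seconds / link_bits_per_s
    return rx, tx


def sample(interface: str, seconds: float = 5.0):
    """Measure utilization of `interface` (hypothetical name, e.g. eth0)."""
    with open("/proc/net/dev") as handle:
        first = parse_proc_net_dev(handle.read())[interface]
    time.sleep(seconds)
    with open("/proc/net/dev") as handle:
        second = parse_proc_net_dev(handle.read())[interface]
    return link_utilization(first, second, seconds)
```

Run it on the hosts rather than inside the pods so the physical interface is measured. Values close to 1.0 in either direction at the point where throughput stalls would support the network explanation.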

Thanks for your help