AWS EC2 based cluster best practices

Hi @Lior_Yakobov,

3 dedicated master nodes; indices have 3 primary shards with 1 replica, and each shard is roughly 33.33 GB.

Filebeat --> Kafka --> Production Logstash (15 pods) --> ES client nodes (4, load balanced) --> ES hot nodes (6 nodes)

In the index templates:

    "index": {
      "lifecycle": {
        "name": "production-ilm",
        "rollover_alias": "production"
      },
      "routing": {
        "allocation": {
          "require": {
            "box_type": "hot-app-logs"
          },
          "total_shards_per_node": "1"
        }
      },
      "mapping": {
        "total_fields": {
          "limit": "1000"
        }
      },
      "refresh_interval": "30s",
      "number_of_shards": "3",
      "translog": {
        "durability": "async"
      },
      "soft_deletes": {
        "enabled": "true"
      },
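
For context, a rough sketch of how those settings can sit inside a template — the `_template` name and the `production-*` index pattern are assumptions on my part; only the settings shown above are from our actual setup:

    PUT _template/production
    {
      "index_patterns": ["production-*"],
      "settings": {
        "index": {
          "lifecycle": {
            "name": "production-ilm",
            "rollover_alias": "production"
          },
          "routing": {
            "allocation": {
              "require": { "box_type": "hot-app-logs" },
              "total_shards_per_node": "1"
            }
          },
          "refresh_interval": "30s",
          "number_of_shards": "3",
          "number_of_replicas": "1"
        }
      }
    }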

Hot Nodes -

Mixed-instance ASG (16 vCPU / 128 GB memory), with 14 vCPU / 114 GB allocated to each pod. Heap is 30 GB.
elasticsearch.yaml (the settings below help us ingest more):

    indices.memory.index_buffer_size: 30%
    indices.memory.min_index_buffer_size: 96mb
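
For the template's `routing.allocation.require.box_type` setting above to actually route new indices here, the hot nodes also carry the matching node attribute — a minimal sketch (rest of the node config omitted):

    # hot data node (sketch) -- must match allocation.require in the template
    node.attr.box_type: hot-app-logs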

ILM policies are used to roll data over and move it to the warm nodes (see the policy sketch below).
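
A minimal sketch of what a policy like `production-ilm` can look like — the rollover thresholds and the warm `box_type` value are illustrative assumptions, not our exact numbers (with 3 primaries, a 100gb rollover size would line up with the ~33 GB shards mentioned above):

    PUT _ilm/policy/production-ilm
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_size": "100gb",
                "max_age": "1d"
              }
            }
          },
          "warm": {
            "min_age": "1d",
            "actions": {
              "allocate": {
                "require": { "box_type": "warm-app-logs" }
              }
            }
          }
        }
      }
    }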

Warm nodes -
Mixed-instance ASG (16 vCPU / 64 GB memory), with 14 vCPU / 57 GB allocated to each pod. Heap is 30 GB.
elasticsearch.yaml (the settings below allow more caching, which gives better search performance):

    indices.queries.cache.size: 40%
    indices.requests.cache.size: 25%
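
And the warm nodes need the node attribute that the ILM warm phase allocates on — same hypothetical value as in the policy sketch above:

    # warm data node (sketch) -- must match allocate.require in the ILM warm phase
    node.attr.box_type: warm-app-logs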

FYI - high OS CPU usage on 7.7.1

We did notice something odd with ES releases above 7.5 that leads to high (almost 100%) OS CPU usage; with the same configuration on 7.4.0 it is normal.

We also moved everything to a single AZ to save money; since everything sits on EBS volumes, we trust AWS not to lose that data.

Hope this helps.

We sometimes ingest more than 6 TB of data a day; there is lag, but only for a few hours.