Hi @Lior_Yakobov,
3 dedicated master nodes; indices have 3 primary shards with 1 replica, and each shard is ~33.33 GB.
Filebeat --> Kafka --> Production LS (15 Pods) --> ES Clients (4 load balanced) --> ES Hot nodes (6 Nodes)
In the index templates:

```json
"index": {
  "lifecycle": {
    "name": "production-ilm",
    "rollover_alias": "production"
  },
  "routing": {
    "allocation": {
      "require": {
        "box_type": "hot-app-logs"
      },
      "total_shards_per_node": "1"
    }
  },
  "mapping": {
    "total_fields": {
      "limit": "1000"
    }
  },
  "refresh_interval": "30s",
  "number_of_shards": "3",
  "translog": {
    "durability": "async"
  },
  "soft_deletes": {
    "enabled": "true"
  }
}
```
Hot nodes -
Mixed-instance ASG (16 CPU / 128 GB memory); each pod is allocated 14 CPU / 114 GB. Heap is 30 GB.
elasticsearch.yml (the settings below help us ingest more):
indices.memory.index_buffer_size: 30%
indices.memory.min_index_buffer_size: 96mb
ILM policies are used to roll data over to the warm nodes.
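For reference, a minimal sketch of what such an ILM policy could look like (the rollover threshold, `min_age`, and the warm `box_type` value are assumptions for illustration, not our actual settings; 100 GB across 3 primaries matches the ~33.33 GB per-shard figure above):

```json
PUT _ilm/policy/production-ilm
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "100gb"
          }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "allocate": {
            "require": {
              "box_type": "warm-app-logs"
            }
          }
        }
      }
    }
  }
}
```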
Warm nodes -
Mixed-instance ASG (16 CPU / 64 GB memory); each pod is allocated 14 CPU / 57 GB. Heap is 30 GB.
elasticsearch.yml (the settings below allow more caching, which gives better search performance):
indices.queries.cache.size: 40%
indices.requests.cache.size: 25%
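To verify those caches are actually being used, the node stats API can be queried for the relevant metrics:

```json
GET _nodes/stats/indices/query_cache,request_cache
```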
FYI - high OS CPU usage on 7.7.1:
We verified something odd about ES releases newer than 7.5 that leads to high (almost 100%) OS CPU usage; the same configuration on 7.4.0 behaves normally.
We also moved everything to a single AZ to save money; since everything is on EBS volumes, we trust AWS not to clobber that data.
Hope this helps.
We sometimes ingest more than 6 TB of data a day; there is lag, but only for a few hours.