Logstash is not ingesting data from all nodes

I am using Filebeat to ship logs to Logstash, which ingests the data into an index. For the last two days I have been seeing multiple issues.

  1. The Kibana dashboard is not showing logs for a particular index on a regular basis. When I check the Logstash logs, Logstash appears to be running fine; if I restart the service, logs show up again, but there is noticeable delay before they populate in the dashboard.

  2. For any given service we usually run 40+ nodes, all with Filebeat installed, but I am not seeing logs from some instances in the Kibana dashboard. I grepped the Logstash logs for the missing node's IP but found no entries for it. We use the same Filebeat configuration across all nodes, so I don't understand why nothing is arriving from that particular IP.
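For issue 2, one way to narrow down whether the problem is on the shipping side is to check Filebeat directly on one of the silent nodes (a rough sketch; adjust service name and time window to your setup):

```shell
# On a node whose logs are missing from Kibana:
sudo systemctl status filebeat     # is the service actually running?
sudo filebeat test config          # does the config parse?
sudo filebeat test output          # can it reach the configured Logstash output?

# Recent Filebeat errors (connection refused, backpressure, registry issues):
sudo journalctl -u filebeat --since "1 hour ago" | grep -iE "error|warn"
```

If `filebeat test output` fails only on those nodes, the problem is connectivity or host-level, not the shared configuration.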

Which version of Elasticsearch are you using?

What is the size and specification of your Elasticsearch cluster? What type of hardware are you using?

What is the full output of the cluster stats API?
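(If useful, the cluster stats can be pulled with something like the following; host, port, and credentials are placeholders:)

```shell
curl -s -u elastic:changeme "http://localhost:9200/_cluster/stats?pretty"
```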

Is there anything in the Elasticsearch logs that indicates issues or errors?

I am using version 7.17.
Elasticsearch and Logstash are running on two separate nodes, each with 16 GB of RAM and 4 vCPUs.

```json
{"_nodes":{"total":3,"successful":3,"failed":0},"cluster_name":"elasticsearch","cluster_uuid":"aAeJWK8FRTSeyDaac4_GZw","timestamp":1673513937719,"status":"yellow",
"indices":{"count":44,"shards":{"total":44,"primaries":44,"replication":0.0,"index":{"shards":{"min":1,"max":1,"avg":1.0},"primaries":{"min":1,"max":1,"avg":1.0},"replication":{"min":0.0,"max":0.0,"avg":0.0}}},
"docs":{"count":255045676,"deleted":408045},"store":{"size_in_bytes":132162572193,"total_data_set_size_in_bytes":132162572193,"reserved_in_bytes":0},"fielddata":{"memory_size_in_bytes":1416,"evictions":0},"query_cache":{"memory_size_in_bytes":14480,"total_count":3162,"hit_count":491,"miss_count":2671,"cache_size":39,"cache_count":42,"evictions":3},"completion":{"size_in_bytes":0},
"segments":{"count":400,"memory_in_bytes":7897418,"terms_memory_in_bytes":5537864,"stored_fields_memory_in_bytes":649376,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":706112,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":1004066,"index_writer_memory_in_bytes":326940254,"version_map_memory_in_bytes":7291102,"fixed_bit_set_memory_in_bytes":193544,"max_unsafe_auto_id_timestamp":1673493028872,"file_sizes":{}},
"mappings":{"field_types":[{"name":"alias","count":9,"index_count":9,"script_count":0},{"name":"boolean","count":305,"index_count":28,"script_count":0},{"name":"byte","count":9,"index_count":9,"script_count":0},{"name":"constant_keyword","count":33,"index_count":11,"script_count":0},{"name":"date","count":588,"index_count":33,"script_count":0},{"name":"flattened","count":108,"index_count":9,"script_count":0},{"name":"float","count":99,"index_count":16,"script_count":0},{"name":"geo_point","count":81,"index_count":9,"script_count":0},{"name":"half_float","count":32,"index_count":8,"script_count":0},{"name":"histogram","count":9,"index_count":9,"script_count":0},{"name":"integer","count":88,"index_count":4,"script_count":0},{"name":"ip","count":146,"index_count":11,"script_count":0},{"name":"keyword","count":10029,"index_count":33,"script_count":0},{"name":"long","count":2208,"index_count":31,"script_count":0},{"name":"match_only_text","count":522,"index_count":9,"script_count":0},{"name":"nested","count":132,"index_count":16,"script_count":0},{"name":"object","count":3454,"index_count":33,"script_count":0},{"name":"scaled_float","count":72,"index_count":9,"script_count":0},{"name":"text","count":434,"index_count":29,"script_count":0},{"name":"version","count":3,"index_count":3,"script_count":0},{"name":"wildcard","count":135,"index_count":9,"script_count":0}],"runtime_field_types":[]},
"analysis":{"char_filter_types":[],"tokenizer_types":[],"filter_types":[],"analyzer_types":[],"built_in_char_filters":[],"built_in_tokenizers":[],"built_in_filters":[],"built_in_analyzers":[]},"versions":[{"version":"7.17.7","index_count":44,"primary_shard_count":44,"total_primary_bytes":132162572193}]},
"nodes":{"count":{"total":3,"coordinating_only":0,"data":0,"data_cold":1,"data_content":1,"data_frozen":0,"data_hot":1,"data_warm":1,"ingest":1,"master":1,"ml":0,"remote_cluster_client":0,"transform":0,"voting_only":0},"versions":["7.17.7","7.17.8"],
"os":{"available_processors":12,"allocated_processors":12,"names":[{"name":"Linux","count":3}],"pretty_names":[{"pretty_name":"Ubuntu 20.04.5 LTS","count":3}],"architectures":[{"arch":"amd64","count":3}],"mem":{"total_in_bytes":50328584192,"free_in_bytes":12158894080,"used_in_bytes":38169690112,"free_percent":24,"used_percent":76}},"process":{"cpu":{"percent":8},"open_file_descriptors":{"min":353,"max":829,"avg":511}},
"jvm":{"max_uptime_in_millis":520809764,"versions":[{"version":"19.0.1","vm_name":"OpenJDK 64-Bit Server VM","vm_version":"19.0.1+10-21","vm_vendor":"Oracle Corporation","bundled_jdk":true,"using_bundled_jdk":true,"count":2},{"version":"19","vm_name":"OpenJDK 64-Bit Server VM","vm_version":"19+36-2238","vm_vendor":"Oracle Corporation","bundled_jdk":true,"using_bundled_jdk":true,"count":1}],"mem":{"heap_used_in_bytes":11067341576,"heap_max_in_bytes":25165824000},"threads":147},
"fs":{"total_in_bytes":3171279933440,"free_in_bytes":3025203310592,"available_in_bytes":3025152978944},"plugins":[],"network_types":{"transport_types":{"security4":3},"http_types":{"security4":3}},"discovery_types":{"zen":3},"packaging_types":[{"flavor":"default","type":"deb","count":3}],
"ingest":{"number_of_pipelines":24,"processor_stats":{"conditional":{"count":0,"failed":0,"current":0,"time_in_millis":0},"convert":{"count":0,"failed":0,"current":0,"time_in_millis":0},"geoip":{"count":0,"failed":0,"current":0,"time_in_millis":0},"grok":{"count":0,"failed":0,"current":0,"time_in_millis":0},"gsub":{"count":0,"failed":0,"current":0,"time_in_millis":0},"pipeline":{"count":0,"failed":0,"current":0,"time_in_millis":0},"remove":{"count":0,"failed":0,"current":0,"time_in_millis":0},"rename":{"count":0,"failed":0,"current":0,"time_in_millis":0},"script":{"count":0,"failed":0,"current":0,"time_in_millis":0},"set":{"count":0,"failed":0,"current":0,"time_in_millis":0},"set_security_user":{"count":0,"failed":0,"current":0,"time_in_millis":0},"user_agent":{"count":0,"failed":0,"current":0,"time_in_millis":0}}}}}
```

I don't see any errors in Elasticsearch.

But what I am thinking is that Logstash is completely overloaded CPU-wise, so it is taking a lot of time to ingest/write data to the index.

"versions": ["7.17.7", "7.17.8"],

It looks like you have nodes of two different versions in the cluster. Make sure all nodes are running 7.17.8.
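A quick way to see which node is on the older version (assuming the default port and no auth; add credentials if security is enabled):

```shell
curl -s "http://localhost:9200/_cat/nodes?v&h=name,ip,version,node.role"
```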

What type of storage are you using?

AWS EBS gp3 SSD volumes. I have upgraded the ES node.
At any given time we will have around 200 nodes running, with 2 grok patterns and filters. I have now upgraded the Logstash node to 64 GB of RAM with 16 vCPUs. Are there any benchmarks I can run? I am seeing a bit of a delay in data ingestion, which I guess is expected, but I am worried about data loss.
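One way to get rough benchmark numbers is the Logstash monitoring API, which reports per-pipeline event counts and filter timings (assuming the default API port 9600 on the Logstash host):

```shell
# Per-pipeline throughput: events in/filtered/out and time spent per plugin.
# Compare duration_in_millis on the grok filters before and after the upgrade.
curl -s "http://localhost:9600/_node/stats/pipelines?pretty"

# Hot threads, to confirm whether workers are CPU-bound on the grok filters:
curl -s "http://localhost:9600/_node/hot_threads?pretty"
```

If data loss is the main worry, the usual first step is enabling Logstash persistent queues (`queue.type: persisted` in `logstash.yml`), so events survive a Logstash crash or restart.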

OK, to avoid this I am now buffering through SQS. So far so good.
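For anyone following along, a minimal sketch of what that buffering might look like on the Logstash side, using the logstash-input-sqs plugin (queue name, region, and codec are placeholders, not the poster's actual setup):

```
input {
  sqs {
    queue  => "app-logs"     # hypothetical queue name
    region => "us-east-1"    # hypothetical region
    codec  => "json"
  }
}
```

With SQS in front, a slow or restarting Logstash just lets messages accumulate in the queue instead of dropping them.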