How to Improve ELK Performance

Hi Team,

My flow is: JSON log files are processed through Filebeat > Logstash > Elasticsearch, but some data is being skipped.

The thread_pool output below shows many rejected write requests.

node-1 write               0 0    0
node-2 write               0 0 151424
node-3 write               0 0    573
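For reference, output like the above can be pulled with the thread pool cat API, along these lines (the host is a placeholder, and I am reading the three columns as active, queue, rejected):

  # placeholder host; columns requested: node name, pool name, active, queue, rejected
  curl -s "http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"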

I can see the below two errors many times in the recent Logstash logs.

Error 1:

[INFO ][logstash.outputs.elasticsearch][inventory] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of processing of [213657339][indices:data/write/bulk[s][p]]: request: BulkShardRequest [inventory][0]] containing [600] requests, target allocation id: AGnVTdHZSo-QDyap7mq4qg, primary term: 7 on EsThreadPoolExecutor[name = node-2/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@2f18ea8e[Running, pool size = 8, active threads = 8, queued tasks = 2539, completed tasks = 9050422]]"})

Error 2:

[INFO][logstash.outputs.elasticsearch][rtv] retrying failed action with response code: 429 ({"type"=>"circuit_breaking_exception", "reason"=>"[parent] Data too large, data for [<transport_request>] would be [8267602870/7.6gb], which is larger than the limit of [8094194073/7.5gb], real usage: [8266697320/7.6gb], new bytes reserved: [905550/884.3kb], usages [request=0/0b, fielddata=13447/13.1kb, in_flight_requests=3286973706/3gb, accounting=53142083/50.6mb]", "bytes_wanted"=>8267602870, "bytes_limit"=>8094194073, "durability"=>"TRANSIENT"})
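If it helps, the parent breaker usage behind this second error can be checked with the node stats breaker API (host below is a placeholder):

  # placeholder host; shows per-node parent circuit breaker limit and estimated usage
  curl -s "http://localhost:9200/_nodes/stats/breaker?filter_path=nodes.*.name,nodes.*.breakers.parent"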

Filebeat YML:

  scan_frequency: 30m
  ignore_older: 73h
  close_inactive: 72h
  clean_inactive: 74h
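For context, these options sit under our log input in filebeat.yml, roughly like this (the path shown is a placeholder, not our real one):

  filebeat.inputs:
    - type: log
      paths:
        - /path/to/app/*.json   # placeholder path
      scan_frequency: 30m
      ignore_older: 73h
      close_inactive: 72h
      clean_inactive: 74h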

Logstash pipelines.yml

- pipeline.id: order
  path.config: "order.conf"
  queue.type: persisted
  pipeline.workers: 10
  pipeline.batch.size: 1000
  pipeline.batch.delay: 5

Elastic configuration:

Version: 7.6.0 (we will upgrade in the near future)
Nodes: 3
Disk Available: 32.79% || 96.2 GB / 293.4 GB
JVM Heap: 51.83% || 12.3 GB / 23.8 GB
Indices: 70
Documents: 186,098,201
Disk Usage: 141.4 GB
Primary Shards: 78
Replica Shards: 78

Logstash/Filebeat servers: 3

As per the CPU usage chart, CPU spikes to 90%-95% for a few seconds every time Filebeat reads data, and then drops back to 1%-10%.

We have 30 pipelines running, each configured with a different log path. For each pipeline, a new log file is pushed every 2 hours, and each file contains approximately 10,000 log entries.

Please suggest how I can improve ELK performance.

Based on the error messages it looks like you are overloading the cluster and that the Elasticsearch heap is too small for your load.

How many indices and shards are you actively indexing into? How many different Elasticsearch output blocks do you have in your config?

Are you using time-based indices? Are you indexing immutable documents and allowing Elasticsearch to assign the document ID?

What is your average document size? Have you followed these guidelines?

What is the specification of the cluster in terms of CPU, memory, heap and type of storage used?

Hi @Christian_Dahlqvist ,

15-20 indices. The index config is 1 primary shard and 1 replica.

In total, I have 20 Logstash configs; each has 1-2 Elasticsearch output blocks.

No, we are not using time-based indices.

Documents are mutable. The document ID is an explicit ID generated using the fingerprint plugin in Logstash.
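Roughly like this in the filter and output sections (the source fields, host, and index name below are placeholders, not our exact config):

  filter {
    fingerprint {
      source => ["field_a", "field_b"]      # placeholder source fields
      target => "[@metadata][fingerprint]"
      method => "SHA256"
      concatenate_sources => true
    }
  }
  output {
    elasticsearch {
      hosts => ["http://localhost:9200"]    # placeholder host
      index => "inventory"                  # placeholder index
      document_id => "%{[@metadata][fingerprint]}"
    }
  }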

Each document has approximately 20 fields and is about 1 KB in size.

Nodes: 3
Disk Available: 32.16% -  94.4 GB / 293.4 GB
JVM Heap: 53.48% -  12.7 GB / 23.8 GB
indices: 70
Documents: 188,730,418
Disk Usage: 146.1 GB
Primary Shards: 78
Replica Shards: 78
"thread_pool" : {
"write" : {
        "queue_size" : "200",
        "size" : "8"
      }
}

jvm.options
-Xms16g
-Xmx16g

Are these defined as independent pipelines similar to the order pipeline shown earlier? If so, do they all have such a high number of pipeline workers and such a large batch size? Does each pipeline index into a single index?

What type of storage does the cluster have? Local SSDs?

If the nodes have 16GB heap, how do you end up with a total of 23.8 GB? Are all the nodes the same size and specification?

Yes, all pipelines are defined as independent pipelines with the same config. Most pipelines index into a single index, but 3-4 pipelines index into 2 different indices based on a field value. Also, the data volume of each log file differs: most are 1 MB-10 MB (1,000-10,000 documents), but 2 log files are 1-1.5 GB (1,000,000-2,000,000 documents). Please suggest what values I should set to handle this.

Yes, local SSDs.

My cluster has 3 nodes, each having 8 GB, so 24 GB total in the cluster. But in the jvm.options config we specified Xmx and Xms as 16 GB.

Can you provide the full output of the cluster stats API? Based on error message 2 it seems like your heap may be 8 GB and not 16 GB.
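One way to confirm what heap each node actually started with is the nodes cat API, e.g. (host is a placeholder):

  # placeholder host; shows per-node configured max heap and total RAM
  curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.max,ram.max"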

I have just edited this. Can you please suggest whether this is causing the issue?

Output of cluster stats

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "elasticsearch_vms_prod",
  "cluster_uuid" : "EKWAEuLWSvqoGGw8kfOoWA",
  "timestamp" : 1655200466249,
  "status" : "green",
  "indices" : {
    "count" : 70,
    "shards" : {
      "total" : 156,
      "primaries" : 78,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 10,
          "avg" : 2.2285714285714286
        },
        "primaries" : {
          "min" : 1,
          "max" : 5,
          "avg" : 1.1142857142857143
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 188842771,
      "deleted" : 24028367
    },
    "store" : {
      "size_in_bytes" : 157050946893
    },
    "fielddata" : {
      "memory_size_in_bytes" : 87016,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 1275774824,
      "total_count" : 168611050,
      "hit_count" : 5326317,
      "miss_count" : 163284733,
      "cache_size" : 62321,
      "cache_count" : 168149,
      "evictions" : 105828
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1589,
      "memory_in_bytes" : 144731958,
      "terms_memory_in_bytes" : 91180630,
      "stored_fields_memory_in_bytes" : 47609552,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 1181760,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 4760016,
      "index_writer_memory_in_bytes" : 4203544,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 1592000,
      "max_unsafe_auto_id_timestamp" : 1655164806472,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 3,
      "coordinating_only" : 0,
      "data" : 3,
      "ingest" : 3,
      "master" : 3,
      "ml" : 3,
      "voting_only" : 0
    },
    "versions" : [
      "7.6.0"
    ],
    "os" : {
      "available_processors" : 24,
      "allocated_processors" : 24,
      "names" : [
        {
          "name" : "Linux",
          "count" : 3
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux Server 7.6 (Maipo)",
          "count" : 3
        }
      ],
      "mem" : {
        "total_in_bytes" : 48003846144,
        "free_in_bytes" : 3718291456,
        "used_in_bytes" : 44285554688,
        "free_percent" : 8,
        "used_percent" : 92
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 780,
        "max" : 1059,
        "avg" : 964
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 12757830494,
      "versions" : [
        {
          "version" : "13.0.2",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "13.0.2+8",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 3
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 10073213000,
        "heap_max_in_bytes" : 25560612864
      },
      "threads" : 341
    },
    "fs" : {
      "total_in_bytes" : 315079139328,
      "free_in_bytes" : 115267661824,
      "available_in_bytes" : 100721975296
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 3
      },
      "http_types" : {
        "security4" : 3
      }
    },
    "discovery_types" : {
      "zen" : 3
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "tar",
        "count" : 3
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

Based on that it looks like your nodes each have 16 GB RAM with the heap set to 8 GB. It seems like you are overloading the cluster, so I would recommend reducing the number of pipeline workers. Maybe set it to 2 for all of the pipelines and see if that helps. If that does not help, you may also try reducing the batch size; the default is, I believe, 125, so 1000 is a significant increase and will lead to more data being in flight, requiring more memory on the nodes.
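As a sketch of what I mean, something along these lines in pipelines.yml, repeated for each pipeline:

  - pipeline.id: order
    path.config: "order.conf"
    queue.type: persisted
    pipeline.workers: 2        # reduced from 10
    pipeline.batch.size: 125   # back to the default
    pipeline.batch.delay: 5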

Hi @Christian_Dahlqvist, sure, we will reduce the pipeline workers, but our Elasticsearch and Logstash are running on different servers, so will reducing the pipeline workers for Logstash still improve Elasticsearch cluster performance?

Also, for each pipeline we have set pipeline.batch.delay to 5; does this cause any performance issue?

Also, when it gives the error [logstash.outputs.elasticsearch] "reason"=>"[parent] Data too large", does that refer to the Logstash heap or the Elasticsearch heap?
