Ingest Rate in Elasticsearch is Slow

Hi Team,

We have a scenario where we want to load 1 billion records into an Elasticsearch cluster (two nodes), and we are doing this with Filebeat. According to the Kibana dashboard, the ingest rate is below 5k/s. We increased harvestor_buffer_size and the number of workers, but it did not help. Below is my filebeat.yml file.
Note: We are splitting the files across two servers and loading the data with two Filebeat instances, one on each server.

###################### Filebeat Configuration Example #########################

# This file is an example configuration file highlighting only the most common
# options. The filebeat.reference.yml file from the same directory contains all the
# supported options with more comments. You can use it as a reference.
#
# You can find the full configuration reference here:
# https://www.elastic.co/guide/en/beats/filebeat/index.html

# For more available modules and options, please see the filebeat.reference.yml sample
# configuration file.

# ============================== Filebeat inputs ===============================

filebeat.inputs:

# Each - is an input. Most options can be set at the input level, so
# you can use different inputs for various configurations.
# Below are the input specific configurations.

# filestream is an input for collecting log messages from files.

  # Unique ID among all inputs, an ID is required.

  # Change to true to enable this input configuration.

  # Paths that should be crawled and fetched. Glob based paths.

- type: log
  enabled: true
  paths:
   - /disk2/ss7-edr-generator/result/*.csv # path to your CSV file
  exclude_lines: ['^\"\"']          # header line
  index: elastic
  pipeline: parse_elastic_data
  harvestor_buffer_size: 40960000



  # Exclude lines. A list of regular expressions to match. It drops the lines that are
  # matching any regular expression from the list.
  # Line filtering happens after the parsers pipeline. If you would like to filter lines
  # before parsers, use include_message parser.
  #exclude_lines: ['^DBG']

  # Include lines. A list of regular expressions to match. It exports the lines that are
  # matching any regular expression from the list.
  # Line filtering happens after the parsers pipeline. If you would like to filter lines
  # before parsers, use include_message parser.
  #include_lines: ['^ERR', '^WARN']

  # Exclude files. A list of regular expressions to match. Filebeat drops the files that
  # are matching any regular expression from the list. By default, no files are dropped.
  #prospector.scanner.exclude_files: ['.gz$']

  # Optional additional fields. These fields can be freely picked
  # to add additional information to the crawled log files for filtering
  #fields:
  #  level: debug
  #  review: 1

# ============================== Filebeat modules ==============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml
  # Set the number of workers
  number_of_workers: 2
  bulk_max_size: 4096

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

# ======================= Elasticsearch template setting =======================

setup.template.settings:
  index.number_of_shards: 1
  #index.codec: best_compression
  #_source.enabled: false


# ================================== General ===================================

# The name of the shipper that publishes the network data. It can be used to group
# all the transactions sent by a single shipper in the web interface.
#name:

# The tags of the shipper are included in their own field with each
# transaction published.
#tags: ["service-X", "web-tier"]

# Optional fields that you can specify to add additional information to the
# output.
#fields:
#  env: staging

# ================================= Dashboards =================================
# These settings control loading the sample dashboards to the Kibana index. Loading
# the dashboards is disabled by default and can be enabled either by setting the
# options here or by using the `setup` command.
#setup.dashboards.enabled: false

# The URL from where to download the dashboards archive. By default this URL
# has a value which is computed based on the Beat name and version. For released
# versions, this URL points to the dashboard archive on the artifacts.elastic.co
# website.
#setup.dashboards.url:

# =================================== Kibana ===================================

# Starting with Beats version 6.0.0, the dashboards are loaded via the Kibana API.
# This requires a Kibana endpoint configuration.
setup.kibana:

  # Kibana Host
  # Scheme and port can be left out and will be set to the default (http and 5601)
  # In case you specify an additional path, the scheme is required: http://localhost:5601/path
  # IPv6 addresses should always be defined as: https://[2001:db8::1]:5601
  host: "http://xx.xx.xx.xx:5601"

  # Kibana Space ID
  # ID of the Kibana Space into which the dashboards should be loaded. By default,
  # the Default Space will be used.
  #space.id:

# =============================== Elastic Cloud ================================

# These settings simplify using Filebeat with the Elastic Cloud (https://cloud.elastic.co/).

# The cloud.id setting overwrites the `output.elasticsearch.hosts` and
# `setup.kibana.host` options.
# You can find the `cloud.id` in the Elastic Cloud web UI.
#cloud.id:

# The cloud.auth setting overwrites the `output.elasticsearch.username` and
# `output.elasticsearch.password` settings. The format is `<user>:<pass>`.
#cloud.auth:

# ================================== Outputs ===================================

# Configure what output to use when sending the data collected by the beat.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["https://xx.xx.xx.xx:9200","https://xx.xx.xx.xx:9200"]
  worker: 8
  bulk_max_size: 3000
  # Protocol - either `http` (default) or `https`.
  protocol: "https"

  # Authentication credentials - either API key or username/password.
  # api_key: "id:api_key"
  username: "elastic"
  password: "elastic"
  ssl:
    enabled: true
    certificate_authorities: ["/etc/filebeat/certs/cert.pem"]
# ------------------------------ Logstash Output -------------------------------
#output.logstash:
  # The Logstash hosts
  #hosts: ["localhost:5044"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

# ================================= Processors =================================
processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

# ================================== Logging ===================================

# Sets log level. The default log level is info.
# Available log levels are: error, warning, info, debug
logging:
  level: info
  to_files: true
  files:
    path: /var/log/filebeat
    name: filebeat.log
    keepfiles: 7

# At debug level, you can selectively enable logging only for some components.
# To enable all selectors use ["*"]. Examples of other selectors are "beat",
# "publisher", "service".
#logging.selectors: ["*"]

path.logs: /var/log/filebeat

# ============================= X-Pack Monitoring ==============================
# Filebeat can export internal metrics to a central Elasticsearch monitoring
# cluster.  This requires xpack monitoring to be enabled in Elasticsearch.  The
# reporting is disabled by default.

# Set to true to enable the monitoring reporter.
#monitoring.enabled: false

monitoring:
  enabled: true
  elasticsearch:
    username: beats_system
    password: beats_system

# Sets the UUID of the Elasticsearch cluster under which monitoring data for this
# Filebeat instance will appear in the Stack Monitoring UI. If output.elasticsearch
# is enabled, the UUID is derived from the Elasticsearch cluster referenced by output.elasticsearch.
#monitoring.cluster_uuid:

# Uncomment to send the metrics to Elasticsearch. Most settings from the
# Elasticsearch output are accepted here as well.
# Note that the settings should point to your Elasticsearch *monitoring* cluster.
# Any setting that is not set is automatically inherited from the Elasticsearch
# output configuration, so if you have the Elasticsearch output configured such
# that it is pointing to your Elasticsearch monitoring cluster, you can simply
# uncomment the following line.
#monitoring.elasticsearch:

# ============================== Instrumentation ===============================

# Instrumentation support for the filebeat.
#instrumentation:
    # Set to true to enable instrumentation of filebeat.
    #enabled: false

    # Environment in which filebeat is running on (eg: staging, production, etc.)
    #environment: ""

    # APM Server hosts to report instrumentation results to.
    #hosts:
    #  - http://localhost:8200

    # API Key for the APM Server(s).
    # If api_key is set then secret_token will be ignored.
    #api_key:

    # Secret token for the APM Server(s).
    #secret_token:


# ================================= Migration ==================================

# This allows to enable 6.7 migration aliases
#migration.6_to_7.enabled: true

Please advise if any parameter can speed up the ingest rate.

Thanks,
Debasis
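
One detail in the configuration above may be worth checking before any further tuning. The documented Filebeat input option is spelled `harvester_buffer_size`, so the `harvestor_buffer_size` line is probably being ignored rather than applied. Similarly, `number_of_workers` and `bulk_max_size` are not documented options of `filebeat.config.modules`, which only controls how module configuration files are loaded. The fragment below is a minimal sketch of those two sections with both points addressed. The buffer value and the `close_eof` setting are illustrative assumptions, not measured recommendations.

```yaml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /disk2/ss7-edr-generator/result/*.csv
  exclude_lines: ['^\"\"']
  index: elastic
  pipeline: parse_elastic_data
  # Documented spelling is "harvester", not "harvestor".
  # 65536 is an illustrative assumption; the default is 16384 bytes.
  harvester_buffer_size: 65536
  # Assumption: the CSV files are complete before Filebeat reads them,
  # so each harvester can close as soon as it reaches end of file.
  close_eof: true

filebeat.config.modules:
  # Only module-loading options belong here; worker and batch settings
  # are configured under output.elasticsearch instead.
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
```

Whether a larger read buffer helps at all depends on line length and disk behaviour, so it would need to be measured against the default.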

Did this improve the ingest rate?

What does Filebeat CPU usage look like?

How many files do you have in the directories being read? What is the average file size? What type of storage is this on?

It may also be worthwhile verifying that your Elasticsearch cluster is not the bottleneck. What is the specification of the Elasticsearch nodes with respect to CPU, RAM, heap and type of storage used?

What does CPU usage look like? What does disk I/O and await look like?

Hi @Christian_Dahlqvist, thanks for the response.

Please find my answers below.

Did this improve the ingest rate?
[Debasis]: Yes. Previously (two weeks ago, with 600M records) it helped us reach an ingest rate of 15k/s. Is there any other way to improve the ingest rate, since we need to load 1.5 billion records?

What does Filebeat CPU usage look like?
[Debasis]: On both servers, Filebeat CPU and memory usage is normal. We did not see any bottleneck.

How many files do you have in the directories being read? What is the average file size? What type of storage is this on?
[Debasis]: Each directory has around 25k files, and each file is 25 MB. The disks in both the Elasticsearch and Filebeat servers are SSDs.

It may also be worthwhile verifying that your Elasticsearch cluster is not the bottleneck. What is the specification of the Elasticsearch nodes with respect to CPU, RAM, heap and type of storage used?
[Debasis]: We checked the Grafana console and there is no bottleneck with respect to CPU or memory. However, iostat (iostat -ckNx 4) shows that disk wkB/s peaks at only about 700 kB/s, which we are checking with the IT team.
The storage in the Elasticsearch nodes is SSD. The JVM heap is 4 GB on each node.

What does CPU usage look like? What does disk I/O and await look like?
[Debasis]: CPU usage is very low, less than 30 percent, and there are no I/O waits either.

Thanks,
Debasis

Are these local SSDs or some kind of SSD-based storage accessed across a network?

4 GB of heap sounds quite low for a high-ingest scenario where ingest pipelines are used. What does heap usage look like in your monitoring? Is there any evidence in the Elasticsearch logs of frequent or long GC?

Could you please let me know how to check this? According to the Kibana dashboard, JVM heap usage is below the maximum on both nodes.
[Image: jvm]

Thanks,
Debasis

@Christian_Dahlqvist Did you get a chance to look into the above issue?

Thanks,
Debasis

@Christian_Dahlqvist Could you please share your thoughts?

Thanks,
Debasis

The heap usage looks good, so that does not seem to be the issue.

You did not answer my question about the storage. If you are using local SSDs and I/O stats (await) look fine it is likely that it is Filebeat not feeding data at a sufficient rate.

I have never configured or optimised Filebeat for reading large volumes of data files like in your use case, so I will need to leave input on that to someone else.

Yes, these are local SSDs. So are you suggesting that we should use Logstash to load such a large volume of data? Please advise.

Thanks,
Debasis

I do not know whether Logstash would make a difference or not, nor whether it is possible to better tune Filebeat for your use case.
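
For readers with the same symptoms (Elasticsearch nodes mostly idle, low Filebeat CPU usage), the Filebeat settings usually examined first are the internal memory queue and the Elasticsearch output, because the queue has to hold enough events to keep every output worker supplied with a full bulk request. The fragment below is a sketch only. It reuses the host placeholders, `worker`, and `bulk_max_size` values from the posted configuration, and the queue sizes and compression level are assumptions to test, not values confirmed anywhere in this thread.

```yaml
# Internal memory queue shared by all inputs.
queue.mem:
  # Assumption: sized above hosts * worker * bulk_max_size
  # (2 * 8 * 3000 = 48000), since `worker` applies per configured host.
  events: 65536
  # Assumption: aligned with bulk_max_size; the exact semantics of this
  # option differ between Filebeat versions.
  flush.min_events: 3000
  flush.timeout: 1s

output.elasticsearch:
  hosts: ["https://xx.xx.xx.xx:9200", "https://xx.xx.xx.xx:9200"]
  worker: 8
  bulk_max_size: 3000
  # Assumption: light gzip compression trades some CPU for less network traffic.
  compression_level: 1
```

A larger queue increases Filebeat memory usage, so any change like this is best applied one setting at a time while watching the ingest rate in Kibana. The metadata processors in the posted file (`add_docker_metadata`, `add_kubernetes_metadata`) also add per-event work, and whether removing them helps would likewise need measuring.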