Logstash fell under load

Hello. I have three clients. They send logs via Filebeat to the main server, which runs Logstash. A few days ago there was a burst of logs, somewhere around 12M of them. Logstash couldn't process them all and crashed. Does it have any throttling settings to avoid this situation again?

You need to share the pipeline configuration you are using and your logstash.yml to help understand what is happening.

What do you run on the Logstash machine, only Logstash or other applications as well? What are the hardware specs? Are you using persistent queues or in-memory queues?

Also, share the logstash log with the errors when it crashed.

Hi @Andrey_RF

In addition to the above.

Did you make any changes / scale up the JVM heap for Logstash? See here...

If you are on a larger host you can certainly go past the 8GB ceiling; I have had to scale above that for some intensive workloads.

And not to point out the obvious, but there are a couple of good sections in the docs on scaling and performance tuning.
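For reference, the heap is set in config/jvm.options (or, when running the Docker image, via the LS_JAVA_OPTS environment variable); a minimal sketch, with 2g only as an example value to size to your host:

# config/jvm.options -- keep Xms and Xmx equal (2g is an example value)
-Xms2g
-Xmx2g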

pipeline:

input {
  beats {
    port => 5000
  }
}

filter {

  if 'django' in [tags] {
    grok {
      break_on_match => false
      match => {
        "message" => [
          "^%{LOGLEVEL:log-level} %{TIMESTAMP_ISO8601:timestamp}",
          "Пользователь - (?<login>[^\;]+)",
          "Имя: (?<name>[^\;]+)",
          "id - (?<user-id>[^\;]+)",
          "email - (?<email>[^\;]+)",
          "ip - %{IP:ip}",
          'Запрос: "(?<request>[^\"]+)',
          'Метод: "(?<method>[^\"]+)',
          "Модуль: (?<module>[^\;]+)",
          "Функция: (?<func>[^\;]+)"
        ]
      }
    }
  } else if 'gunicorn' in [tags] {
    grok {
      match => { "message" => '^%{LOGLEVEL:log-level} %{TIMESTAMP_ISO8601:timestamp}%{GREEDYDATA:message}' }
      overwrite => [ "message" ]
    }

     grok {
      break_on_match => false
      match => { "message" => [
        '^ \n\tСообщение: %{IP:ip} "%{WORD:http-method} (?<http-url>[^\s]+) (?<http-protocol>[^\"]+)" (?<status>[\d]+) "URL: (?<url>[^\"]+)" "[^\"]+";',
        'В модуле: (?<module>[^\n]+)'
        ]
      }
    }
  } else if 'nginx' in [tags] {
    grok {
      match => {
         "message" => [
           '^%{IP:client-ip} - - \[(?<timestamp>[\d]+/[\w]+/[\d]+:[\d]+:[\d]+:[\d]+) \+[\d]+\] "%{WORD:http-method} (?<http-url>[^\s]+) (?<http-protocol>[^\"]+)" (?<status-code>[\d]+) (?<bytessent>[\d]+) "(?<refferer>[^\"]+)" "(?<user-agent>[^\"]+)" "-"$',
           '^%{IP:client-ip} - - \[(?<timestamp>[\d]+/[\w]+/[\d]+:[\d]+:[\d]+:[\d]+) \+[\d]+\]',
           "^(?<timestamp>[\d]+/[\d]+/[\d]+ [\d]+:[\d]+:[\d]+) \[%{LOGLEVEL:log-level}\]"
         ]
      }
    }
  } else {
    grok {
      match => [
        "message", "(?m)%{TIMESTAMP_ISO8601:timestamp} \[%{LOGLEVEL:log-level}\]%{GREEDYDATA:message}", # rabbitmq.info

        "message", "\[%{TIMESTAMP_ISO8601:timestamp}: %{LOGLEVEL:log-level}", # celery.log

        "message", "%{TIMESTAMP_ISO8601:timestamp}" # попытка просто достать лог
      ]
    }
  }

  date {
      match => ["timestamp", "yyyy-MM-dd HH:mm:ss,SSS", "yyyy-MM-dd HH:mm:ss", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy/MM/dd HH:mm:ss", "dd/MMM/yyyy:HH:mm:ss", "ISO8601"]
      timezone => "Europe/Moscow"
      remove_field => [ "timestamp" ]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logstash-%{[host][hostname]}"
  }
}

filebeat.yml:

# ============================== Filebeat inputs ===============================

filebeat.inputs:

- type: log
  enabled: true

  paths:
    - /var/log/eias-web/backend.*.log

  tags: ["django"]

  multiline.type: pattern
  multiline.pattern: '^INFO|^ERROR|^WARNING|^CRITICAL'
  multiline.negate: true
  multiline.match: after

- type: log
  enabled: true

  paths:
    - /var/log/eias-web/fingerprint.info.log
    - /var/log/eias-web/gunicorn.*.log

  tags: ["gunicorn", "fingerpirnt"]

  multiline.type: pattern
  multiline.pattern: '^INFO|^ERROR|^WARNING|^CRITICAL'
  multiline.negate: true
  multiline.match: after

- type: log
  enabled: true

  paths:
    - /var/log/eias-web/celery.log

  tags: ["celery"]
  multiline.type: pattern
  multiline.pattern: '^\['
  multiline.negate: true
  multiline.match: after

- type: log
  enabled: true

  paths:
    - /var/log/eias-web/crash.*
    - /var/log/eias-web/rabbitmq.*.log

  tags: ["crash", "rabbitmq"]

  multiline.type: pattern
  multiline.pattern: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
  multiline.negate: true
  multiline.match: after

# filestream is an experimental input. It is going to replace log input in the future.
- type: filestream

  # Change to true to enable this input configuration.
  enabled: false

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    - /var/log/*.log
    #- c:\programdata\elasticsearch\logs\*

  
# ============================== Filebeat registry =============================

filebeat.registry.path: ${path.data}/registry
filebeat.registry.file_permissions: 0600


# ============================== Filebeat modules ==============================

filebeat.config.modules:
  # Glob pattern for configuration loading
  path: ${path.config}/modules.d/*.yml

  # Set to true to enable config reloading
  reload.enabled: false

  # Period on which files under path should be checked for changes
  #reload.period: 10s

# ======================= Elasticsearch template setting =======================

setup.template.settings:
  index.number_of_shards: 1
  #index.codec: best_compression
  #_source.enabled: false


# ================================== Outputs ===================================

# Configure what output to use when sending the data collected by the beat.

# ------------------------------ Logstash Output -------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["ip:5000"]

  # Optional SSL. By default is off.
  # List of root certificates for HTTPS server verifications
  #ssl.certificate_authorities: ["/etc/pki/root/ca.pem"]

  # Certificate for SSL client authentication
  #ssl.certificate: "/etc/pki/client/cert.pem"

  # Client Certificate Key
  #ssl.key: "/etc/pki/client/cert.key"

# ================================= Processors =================================
processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~

Logstash, Elasticsearch and Kibana all run on the same machine. It has 192 GB of free disk space. As for the queues, I don't know :slight_smile: How can I check that?

Logstash runs in Docker and its logs aren't saved anywhere.

Here are the docker-compose settings for Logstash:

logstash:
    build:
      context: logstash/
    container_name: logstash
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
      - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
    ports:
      - "5000:5000"
    environment:
      LS_JAVA_OPTS: "-Xmx256m -Xms256m"
    networks:
      - elk
    depends_on:
      - elasticsearch

The first thing I would try is setting the Logstash heap to 4GB.

You could mount the Logstash logs to a volume so you could see them.

How much heap are you giving elasticsearch?

How much total RAM and CPU on the server this is all running on?
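For example, a minimal sketch of those two changes against the compose snippet above (./logstash/logs is an assumed host directory, and files only appear there if Logstash is configured to log to files rather than stdout; otherwise "docker logs logstash" also shows them):

logstash:
    build:
      context: logstash/
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
      - ./logstash/pipeline:/usr/share/logstash/pipeline:ro
      # assumed host path so the Logstash logs survive container restarts
      - ./logstash/logs:/usr/share/logstash/logs
    environment:
      # raise the heap from 256m (example value -- size it to the host)
      LS_JAVA_OPTS: "-Xmx4g -Xms4g"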

  elasticsearch:
    build:
      context: elasticsearch/
    container_name: elasticsearch
    volumes:
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
      - /home/elasticsearch:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    environment:
      ES_JAVA_OPTS: "-Xmx4096m -Xms4096m"
free -h
              total        used        free      shared  buff/cache   available
Mem:            11G        5.5G        317M         69M        5.7G        5.7G
Swap:          5.9G         39M        5.8G
cat /proc/cpuinfo | grep core
cpu cores	: 2
cpu MHz		: 1995.000

Ok

So 4GB goes to Elasticsearch.
So try giving 2GB to Logstash.

Not a huge server to run this all on...

Only 2 cores... that Elasticsearch and Logstash are fighting over.

It is generally not best practice to run Logstash and Elasticsearch on the same server.

With Docker it's so-so...

Maybe for small testing, but for production you would need a bigger server... and make sure Elasticsearch and Logstash have plenty of RAM and CPU.

The 11M logs over what time frame?

Our website is not that big right now. It's only 200-400 requests per hour and ELK works fine. We will extend the machine if the load increases. I just want to find something like throttling to avoid a situation like this.

It was 11M logs over 15-30 minutes.

Well, that is quite a spike, right?

10M events in 30 minutes is about 5.5K events/sec, which is tens of thousands of times your normal load... It's going to be very hard to design a system that is sized for only 400 events per hour but can also handle 5,000 events per second; those are two pretty different systems.

BUT that said, you could look at the persistent queue... but I'm still not sure that'll work.

If you know when that spike is going to come you could scale up Logstash and then scale it back down.

But this is the classic system design question of designing for average or peak usage... There are always trade-offs.

I see. It was the first time this has ever happened. Maybe it was a DDoS or something else.

I will read about the queue in Logstash, thank you.

Yeah, usually people try to detect DDoS closer to the edge / firewall etc. Or you could do something like Kafka, which is much better at managing back pressure. But go ahead and try the persistent queue first... that is a huge spike though.
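If you ever go the Kafka route, the usual shape is Filebeat producing to a Kafka topic and Logstash consuming from it; a rough sketch (the kafka:9092 broker and the filebeat-logs topic are placeholders):

# filebeat.yml -- replaces output.logstash (placeholder broker and topic)
output.kafka:
  hosts: ["kafka:9092"]
  topic: "filebeat-logs"

# Logstash pipeline -- replaces the beats input
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["filebeat-logs"]
    codec => json    # Filebeat publishes events as JSON
  }
}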

Ohhh ... apologies @Andrey_RF

I may have misunderstood: was this the very first time you started Logstash? If so, it may have gone back and read all the old logs / files in those directories... that's what Filebeat does on first start, so it may have tried to load all the old logs...

Or was it definitely a real spike in time?

No no no. It had been working for a week before it fell over. The logs have correct timestamps, so you can see the spike on a chart.

Yup, that won't be easy... :slight_smile:

You could try putting in more heap and some persistent queues... that may or may not help.

While the system keeps up... there will be nothing in the queue.
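For reference, the persistent queue is turned on in logstash.yml; a minimal sketch (the 2gb cap is just an example, and since Logstash runs in Docker here the data directory holding the queue should sit on a mounted volume so it survives restarts):

# logstash.yml -- switch from the default in-memory queue to an on-disk queue
queue.type: persisted
# limit how much disk the queue may use before it applies back pressure (example value)
queue.max_bytes: 2gb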

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.