Weird performance issues

Hello,

I have a cluster with 6 ES nodes, 3 Logstash servers, and a number of servers shipping logs with Filebeat. Versions are 6.8.1 for Filebeat, 6.6.1 for Logstash and 6.7.1 for ES. Cluster state is green.

The load on our platform is increasing considerably and we have huge issues at the moment.
I'll try to describe the stack.

All servers are VMware VMs.
ES servers:
8 cores, 32 GB RAM, of which 15 GB is set as heap.
Indices are rotated daily (we can only keep data for 14 days due to GDPR).
Currently there are 191 indices containing 16 billion documents with a total size of 12 TB.

ELASTICSEARCH:
Process:
elastic+ 6548 1 97 10:32 ? 03:55:37 /bin/java -Xms15g -Xmx15g -XX:+UseG1GC -XX:MaxGCPauseMillis=400 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/var/elk/tmp/ -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/elk/es -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
elastic+ 6640 6548 0 10:32 ? 00:00:00 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

settings from elasticsearch.yml:
cluster.name: adc-elk
node.name: pdbs250.grn.prd.itv.local
node.master: false
node.data: true
path.data: /var/elk/es/data
path.logs: /var/elk/es/logs
bootstrap.memory_lock: true
network.host: 10.222.249.42
transport.host: 10.222.249.42
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.unicast.hosts: ["server1", "server2", "server3", "server4", "server5", "server6"]
discovery.zen.minimum_master_nodes: 1
thread_pool.search.queue_size: 10000
thread_pool.index.queue_size: 1000

There are more indices in trouble (not all of them), but this is the most troublesome (and one of the 2 heaviest loaded):
{
"index_patterns": [
"agl-api-"
],
"settings": {
"index": {
"refresh_interval": "30s",
"number_of_shards": "5",
"number_of_replicas": "1"
}
},
"mappings": {
"doc": {
"dynamic_templates": [
{
"message_field": {
"path_match": "message",
"match_mapping_type": "string",
"mapping": {
"norms": false,
"type": "text"
}
}
},
{
"string_fields": {
"match": "
",
"match_mapping_type": "string",
"mapping": {
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
},
"norms": false,
"type": "text"
}
}
}
],
"properties": {
"@timestamp": {
"type": "date"
},
"@version": {
"type": "keyword",
"norms": false
},
"additional-data": {
"type": "text",
"norms": false
},
"api-brand": {
"type": "keyword",
"norms": false
},
"api-query-string": {
"type": "text",
"norms": false
},
"api-url": {
"type": "text",
"norms": false
},
"api-tenant": {
"type": "keyword",
"norms": false
},
"api-version": {
"type": "keyword",
"norms": false
},
"api-platform": {
"type": "keyword",
"norms": false
},
"api-nanoservice": {
"type": "keyword",
"norms": false
},
"app-id": {
"type": "keyword",
"norms": false
},
"beat": {
"properties": {
"hostname": {
"type": "text",
"norms": false
},
"name": {
"type": "text",
"norms": false
},
"version": {
"type": "text",
"norms": false
}
}
},
"client-ip": {
"type": "ip"
},
"event-type": {
"type": "keyword",
"norms": false
},
"execution-time": {
"type": "integer"
},
"fields": {
"properties": {
"environment": {
"type": "keyword",
"norms": false
}
}
},
"host": {
"type": "keyword",
"norms": false
},
"http-verb": {
"type": "keyword",
"norms": false
},
"log-level": {
"type": "keyword",
"norms": false
},
"log-message": {
"type": "text",
"norms": false
},
"ms-error-message": {
"type": "text",
"norms": false
},
"ms-request-body": {
"type": "text",
"norms": false
},
"ms-result": {
"type": "keyword",
"norms": false
},
"ms-result-code": {
"type": "text",
"norms": false
},
"ms-url": {
"type": "text",
"norms": false
},
"ms-host": {
"type": "keyword",
"norms": false
},
"offset": {
"type": "long"
},
"platform": {
"type": "keyword",
"norms": false
},
"prospector": {
"properties": {
"type": {
"type": "text",
"norms": false
}
}
},
"session-id": {
"type": "text",
"norms": false
},
"source": {
"type": "text",
"norms": false
},
"tags": {
"type": "text",
"norms": false
},
"transaction-id": {
"type": "text",
"norms": false
},
"username": {
"type": "text",
"norms": false
},
"user-id": {
"type": "text",
"norms": false
}
}
}
}
}

After rollover of the logs at 1 AM, it functions for about an hour and I see data from all 32 log shippers. After that it gradually declines until, extremely slowly, data from only 1 comes in.

LOGSTASH:
3 Logstash servers with 6 cores and 16 GB of RAM.
Process:
logstash 35553 1 99 14:49 ? 00:01:08 /bin/java -Xms8g -Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.io.tmpdir=/var/tmp -Djava.awt.headless=true -Dfile.encoding=UTF-8 -XX:+HeapDumpOnOutOfMemoryError -cp /usr/share/logstash/logstash-core/lib/jars/animal-sniffer-annotations-1.14.jar:/usr/share/logstash/logstash-core/lib/jars/commons-codec-1.11.jar:/usr/share/logstash/logstash-core/lib/jars/commons-compiler-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/error_prone_annotations-2.0.18.jar:/usr/share/logstash/logstash-core/lib/jars/google-java-format-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/gradle-license-report-0.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/guava-22.0.jar:/usr/share/logstash/logstash-core/lib/jars/j2objc-annotations-1.1.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-annotations-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-core-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-databind-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/jackson-dataformat-cbor-2.9.5.jar:/usr/share/logstash/logstash-core/lib/jars/janino-3.0.8.jar:/usr/share/logstash/logstash-core/lib/jars/javassist-3.22.0-GA.jar:/usr/share/logstash/logstash-core/lib/jars/jruby-complete-9.1.13.0.jar:/usr/share/logstash/logstash-core/lib/jars/jsr305-1.3.9.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-api-2.9.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-core-2.9.1.jar:/usr/share/logstash/logstash-core/lib/jars/log4j-slf4j-impl-2.9.1.jar:/usr/share/logstash/logstash-core/lib/jars/logstash-core.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.commands-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.contenttype-3.4.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.expressions-3.4.300.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.filesystem-1.3.100
.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.jobs-3.5.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.resources-3.7.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.core.runtime-3.7.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.app-1.3.100.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.common-3.6.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.preferences-3.4.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.equinox.registry-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.jdt.core-3.10.0.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.osgi-3.7.1.jar:/usr/share/logstash/logstash-core/lib/jars/org.eclipse.text-3.5.101.jar:/usr/share/logstash/logstash-core/lib/jars/slf4j-api-1.7.25.jar org.logstash.Logstash --path.settings /etc/logstash

Is it necessary to post filters and output as well? Output to different ES nodes is done based on tags given by Filebeat.

logstash.yml settings:
path.data: /var/lib/logstash
pipeline.workers: 16
pipeline.batch.size: 200
config.support_escapes: true

queue.type: persisted
log.level: info
path.logs: /var/log/logstash

I tried creating a 2nd pipeline but this did nothing to help, so for the sake of debugging I reverted back to only port 5044.

FILEBEAT:
Settings:

filebeat.inputs:
- type: log
  enabled: true

  paths:
    - /product/AGL/agl-core/logs/agl.log

  exclude_files: ['\.gz$']

  multiline.pattern: '^ts:'
  multiline.negate: true
  multiline.match: after

  tags: ["avs6", "api-log", "apigateway", "asd"]

  ignore_older: 336h
  close_inactive: 5m
  close_removed: true
  clean_removed: true

  harvester_limit: 0

filebeat.config.modules:
  enabled: false

processors:
  - drop_fields:
      fields: ["host"]

fields:
  environment: production

queue.mem:
  events: 4096
  flush.min_events: 1024
  flush.timeout: 10s

output.logstash:
  enabled: true
  hosts: ["logstash1:5044","logstash2:5044","logstash3:5044"]

  loadbalance: true
  timeout: 5m
  bulk_max_size: 2048
  slow_start: true

logging:
  level: info
  metrics:
    enabled: false

The logs in question grow to 8 to 10 GB a day per server.

What am I missing?

Interesting - you seem to have enough hardware, not too many indices or shards, and modest data. A 30s refresh is good, 5 shards is not excessive, and a single replica is good.

Presumably you are overloading something such as disk I/O as the data starts to grow, but an hour into a 10 GB day means you've only received about 500 MB of data (or less, as it's 1 AM), so this makes no sense.

To confirm, the core issue is that things work fine, but every day, one hour after rollover (all new indices), ingest gets slow?

What is the cluster architecture in terms of master, data, ingest nodes, etc.?

Anything obvious in CPU, heap use, or I/O? (What are the disks, and where - e.g. VMware to a SAN or something?)

Do you have Kibana monitoring on, so you can look at graphs for various things, like heap/GCs?

Are you seeing queues building in Elasticsearch, or any errors returned to your clients?

It seems odd that a six-node ES cluster starts choking on 500 MB of logs an hour after rollover.
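If you want to check for queueing directly, the cat APIs give a quick view. A sketch, using one of your node names purely as an example endpoint:

```shell
# Per-node thread pool depth and rejections for the write and search pools (ES 6.x)
curl -s 'http://pdbs250.grn.prd.itv.local:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'

# Hot threads often show what indexing time is actually being spent on
curl -s 'http://pdbs250.grn.prd.itv.local:9200/_nodes/hot_threads?threads=3'
```

Non-zero `rejected` counts on the write pool would point at the cluster, rather than Logstash or Filebeat, as the bottleneck.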

Well, this night the load was a little lower, so maybe that is why it took longer, but after a certain amount of time, yes, indexing seems to stall.

Is it OK to send screenshots of the Kibana monitoring graphs?

Yes - it sure looks bad, but can you answer my questions?

Hi,

The cluster is 6 nodes, 3 of them master-eligible; all 6 are data nodes (I used to have 1 dedicated coordinating node, but I changed this to spread indexing load).
There are 3 Logstash servers. The Logstashes are fed by Filebeats, which have them configured load-balanced on port 5044. The outputs of the Logstashes differ from each other because I tried to spread load across the cluster. Output example of 1 of the Logstashes:

output {
        # Send the events to Elasticsearch, with a different index based on the log file the events came from.
        if "api-log" in [tags] and "apigateway" in [tags] {
                elasticsearch {
                        hosts => ["pdbs250.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "agl-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "api-log" in [tags] and "userprofile" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "userprofile-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "access-log" in [tags] and "apicache" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "apicache-access-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "api-log" in [tags] and "concurrentstreams" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "concurrentstreams-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "api-log" in [tags] and "devicemanagement" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "devicemanagement-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "api-log" in [tags] and "userentitlement" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "userentitlement-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "access-log" in [tags] and "dcqcache" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "dcqcache-access-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "api-log" in [tags] and "authentication" in [tags] {
                elasticsearch {
                        hosts => ["pdbs251.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "authentication-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "api-log" in [tags] and "rpgw" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        # hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "rpgw-api-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        else if "tls-proxy" in [tags] {
                elasticsearch {
                        hosts => ["pdbs287.grn.prd.itv.local:9200"]
                        #hosts => ["elasticsearch:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "apicache-tlsproxy-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }
        #Just some debug output. Remove for production.
        #  stdout {
        #       codec => rubydebug
        #  }

        ### More debug output, do not use in production.
        # file {
        #       path => "/tmp/logstash.output"
        # }
}

There are a number of filter files. This is the file containing filters for the most important indices:

filter {
        if "avs6" in [tags] and "api-log" in [tags] and "apigateway" in [tags] {
                mutate {
                strip => ["message"]
                }

                grok {
                        pattern_definitions => {
                                "TIMESTAMP" => "\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3}"
                                "DELIM" => "\s*\|\s*|"
                                "STRING" => "[^|]+"
                                "GUID" => "[a-zA-Z0-9\-]+"
                        }

                        keep_empty_captures => false
                        named_captures_only => true
                        match => { "message" => "^ts:\s*%{TIMESTAMP:ts}%{DELIM}logLevel:\s*%{WORD:log-level}%{DELIM}appId:\s*%{STRING:app-id}%{DELIM}thread:\s*(%{INT})?%{DELIM}SID:\s*(%{GUID:session-id})?%{DELIM}TN:\s*(%{GUID:transaction-id})?%{DELIM}clientIp:\s*(%{STRING:client-ip})?%{DELIM}userId:\s*%{USERNAME:user-id}%{DELIM}apiType:\s*%{STRING}%{DELIM}api:\s*(%{STRING:api-url})?%{DELIM}platform:\s*(%{WORD:platform})?%{DELIM}eventType:\s*%{WORD:event-type}%{GREEDYDATA:additional-data}" }
                }

                mutate {
                        strip => ["app-id", "client-ip", "api-url"]
                }

                if [log-level] == "DEBUG" {
                        drop { }
                }
                else if [event-type] == "API_CALL" {
                        grok {
                                pattern_definitions => {
                                        "STRING" => "[^|]+"
                                }

                                keep_empty_captures => false
                                named_captures_only => true
                                match => { "additional-data" => "message:\s*%{STRING:ms-url}\|?(%{STRING:ms-result-code})?\|?(%{STRING:ms-result})?\|?(%{INT:execution-time})?" }
                        }

                        mutate {
                                strip => ["ms-url", "ms-result-code", "ms-result"]
                        }
                }
                else if [event-type] == "NONE" {
                        grok {
                                keep_empty_captures => false
                                named_captures_only => true
                                match => { "additional-data" => "message:\s*%{GREEDYDATA:log-message}" }
                        }

                        mutate {
                                strip => ["log-message"]
                        }
                }
                else if [event-type] == "APIEND" {
                        grok {
                                pattern_definitions => {
                                        "DELIM" => "\s*\|\s*|"
                                        "STRING" => "[^|]+"
                                }

                                keep_empty_captures => false
                                named_captures_only => true
                                match => {
                                        "additional-data" => [
                                                "RC:\s*%{STRING:ms-result}%{DELIM}executionTime\(ms\):\s*%{INT:execution-time}",
                                                "RC:\s*%{STRING:ms-result}%{DELIM}qs:\s*%{STRING:api-query-string}%{DELIM}executionTime\(ms\):\s*%{INT:execution-time}"
                                        ]
                                }
                        }

                        mutate {
                                strip => ["ms-result", "api-query-string"]
                        }
                }

                if [ms-url] and [ms-url] != "" {
                        grok { match => { "ms-url" => [ "[htpsHTPS]+://(?<ms-host>[a-zA-Z0-9_\-\.]+)" ] } }
                }

                if [api-url] and [api-url] != "" {
                        mutate { strip => ["api-url"] }

                        grok {
                                tag_on_failure => ["_apiparsefailure"]
                                match => {
                                        "api-url" => [ "%{WORD:http-verb}\s*/%{INT:api-tenant}/(?<api-version>[\d\.]+)/[a-zA-Z]+/[a-zA-Z]+/%{WORD:api-platform}/(?<api-brand>[a-zA-Z0-9]+)/(?<api-nanoservice>[a-zA-Z/]+)" ]
                                }
                        }

                        mutate { gsub => [ "api-nanoservice", "/$", "" ] }
                        mutate { uppercase => [ "http-verb", "api-platform", "api-brand", "api-nanoservice" ] }
                }

                if [api-nanoservice] =~ /(?i)USER\/DEVICES\// {
                        mutate { replace => { "api-nanoservice" => "USER/DEVICES" } }
                }
                else if [api-nanoservice] =~ /(?i)Page\// {
                        mutate { replace => { "api-nanoservice" => "PAGE" } }
                }

                date {
                        match => ["ts", "yyyy-MM-dd HH:mm:ss.SSS"]
                        timezone => "CET"
                        target => "@timestamp"
                        remove_field => [ "ts" ]
                }

                fingerprint {
                        concatenate_sources => true
                        source => ["message","beat.hostname"]
                        target => "[@metadata][fingerprint]"
                        method => "SHA1"
                        key => "deduplication-key"
                }

                if [client-ip] and [client-ip] == "" {
                        mutate {
                                remove_field => [ "client-ip" ]
                        }
                }

                if [execution-time] and ([execution-time] == "" or [execution-time] == "-") {
                        mutate {
                                remove_field => [ "execution-time" ]
                        }
                }

                if [ms-result] and [ms-result] == "" {
                        mutate {
                                remove_field => [ "ms-result" ]
                        }
                }

                if [session-id] == "undefined" {
                        mutate {
                                remove_field => [ "session-id" ]
                        }
                }

                if [transaction-id] == "undefined" {
                        mutate {
                                remove_field => [ "transaction-id" ]
                        }
                }

                if [platform] == "undefined" {
                        mutate {
                                remove_field => [ "platform" ]
                        }
                }

                if [api-url] == "undefined" {
                        mutate {
                                remove_field => [ "api-url" ]
                        }
                }

                if [client-ip] == "undefined" {
                        mutate {
                                remove_field => [ "client-ip" ]
                        }
                }

                if [additional-data] {
                        mutate {
                                remove_field => [ "additional-data" ]
                        }
                }

                mutate {
                        remove_field => [ "message" ]
                        remove_tag => ["beats_input_codec_plain_applied"]
                }
        }
}

This night I had the rolled-over logs shipped from 4 servers with Filebeat. That keeps running. At noon (20 mins ago) I added a 5th server. I see the offset change in the Filebeat registry, so it is processing. However, I am not getting any data into Elasticsearch.

In logstash-plain.log on the only running Logstash I see connection errors to 1 of the ES nodes. But in ElasticHQ all seems well, I can telnet to port 9200 on that server, and in the logging of that Elasticsearch node I see no errors.

The errors in Logstash:

[2020-08-25T12:00:07,684][WARN ][logstash.outputs.elasticsearch] Marking url as dead. Last error: [LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError] Elasticsearch Unreachable: [http://pdbs287.grn.prd.itv.local:9200/][Manticore::ClientProtocolException] pdbs287.grn.prd.itv.local:9200 failed to respond {:url=>http://pdbs287.grn.prd.itv.local:9200/, :error_message=>"Elasticsearch Unreachable: [http://pdbs287.grn.prd.itv.local:9200/][Manticore::ClientProtocolException] pdbs287.grn.prd.itv.local:9200 failed to respond", :error_class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError"}

[2020-08-25T12:00:07,688][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch' but Elasticsearch appears to be unreachable or down! {:error_message=>"Elasticsearch Unreachable: [http://pdbs287.grn.prd.itv.local:9200/][Manticore::ClientProtocolException] pdbs287.grn.prd.itv.local:9200 failed to respond", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError", :will_retry_in_seconds=>2}


[2020-08-25T12:00:09,761][ERROR][logstash.outputs.elasticsearch] Attempted to send a bulk request to elasticsearch, but no there are no living connections in the connection pool. Perhaps Elasticsearch is unreachable or down? {:error_message=>"No Available connections", :class=>"LogStash::Outputs::ElasticSearch::HttpClient::Pool::NoConnectionAvailableError", :will_retry_in_seconds=>4}
[2020-08-25T12:00:09,978][WARN ][logstash.outputs.elasticsearch] Attempted to resurrect connection to dead ES instance, but got an error. {:url=>"http://pdbs287.grn.prd.itv.local:9200/", :error_type=>LogStash::Outputs::ElasticSearch::HttpClient::Pool::HostUnreachableError, :error=>"Elasticsearch Unreachable: [http://pdbs287.grn.prd.itv.local:9200/][Manticore::ClientProtocolException] pdbs287.grn.prd.itv.local:9200 failed to respond"}

This is the only server I see errors for, so it does connect to the other nodes. And this pdbs287 is not an output target for logs from the 5 servers I have now set online.

Will try to post graphs; it takes a while.

Logstash by default processes events in batches of 125, if I remember correctly. The fact that you are breaking the batch down into many Elasticsearch outputs means it may end up sending lots of small bulk requests, which can be quite inefficient. I would recommend either consolidating into a single output plugin or increasing the internal batch size in Logstash.
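If you go the batch-size route, that is set in logstash.yml; the values below are purely illustrative and would need tuning against your heap and any ES bulk rejections:

```yaml
# logstash.yml - illustrative values, not a recommendation for your exact load
pipeline.batch.size: 1000   # events per worker batch (default 125)
pipeline.batch.delay: 50    # ms to wait for a batch to fill before flushing
```

Larger batches mean more in-flight events per worker, so keep an eye on Logstash heap when raising this.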

OK, so you actually mean... send all 'else ifs' to the same server? Or skip the whole conditional stuff? If so, where can I set the index name and document ID per type?

(Edited the output plugin code above - I had actually posted a template instead of the real file.)

Since I have 1 Logstash running now, and 1 type of service with 5 nodes that send logs, here are the stats of the ES node that receives those logs.

The gap in the middle is a period where I tried a different output config which did not work.

I added

### Add an indexbasename to create correct indexname.
mutate { 
     add_field => { "[@metadata][indexbasename]" => "agl-api" } 
}

to the filter, and changed the output config to:

        if "api-log" in [tags] and "apigateway" in [tags] {
                elasticsearch {
                        hosts => ["pdbs250.grn.prd.itv.local:9200","pdbs251.grn.prd.itv.local:9200","pdbs247.grn.prd.itv.local:9200","pdbs248.grn.prd.itv.local:9200","pdbs249.grn.prd.itv.local:9200","pdbs287.grn.prd.itv.local:9200"]
                        document_id => "%{[@metadata][fingerprint]}"
                        index => "%{[@metadata][indexbasename]}-%{[@metadata][version]}-%{+YYYY.MM.dd}"
                        sniffing => false
                }
        }

Will see if, after the log rollover, it creates nice correct indices again. If so, I can add the field to all filters and remove the separate output plugins.

However, I think that is not the definitive solution, since I have 5 servers running now and it is still slow as ***.

That is how to do it, but as you say it may not be the only issue. It will, however, not hurt performance.
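For completeness, the same metadata-driven pattern extends to your other log types; e.g. in the userprofile filter block it would look something like this (a sketch, mirroring the field you already added for agl-api):

```
### Set the index base name in metadata so one output plugin can serve all types.
mutate {
        add_field => { "[@metadata][indexbasename]" => "userprofile-api" }
}
```

Once every filter sets `[@metadata][indexbasename]`, the whole conditional output block collapses into a single elasticsearch output.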

I have 3 Logstashes. Each will then go back to 1 output plugin. If I supply multiple hosts, does it just pick the 1st one, or is this load-balanced? Does it make sense to name all data nodes there, or is it better to give each of the 3 Logstashes 1 different data node to send data to?

OK,

At this moment all 32 log-supplying machines for this server type are running again. I added them in phases, 2 to 6 servers at a time, had them indexed, and then added the next batch. Graphs of 1 of the servers during these hours:

The output config of Logstash was changed as above. I have 3 Logstashes running again; the only difference between the 3 is the order of the servers in the hosts list (I could not find out whether this does anything, or how the output plugin handles those 6 nodes anyway).

Well, all 32 nodes are shipping logs again. So far so good. I gradually added them and this night apparently went OK. Today I will start to gradually add other services as well.

Unfortunately, after I restarted log shipping from our Varnish caches earlier today, things started to break again this evening after 20:00. (Evenings between 18:00 and 20:00 are our busiest hours.)

So I switched off the Filebeats of the caches again, to see if things for the first service recover after the log rollover tomorrow.

I made screenshots of the advanced pages of my ES nodes.
The first is just an illustration: this graph shows the number of Filebeats recognized in the index. You see them gradually decrease.
The 2nd and 3rd are the indices and overview screens under monitoring. The others are the advanced pages under monitoring for my 6 data nodes. I don't really see anything strange besides the latency graph?