Hi,
so I tried to set up multiple Kibana instances in our self-managed environment, hoping it would improve alerting and detection performance. In total there are 4 Kibana instances: 2 were set up earlier (as services from the Debian package) on separate nodes, and I added one more instance on each of those nodes (installed from the zip archive with its own service), so currently there are 2 virtual machines, each running 2 Kibana services. Each service starts fine; the 2 instances on the same host run on different ports and have different UUIDs and server.name values. The X-Pack settings are the same on every instance:
elasticsearch.username: "xxx"
elasticsearch.password: "xxxxx"
elasticsearch.ssl.verificationMode: certificate
elasticsearch.ssl.certificateAuthorities: [ "/etc/kibana/certs/ca.crt" ]
xpack.security.enabled: true
xpack.spaces.enabled: true
xpack.security.encryptionKey: xxx
xpack.reporting.encryptionKey: xxx
xpack.encryptedSavedObjects.encryptionKey: xxx
xpack.task_manager.max_workers: 100
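For reference, the part of kibana.yml that differs between the two services on the same host looks roughly like this (the values below are illustrative, not the real ones):
# instance-specific settings; the second service on each host differs only in these
server.name: "kibana-vm1-a"
server.port: 5601
# each instance also has its own UUID (auto-generated and kept in its own data directory)
Everything else, including the encryption keys above and the Elasticsearch connection settings, is identical across all 4 instances.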
All of the instances connect to the designated controller node, which should (I think) load balance the tasks for alerting and security. From time to time I can see in the logs of every instance that they index some new signals, so the problem seems limited to alerts and monitoring. The log that bothers me is this one (it shows up randomly on any of the instances):
{"type":"log","@timestamp":"2021-03-20T16:32:53+01:00","tags":["error","plugins","alerts","plugins","alerting"],"pid":55264,"message":"Executing Alert \"7458fa4b-0d4c-4e6b-a41d-4f534336a83c\" has resulted in Error: Unauthorized to get a \"monitoring_alert_missing_monitoring_data\" alert for \"monitoring\""}
If I understand it correctly, by default the instances use the .kibana-xx index for scheduling tasks, alerts and so on, and it seems to me like they are fighting over the tasks? Or what else could be the issue here? Sometimes I also get this log from the instances that produced the previous one:
{"type":"log","@timestamp":"2021-03-20T16:57:43+01:00","tags":["info","plugins","actions","actions"],"pid":75390,"message":"Server log: Disk usage alert is firing for 3 node(s) in cluster: XXX. Verify disk usage levels across affected nodes."}
What else can I provide to troubleshoot this issue?
I also get this log when starting the instances (but not every time); I don't know whether it's something I should worry about:
{"type":"log","@timestamp":"2021-03-19T17:28:57+01:00","tags":["warning","plugins","securitySolution"],"pid":12552,"message":"Unable to verify endpoint policies in line with license change: failed to fetch package policies: missing authentication credentials for REST request [/.kibana/_search?size=100&from=0&rest_total_hits_as_int=true]: security_exception"}
Another problem that bothers me (I don't know whether it's connected to the previous issue): when one of the Elasticsearch nodes leaves the cluster (restart, upgrade, anything), I get spammed with the logs below and can't see the Kibana instances in Stack Monitoring until I restart the Kibana services (NOTE: we still use legacy monitoring).
"type":"log","@timestamp":"2021-03-19T19:50:41+01:00","tags":["warning","plugins","monitoring","monitoring","kibana-monitoring"],"pid":12552,"message":"Error: Cluster client cannot be used after it has been closed.\n at LegacyClusterClient.assertIsNotClosed (/usr/share/kibana/src/core/server/elasticsearch/legacy/cluster_client.js:195:13)\n at LegacyClusterClient.callAsInternalUser (/usr/share/kibana/src/core/server/elasticsearch/legacy/cluster_client.js:115:12)\n at sendBulkPayload (/usr/share/kibana/x-pack/plugins/monitoring/server/kibana_monitoring/lib/send_bulk_payload.js:22:18)\n at BulkUploader._onPayload (/usr/share/kibana/x-pack/plugins/monitoring/server/kibana_monitoring/bulk_uploader.js:209:43)\n at BulkUploader._fetchAndUpload (/usr/share/kibana/x-pack/plugins/monitoring/server/kibana_monitoring/bulk_uploader.js:195:20)\n at runMicrotasks (<anonymous>)\n at processTicksAndRejections (internal/process/task_queues.js:93:5)"}
{"type":"log","@timestamp":"2021-03-19T19:50:41+01:00","tags":["warning","plugins","monitoring","monitoring","kibana-monitoring"],"pid":12552,"message":"Unable to bulk upload the stats payload to the local cluster"}