New install - shard failure with 1 client

So after testing the ELK Stack in a test lab with some VMs and a bunch of clients running Winlogbeat, Filebeat and Metricbeat, I decided today to go for it and put this into my live environment.

I configured the stack exactly the same as I did in my lab, using the same notes that let me stand it up 3 times before and everything seemed to work. I put 1 client in with Winlogbeat and it looks fine.

BUT, after an hour I tried some searches and I get shard errors? The only difference in my config is that this new ELK server (all in one) is higher spec (4 cores, 8 GB RAM, and I split the data/logs onto a separate mount point). Any ideas why?

Index: winlogbeat-2016.02.20 Shard: 1 Reason: {"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7@4d65a7e6 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4750cd47[Running, pool size = 7, active threads = 5, queued tasks = 995, completed tasks = 31245]]"}

That indicates that your cluster is overloaded: the internal queues it holds (the thread pools) are full and cannot accept any more requests.
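
To make the numbers in that rejection message easier to read, here is a minimal Python sketch that pulls the pool name, queue capacity and queued task count out of the reason string. The function name `parse_rejection` and the regular expression are illustrative assumptions based only on the message quoted above; they are not part of any Elasticsearch client.

```python
import re

# Matches the executor details inside an es_rejected_execution_exception
# reason string. The pattern is an assumption derived from the one message
# quoted in this thread; other versions may format it differently.
PATTERN = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+), completed tasks = (?P<completed>\d+)"
)


def parse_rejection(reason: str) -> dict:
    """Extract thread pool figures from a rejection reason (hypothetical helper)."""
    match = PATTERN.search(reason)
    if match is None:
        raise ValueError("not a thread pool rejection message")
    # Keep the pool name as text, convert every counter to an integer.
    info = {k: (v if k == "pool" else int(v)) for k, v in match.groupdict().items()}
    # Fraction of the queue that was occupied when the request was rejected.
    info["queue_fill"] = info["queued"] / info["capacity"]
    return info


reason = (
    "rejected execution of org.elasticsearch.transport.TransportService$7@4d65a7e6 "
    "on EsThreadPoolExecutor[search, queue capacity = 1000, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4750cd47"
    "[Running, pool size = 7, active threads = 5, queued tasks = 995, "
    "completed tasks = 31245]]"
)
info = parse_rejection(reason)
print(info["pool"], info["queued"], info["capacity"])  # search 995 1000
```

For the message above this reports the `search` pool with 995 of 1000 queue slots in use, which is why further requests were being rejected.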

Thanks, yeah I couldn't understand why this would happen with a single client, but I may have just found it.

Even though I only had 1 client reporting in, I realized I forgot to set the winlogbeat.yml file to only read the last 24hrs, so it pulled in over 2yrs of logs in one go!!! So I stopped that, changed it to 24hrs, removed the resume file and flushed the Elasticsearch data with curl -XDELETE 'http://localhost:9200/*'
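
A side note on that delete: `/*` matches every index on the node, not just the Winlogbeat ones. Below is a rough Python sketch of a narrower delete. It assumes the indices follow the `winlogbeat-*` naming seen in the error above; `build_delete_request` is a made-up helper name, and the request is only built here, not sent.

```python
import urllib.request


def build_delete_request(base_url: str, index_pattern: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE for one index pattern (hypothetical helper)."""
    # Refuse patterns that would match every index on the cluster.
    if index_pattern.strip() in ("", "*", "_all"):
        raise ValueError("refusing to delete every index")
    url = f"{base_url.rstrip('/')}/{index_pattern}"
    return urllib.request.Request(url, method="DELETE")


# Assumed values: the local node from the curl command and the index prefix
# taken from the error message (winlogbeat-2016.02.20).
req = build_delete_request("http://localhost:9200", "winlogbeat-*")
print(req.get_method(), req.full_url)

# To actually run the delete against a live node:
# urllib.request.urlopen(req)
```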

It's now only got 24hrs of data and it's not throwing errors. I guess this was just because it was too much info to ingest in one go?

Are there any docs that give recommended specs for this sort of thing? The place I work has around 250 servers that I want to ingest Wintel event logs and Linux secure/messages logs from, plus maybe VMware hosts and Cisco switches.

The error refers to the search threadpool, so it's unlikely that ingestion is what is causing this.

What are the specs for the cluster now? How many nodes etc?
