New install - shard failure with 1 client


(Andy) #1

So after testing the ELK Stack in a test lab with some VMs and a bunch of clients running Winlogbeat, Filebeat and Metricbeat, I decided today to go for it and put this into my live environment.

I configured the stack exactly the same as I did in my lab, using the same notes that let me stand it up 3 times before and everything seemed to work. I put 1 client in with Winlogbeat and it looks fine.

BUT, after an hour I tried some searches and got shard errors. The only difference in my config is that this new all-in-one ELK server is higher spec (4 cores, 8 GB RAM, and I split the data/logs onto a separate mount point). Any ideas why?

Index: winlogbeat-2016.02.20 Shard: 1 Reason: {"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7@4d65a7e6 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4750cd47[Running, pool size = 7, active threads = 5, queued tasks = 995, completed tasks = 31245]]"}


(Mark Walkom) #2

That indicates that your cluster is overloaded: the queue of one of its internal thread pools is full, so it is rejecting any further requests.
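
One way to check for full queues is to look at the per-node thread pool statistics. The following is only a minimal sketch: it assumes Elasticsearch is reachable at http://localhost:9200 (the address used elsewhere in this thread), the helper names are made up for illustration, and the `sample` dictionary is hand-written in the general shape of a `_nodes/stats/thread_pool` response rather than captured from a real cluster.

```python
import json
from urllib.request import urlopen


def fetch_thread_pool_stats(base_url="http://localhost:9200"):
    """Fetch per-node thread pool stats (requires a running cluster)."""
    with urlopen(base_url + "/_nodes/stats/thread_pool") as resp:
        return json.load(resp)


def busy_pools(stats):
    """Return (node, pool, active, queue, rejected) for every pool that
    currently has queued tasks or has rejected tasks in the past."""
    rows = []
    for node in stats["nodes"].values():
        for pool, s in node["thread_pool"].items():
            queue = s.get("queue", 0)
            rejected = s.get("rejected", 0)
            if queue or rejected:
                rows.append((node["name"], pool, s.get("active", 0), queue, rejected))
    return rows


# Hand-written sample for illustration only; the numbers are invented.
sample = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "thread_pool": {
                "search": {"threads": 7, "queue": 995, "active": 5, "rejected": 12},
                "index": {"threads": 4, "queue": 0, "active": 0, "rejected": 0},
            },
        }
    }
}

rows = busy_pools(sample)
for row in rows:
    print(row)
```

Against a live node you would call `busy_pools(fetch_thread_pool_stats())` instead of passing the sample. A non-zero `rejected` count for a pool means requests to that pool have been turned away at some point.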


(Andy) #3

Thanks, yeah I couldn't understand why this would happen with a single client, but I may have just found it.

Even though I only had 1 client reporting in, I realized I forgot to set the winlogbeat.yml file to only read the last 24hrs, so it pulled in over 2yrs of logs in one go!!! So I stopped that, changed it to 24hrs, removed the resume file and flushed the Elasticsearch data with curl -XDELETE 'http://localhost:9200/*'

It's now only got 24hrs of data and it's not throwing errors. I guess this was just because it was too much info to ingest in one go?

Are there any docs that give recommended specs for this sort of thing? The place I work has around 250 servers that I want to ingest for Wintel EventLogs and Linux Secure/Messages, plus maybe VMware hosts and Cisco switches.


(Mark Walkom) #4

The error refers to the search threadpool, so it's unlikely that ingestion is what's impacting this.

What are the specs for the cluster now? How many nodes etc?
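
The pool name can be read straight off the rejection message from the first post. As a rough sketch (the regular expression and the `parse_rejection` helper are written just for this illustration and are not part of any Elasticsearch tooling), the relevant numbers can be pulled out like this:

```python
import re

# Matches the executor description that Elasticsearch embeds in the
# rejection reason, as seen in the error quoted in post #1.
REJECTION = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+), completed tasks = (?P<completed>\d+)"
)


def parse_rejection(reason):
    """Return the pool name and its counters, or None if nothing matches."""
    m = REJECTION.search(reason)
    if m is None:
        return None
    info = {k: int(v) for k, v in m.groupdict().items() if k != "pool"}
    info["pool"] = m.group("pool")
    return info


# The "reason" text copied from the shard failure in post #1.
reason = (
    "rejected execution of org.elasticsearch.transport.TransportService$7@4d65a7e6 "
    "on EsThreadPoolExecutor[search, queue capacity = 1000, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4750cd47"
    "[Running, pool size = 7, active threads = 5, queued tasks = 995, "
    "completed tasks = 31245]]"
)

info = parse_rejection(reason)
print(info["pool"], info["queued"], "of", info["capacity"], "queue slots used")
```

For this message the pool is `search` and 995 of the 1000 queue slots were occupied when the snapshot was taken, which points at search requests piling up rather than at the indexing side.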

