Change thread pool search queue_size? Yes or no?

I have a question. I'm working on my Elastic Stack; it's basically a "developing production" setup.

I have one server with 16 GB of RAM, 4 CPUs, and ELK 5.4.0. No cluster or extra nodes.

I have no problems with indexing data and searching; the problems start with dashboards that contain multiple visualizations.

I figure the problem is probably the CPU?

As development progressed, the document volume also increased, since I try to automate as much as possible. Indices are created daily with around 5-6 million documents per index, so with a few months of data I'm already at several hundred million documents.

So when I open a dashboard with multiple visualizations, I receive a timeout error in Kibana and the Elasticsearch log says:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService$7@59b0d1f1 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@395e3b7a[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 3105230]]

So I'm thinking of maybe increasing the search queue_size. Would that even make sense? What would be the optimal size in my situation? The best option would probably be to add two more nodes and create a cluster.
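For reference, my understanding is that the queue size itself is a static setting (thread_pool.search.queue_size in elasticsearch.yml, so it would need a node restart). While a dashboard loads I can watch the search thread pool with the _cat API; host and port below are just assumed local defaults:

curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed'

If the rejected count climbs as soon as the dashboard fires its queries, the queue is filling faster than the 7 search threads (sized from the 4 CPUs) can drain it.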

Thanks!

How many indices and shards are you creating per day? What is your average shard size?

1 index with 1 shard per day. The size of the data is around 2 GB per index.
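For reference, the per-index document counts and on-disk sizes can be read from the _cat APIs (host and port are assumed local defaults):

curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size'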

That sounds quite reasonable. How many visualisations do you have on the affected dashboards? Do you have X-Pack Monitoring installed? What does CPU usage and disk I/O look like when you experience the timeout?

5 visualizations. Yes, X-Pack is installed. The only thing I really see out of the ordinary is the system load, which rose above 8.00.

Then Elasticsearch just quit/was killed for a few minutes.

So my assumption is a lack of CPU?

What is your heap size? What does the GC graph look like in Monitoring?

-Xms8g
-Xmx8g

The young GC count jumped from 8 to 23, the duration from 230 ms to 781 ms, Cgroup CPU utilization from 26% to 66%, and Cgroup usage from 35 billion ns to 53.7 billion ns.

Other graphs had no significant changes.
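In case it helps, the same heap and GC numbers can also be pulled directly from the nodes stats API, outside of X-Pack Monitoring (host and port are assumed local defaults):

curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors'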

Do you have any tips or suggestions?

tnx

How large are your indices/shards? How many are you querying across when you encounter problems?

Indices are on average around 2.5 GB with 6 million documents, created daily. The default search/overview is the last 24 hours.
I increased the CPU cores, but I still encounter timeouts, even with only 2-3 visualizations on the dashboard.
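To find out which of the dashboard queries are the slow ones, I'm considering enabling the search slow log on the indices. A sketch, with thresholds picked arbitrarily for illustration:

curl -XPUT 'localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}'

Slow queries would then show up in the search slow log together with their source, so I can see which visualization is responsible.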

What type of storage do you have? What does disk I/O and iowait look like? How many concurrent queries?

It looks like a disk read issue...
As soon as I open the dashboard, the values increase...

From:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 0.80 0.00 8.00 20.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 2.20 0.00 16.20 14.73 0.00 0.55 0.00 0.55 0.36 0.08

To:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 659.20 36.80 25.40 946.40 2744.00 118.66 0.29 4.73 4.66 4.83 1.54 9.60
sdb 0.00 0.40 3645.00 11.00 1582319.20 46.70 865.63 130.93 35.73 35.77 20.13 0.27 99.98
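For context, output in this format comes from iostat's extended device statistics, which I capture while the dashboard reloads (interval chosen arbitrarily):

# extended per-device stats every 5 seconds (requires the sysstat package); Ctrl-C to stop
iostat -x 5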

Yes, it looks like your storage is indeed the bottleneck.
