First post here - hoping to get some assistance with this issue, or to be told whether this is normal behavior.
Background:
- We have a cluster of 3 nodes with 32 GB of RAM each, running on VMs / Docker. Each node has 16 GB allocated to the Java heap.
- Our ingest is about 10 GB a day into a daily index with three primary shards and 1 replica.
- Indexing seems fine, with a rate of about 500/s across all shards and a latency of around 0.7 ms.
- We have one dashboard with 18 visualisations (9 bar graphs, 9 tables) on it.
- When viewing the dashboard, the search rate and search latency can't be seen because Kibana stops responding (I know we should monitor from a separate cluster for exactly this reason).
- Rough numbers during report generation: client response times of around 25-30 s, HTTP connections around 80, client requests around 175.
Issue:
The dashboard takes about 1 minute to load fully, and there are only 10 days of data (10 x 10 GB of logs) to aggregate. We want this to be a monthly report, so if it scales linearly to about 3 minutes for 30 days of data (300 GB), that seems too long.
Observations:
When looking at Stack Monitoring, the system load on each node goes to about 4-6 during report creation.
When looking at the network waterfall for the data requests, the stalled time increases from the first visualisation to the last, i.e. the first visualisation is stalled for 9 seconds and the last for 48 seconds. The TTFB is about 15 s across the board:
Queued at: 384.48 ms
Started at: 385.46 ms
Resource Scheduling - Queueing: 0.98 ms
Connection Start - Stalled: 47.49 s
Request/Response - Request sent: 0.17 ms
Request/Response - Waiting (TTFB): 16.41 s
Request/Response - Content Download:
Firstly, is this stalled state normal behavior?
Secondly, if not, how do I determine what's causing the stalled state?
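In case it helps narrow things down, this is roughly what I can capture on the cluster side while the dashboard is loading - polling the search thread pools to see whether the requests are queueing inside Elasticsearch or held up before they ever reach it. It's only a rough sketch; the endpoint and credentials are placeholders for our setup:

```python
import time
import requests

# Placeholder endpoint/credentials for our cluster - adjust as needed.
ES = "http://localhost:9200"
AUTH = ("elastic", "changeme")

# Poll the search thread pools every 2 seconds while the dashboard loads,
# printing active threads, queued requests and rejections per node.
for _ in range(30):
    resp = requests.get(
        f"{ES}/_cat/thread_pool/search",
        params={"format": "json", "h": "node_name,active,queue,rejected"},
        auth=AUTH,
    )
    resp.raise_for_status()
    for row in resp.json():
        print(f"{row['node_name']}: active={row['active']} "
              f"queue={row['queue']} rejected={row['rejected']}")
    print("-" * 40)
    time.sleep(2)
```

My thinking is that if the search queues stay near zero for the whole time the waterfall shows the requests as stalled, the hold-up is probably on the browser/Kibana side rather than in Elasticsearch - does that sound like a reasonable way to narrow it down?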
Cheers
Rob.