Improving Speed to Query Millions of Small Documents

ES, Kibana Version 7.3.1

I'm hitting the two-minute timeout in Elasticsearch when trying to query just the past 12 hours of data, and I'm curious what I can do to increase query speed. I have essentially unlimited hardware available for additional data, master, or client nodes; however, with the current setup none of these nodes are being hit very hard, yet the cluster still times out.

We have two clusters, each producing a daily index.
Daily Index Stats:
171M docs
500B/doc
72GB
2 primaries, 2 replicas: 6 shards in total
6 data nodes, 3 masters per cluster

2 client nodes in one cluster, querying both the local and remote clusters.
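For context on how queries fan out: the client nodes use cross-cluster search, so a single dashboard request hits shards in both clusters. A minimal sketch of that request shape (the elastiflow-* pattern and the cluster_b alias are placeholders, not our real names):

GET elastiflow-*,cluster_b:elastiflow-*/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-12h" } }
  }
}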

Data/Client Node config:
32 cores
31 GB RAM (verified to still have zero-based compressed oops) in Docker
256 GB RAM on the physical server

Master Node config:
32 cores
8 GB RAM

Data/Client/Master Node modified settings:

bootstrap.memory_lock: true
network.host: 0.0.0.0
http.host: localhost
http.max_header_size: 32kB
gateway.recover_after_master_nodes: 2
action.destructive_requires_name: true

indices.query.bool.max_clause_count: 8192
search.max_buckets: 100000

thread_pool.write.queue_size: 2500
thread_pool.search.queue_size: 4000
thread_pool.search.min_queue_size: 4000
thread_pool.search.max_queue_size: 10000
thread_pool.search.target_response_time: 15s

reindex.remote.whitelist: ["*.*.*.*:*"]
script.painless.regex.enabled: false

xpack.ml.enabled: false
xpack.monitoring.collection.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true
xpack.watcher.enabled: false

Data nodes additionally have node.ingest: true set in order to enable monitoring.

Happy to share additional specific configs.
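One quick check, in case it's useful: whether the search thread pools are actually backing up while a dashboard loads. This uses the standard _cat API; the column list is just the fields of interest:

GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed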

Do I need to add additional client nodes? I can't imagine I'd need more data nodes. I'm okay with queries being somewhat slow - that's expected - but it feels slow enough that something must be configured incorrectly.

What kind of queries are you running? I can see that you have quite a few non-standard settings in your configuration and wonder if that could be related?

How many indices and shards do your queries target? How many concurrent queries do you need to support? What level of concurrent indexing is taking place? Is that 72GB the size of the primary shards for each of the daily indices or all of them?

What is your retention period?

What kind of queries are you running?

I am analyzing NetFlow data using ElastiFlow dashboards in Kibana. Example query.
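(The example query itself isn't reproduced here. As a rough illustration only, the dashboard panels boil down to a date_histogram over @timestamp with nested terms and sum aggregations over the flow fields; the field names below are hypothetical placeholders, not the actual ElastiFlow mapping:)

GET elastiflow-*/_search
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-12h" } } },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "30m" },
      "aggs": {
        "top_sources": {
          "terms": { "field": "flow.src_addr", "size": 10 },
          "aggs": { "bytes": { "sum": { "field": "flow.bytes" } } }
        }
      }
    }
  }
}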

I can see that you have quite a few non-standard settings in your configuration and wonder if that could be related?

Unfortunately, we were encountering slow query speeds prior to modifying the thread_pool configs. I'm happy to change them back, but we're currently hitting the 120s timeout, which makes benchmarking extremely difficult.

How many indices and shards do your queries target?

The aforementioned query targets about 14 daily indices across the two clusters, which works out to roughly 281 shards being queried. A typical dashboard runs about 3-5 concurrent queries. We don't expect more than 1 or 2 users to be on the system at a time.
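A quick way to sanity-check that fan-out is to list the shards the pattern resolves to and their sizes on each cluster (elastiflow-* is a placeholder for the actual index pattern):

GET _cat/shards/elastiflow-*?v&h=index,shard,prirep,store,node&s=index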

What level of concurrent indexing is taking place?

Mostly just a single index is being written to, at about 4k events per second.

Is that 72GB the size of the primary shards for each of the daily indices or all of them?

72GB is the size of just the primary shards. I'm considering moving to 3 primary shards / 1 replica due to the size. Thoughts? Do I need to take into account that this spans ~468M docs?
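If we do go to 3 primaries / 1 replica, that change would only apply to newly created daily indices via the index template. A minimal sketch, with the template name and pattern as placeholders:

PUT _template/elastiflow
{
  "index_patterns": ["elastiflow-*"],
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}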

What is your retention period?

Retention period is 90 days; however, we're looking at ways to roll up the data after 1 month, simply due to query time.

What kind of storage are you using? Local SSDs? Have you monitored disk utilisation and iowait when you are querying?

We are using SSDs in a JBOD configuration. I can do some testing to check iowait and disk utilization today.
The more immediate issue we're facing is that we can't run queries for longer than 2 minutes. The team that owns this cluster is pointing to a Kibana timeout (issue cross-posted to the Kibana forums) that they say they can't modify until 7.4. Does that sound correct? I believe I've run queries longer than 2 minutes on previous Elastic stacks without issue.
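For what it's worth, the Kibana-side setting I had in mind is elasticsearch.requestTimeout in kibana.yml (in milliseconds, default 30000), which governs how long Kibana waits on Elasticsearch before giving up. A sketch of what I'd ask them to try, assuming it does apply to our 7.3 setup:

elasticsearch.requestTimeout: 300000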

@Christian_Dahlqvist It's interesting - I'm not seeing any reads on the data nodes or the client node when I run a large query. However, writes (and indexing) are happening just fine:

$ sudo iotop -n 1 -b -o | awk '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11 $12}'
Total DISK READ : 0.00 B/s | Total DISK WRITE : 3.57 M/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 2.47 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
4746 be/3 root 0.00 B/s 0.00 B/s 0.00 % 2.21 %[jbd2/sdb3-8]
127195 be/4 udocker 0.00 B/s 310.77 K/s 0.00 % 0.44 %java
127196 be/4 udocker 0.00 B/s 1204.23 K/s 0.00 % 0.33 %java
126888 be/4 udocker 0.00 B/s 38.85 K/s 0.00 % 0.22 %java
126891 be/4 udocker 0.00 B/s 38.85 K/s 0.00 % 0.22 %java
126889 be/4 udocker 0.00 B/s 38.85 K/s 0.00 % 0.22 %java
127172 be/4 udocker 0.00 B/s 77.69 K/s 0.00 % 0.14 %java
127199 be/4 udocker 0.00 B/s 38.85 K/s 0.00 % 0.12 %java
125649 be/4 udocker 0.00 B/s 1903.46 K/s 0.00 % 0.00 %nginx:
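For the device-level utilisation and iowait check mentioned above (rather than per-process I/O), something like this on each data node while a heavy query runs, using sysstat's iostat with extended stats at 5-second intervals:

iostat -x 5 3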

Is a large cluster state still a big deal in newer versions of ES? For one cluster the cluster state is 20MB; for the other it's probably double that (I'm still curling it to a file to determine the overall size). Furthermore, when I try to run GET _cluster/state, it breaks the Dev Tools console every time.
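The approach for sizing the cluster state, since the Dev Tools console chokes on the full response, is to curl it to a file and check the size on disk, or to request only a subset of the state metrics (host/port here assume querying a node locally; adjust as needed):

curl -s 'http://localhost:9200/_cluster/state' -o cluster_state.json
du -h cluster_state.json
curl -s 'http://localhost:9200/_cluster/state/version,master_node,nodes?pretty'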
