I'm hitting the two-minute timeout on Elasticsearch when trying to query just the past 12 hours of data, and I'm curious what I can do to increase query speed. I have essentially unlimited hardware available for additional data, master, or client nodes. However, with the current setup none of these nodes are getting hit very hard, yet the cluster still times out.
We have two clusters, each producing a daily index. Daily Index Stats:
171M docs
500B/doc
72GB
2 primaries, 2 replicas: 6 shards in total
6 data nodes, 3 masters per cluster
2 clients in one cluster - querying both the local and remote clusters.
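For reference, something like the following _cat calls would confirm the layout above (the logs-* pattern and localhost:9200 are just placeholders for our actual daily index names and client node):

    # Per-index shard counts and on-disk sizes (primary vs. total)
    curl -s 'localhost:9200/_cat/indices/logs-*?v&h=index,pri,rep,docs.count,pri.store.size,store.size'

    # Per-shard view, including which data node each shard lives on
    curl -s 'localhost:9200/_cat/shards/logs-*?v&h=index,shard,prirep,docs,store,node'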
Data/Client Node config:
32 cores
31 GB RAM (verified to still have zero-based compressed oops), running in Docker
256 GB RAM on the physical server
Data nodes additionally have node.ingest: true set in order to enable monitoring.
Happy to share additional specific configs.
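For a quick picture of node load, something like this _cat/nodes call is what I've been going by (placeholder host again):

    # Per-node snapshot: roles, heap, RAM, CPU, load, disk usage
    curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,load_1m,disk.used_percent'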
Do I need to add additional client nodes? I can't imagine I'd need to add more data nodes? I'm okay with slow queries - it's somewhat expected - but it feels slow enough that something must be configured incorrectly.
What kind of queries are you running? I can see that you have quite a few non-standard settings in your configuration and wonder if that could be related?
How many indices and shards do your queries target? How many concurrent queries do you need to support? What level of concurrent indexing is taking place? Is that 72GB the size of the primary shards for each of the daily indices or all of them?
I can see that you have quite a few non-standard settings in your configuration and wonder if that could be related?
Unfortunately, we were already encountering slow query speeds prior to modifying the thread_pool configs. I'm happy to change them back, but we're currently hitting a 120s timeout that makes benchmarking extremely difficult.
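If it helps, a check along these lines would show whether the search thread pools are actually queuing or rejecting anything (placeholder host):

    # Queued/rejected work in the search and write thread pools, per node
    curl -s 'localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed'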
How many indices and shards do your queries target?
The aforementioned query targets about 14 daily indices across the two clusters - roughly 281 shards being queried. A typical dashboard runs about 3-5 concurrent queries. We don't expect more than 1 or 2 users to be on the system at a time.
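Stripped down, the dashboard queries amount to a cross-cluster search roughly like this (the logs-* pattern, the cluster_two alias, and the @timestamp field are stand-ins for our real names):

    # Simplified version of the 12-hour query, fanned out to both clusters
    curl -s -H 'Content-Type: application/json' 'localhost:9200/logs-*,cluster_two:logs-*/_search?pretty' -d '
    {
      "size": 0,
      "query": {
        "range": { "@timestamp": { "gte": "now-12h" } }
      }
    }'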
What level of concurrent indexing is taking place?
Mostly just a single index is being written to, at about 4k events per second.
Is that 72GB the size of the primary shards for each of the daily indices or all of them?
72GB is the size of just the primary shards. We're considering increasing to 3 primary shards / 1 replica due to the size. Thoughts? Do I need to take into consideration that this is over ~468M docs?
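If we go that route, the change would look something like this index template (template name and index pattern are made up; the settings are the 3 primaries / 1 replica mentioned above):

    # Apply 3 primaries / 1 replica to future daily indices via a template
    curl -s -X PUT -H 'Content-Type: application/json' 'localhost:9200/_template/daily-logs' -d '
    {
      "index_patterns": ["logs-*"],
      "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
      }
    }'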
What is your retention period?
Retention period is 90 days; however, we're looking at ways to roll up the data after 1 month, simply due to query time.
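The rollup job we have in mind is roughly the sketch below - index pattern, field names, and interval are all placeholders, and on our pre-7.4 version the date_histogram group may expect interval instead of fixed_interval:

    # Sketch of an hourly rollup job; field names are illustrative only
    curl -s -X PUT -H 'Content-Type: application/json' 'localhost:9200/_rollup/job/daily_logs_rollup' -d '
    {
      "index_pattern": "logs-*",
      "rollup_index": "logs_rollup",
      "cron": "0 0 * * * ?",
      "page_size": 1000,
      "groups": {
        "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
      },
      "metrics": [
        { "field": "bytes", "metrics": ["sum", "avg", "max"] }
      ]
    }'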
We are using SSDs in a JBOD configuration. I can do some testing to check iowait and disk utilization today.
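The plan for that check is roughly the following, run on a data node host while a slow query is in flight:

    # Watch per-device utilization and iowait (extended stats, 5s interval, 3 reports)
    iostat -x 5 3

    # Cross-check against the filesystem stats Elasticsearch reports per node
    curl -s 'localhost:9200/_nodes/stats/fs?pretty'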
The more immediate issue we're facing is that we can't run queries for longer than 2 minutes. The team that owns this cluster is pointing to a Kibana timeout (issue crossposted to the Kibana forums) that they say they can't modify until 7.4. Does that sound correct? I believe I've run >2-minute queries on previous Elastic Stack deployments without issue.
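To test whether the 2-minute limit is Kibana's (its elasticsearch.requestTimeout setting) or something in front of Elasticsearch, I'm planning to run the same query directly with curl and a long client-side timeout - a rough sketch, with the index pattern and field name assumed:

    # If this runs past 120s, the limit is in Kibana or a proxy, not Elasticsearch
    curl -s --max-time 600 -H 'Content-Type: application/json' 'localhost:9200/logs-*/_search?pretty' -d '
    {
      "size": 0,
      "query": { "range": { "@timestamp": { "gte": "now-12h" } } }
    }'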
@Christian_Dahlqvist It's interesting - I'm not seeing any reads on the data nodes or the client node when I run a large query. However, writes (and indexing) are occurring just fine.
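While the query is running, something like the following should show where the time is actually going (placeholder host):

    # List in-flight search tasks and how long they have been running
    curl -s 'localhost:9200/_tasks?actions=*search*&detailed=true&pretty'

    # Hot threads across the nodes, to see whether the data nodes are doing any search work at all
    curl -s 'localhost:9200/_nodes/hot_threads?threads=5'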
Is the cluster state still a big deal in newer versions of ES? For one cluster the cluster state is 20MB; for the other it's probably double that (still curling it to a file to determine the overall size). Furthermore, when I try to run GET _cluster/state, it breaks the dev console every time.
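In case it's useful, this is a way to size the cluster state without going through the dev console (placeholder host):

    # Total serialized cluster state size in bytes
    curl -s 'localhost:9200/_cluster/state' | wc -c

    # Just the metadata section, which is usually the bulk of it
    curl -s 'localhost:9200/_cluster/state/metadata' | wc -c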