Timed out while getting the index list when creating an index pattern in Kibana

We're observing a timeout while creating an index pattern in Kibana.
That means the aggregation query below takes over 30 seconds:
{"size":0,"aggs":{"indices":{"terms":{"field":"_index","size":200}}}}

Running the query above with a curl command actually took around 40 seconds.
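
For reference, the command looks roughly like this (the host is just a placeholder for one of our coordinating nodes):

curl -XGET 'http://localhost:9200/_search' -H 'Content-Type: application/json' -d '{"size":0,"aggs":{"indices":{"terms":{"field":"_index","size":200}}}}'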

The ES cluster contains 650+ TB of data across 25,000+ primary shards and constantly indexes over 1M documents/s.

As suggested in a discussion thread, it may be acceptable to increase the request timeout.
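
If we go that route, my understanding is that the relevant Kibana setting is elasticsearch.requestTimeout in kibana.yml, e.g. (the value is just an example, in milliseconds):

elasticsearch.requestTimeout: 120000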

But is there any chance to improve the query performance?

Any chance to benefit from eager_global_ordinals?
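
For context, eager_global_ordinals is normally enabled per mapped keyword field, roughly like this (index, type and field names are made up); whether it can help at all for the _index metadata field used by this aggregation is exactly what we're unsure about:

PUT logs-2019.03.04/_mapping/doc
{"properties":{"service_name":{"type":"keyword","eager_global_ordinals":true}}}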

Profile result

Shard query time average: 528 ms, Shard query time max: 7,319 ms
Rewrite time average: 2.3 us, Rewrite time max: 1,550 us
Collect time average: 1,642 ms, Collect time max: 9,589 ms
Aggregation time average: 671 ms, Aggregation time max: 16.5 s

Some TODOs

  • Try forcemerge (see the sketch after this list).
  • Increase data size per shard.
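
A sketch of the force merge call we'd try (the index name is a placeholder); as far as we understand, it should only be run on indices that are no longer being written to:

POST /logs-2019.02.25/_forcemerge?max_num_segments=1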

Versions

Elasticsearch: 6.6.1
Kibana: 6.6.1

Running the query above with a curl command actually took around 40 seconds.

Are you running the query on all indices? Or on a dedicated one?

The ES cluster contains 650+ TB of data across 25,000+ primary shards and constantly indexes over 1M documents/s.

That's huge. How many nodes do you have? How many shards in total?

Are you running the query on all indices? Or on a dedicated one?

The query runs on all 250+ indices.

How many nodes do you have? How many shards in total?

We're running a hot-warm architecture with 130 hot data nodes and 140 warm data nodes.
The number of replicas is set to 1, so the cluster contains 55,000+ shards including replicas.
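
(For context, a hot-warm split like this is typically done with node attributes and shard allocation filtering, roughly as below; the attribute and index names are only examples, not necessarily what we use.)

node.attr.box_type: hot    # elasticsearch.yml on hot data nodes
node.attr.box_type: warm   # elasticsearch.yml on warm data nodes

PUT /logs-2019.02.25/_settings
{"index.routing.allocation.require.box_type": "warm"}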

So you are running a query against 25,000 shards? Is that something you really want to do?

I mean that it takes a lot of time to gather the data (something like 200 * 25,000 = 5,000,000 documents) onto a single node. If a single document is, let's say, 1 kB, that's a huge amount of data to transfer over the network and then merge, just to get the 200 most relevant ones.

Also, when querying all data, you are querying both warm and hot nodes. I guess that the warm nodes are not using SSD drives...

So you should better control which indices you are actually querying.
If you have a logging use case, maybe you should create a weekly alias that only queries the last 7 days of data, and a monthly alias for the last 30 days...
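
For example (index and alias names are made up), a daily job could rotate such an alias with the _aliases API, dropping the oldest day and adding the newest one:

POST /_aliases
{"actions":[{"remove":{"index":"logs-2019.02.25","alias":"logs-last-7d"}},{"add":{"index":"logs-2019.03.04","alias":"logs-last-7d"}}]}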

So you should better control which indices you are actually querying.

When creating an index pattern in Kibana, Kibana sends that query over all the indices, right?
Can we control the query somehow?

In Kibana, you should use the weekly alias to define your index pattern.

But maybe that's not your question?

In Kibana, you should use the weekly alias to define your index pattern.

I understand we can use a weekly alias to reduce the number of shards a search targets.


When opening the 'create_index_pattern_wizard' in Kibana, Kibana tries to list the available indices.
To do that, it sends a terms aggregation query over all the indices.
In our deployment, that request fails due to the timeout.

At that scale I would probably recommend splitting this large cluster into a number of slightly smaller clusters instead and using cross-cluster search to query across them, especially if you are expecting the size of the cluster to grow.
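
A rough sketch of how that works (cluster alias, host and index pattern are placeholders): register each remote cluster, then prefix the index pattern with the cluster alias when searching:

PUT /_cluster/settings
{"persistent":{"cluster.remote.logs_a.seeds":["10.0.0.1:9300"]}}

GET /logs_a:logs-*/_search
{"size":0,"aggs":{"indices":{"terms":{"field":"_index","size":200}}}}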

At that scale I would probably recommend splitting this large cluster into a number of slightly smaller clusters instead and using cross-cluster search to query across them, especially if you are expecting the size of the cluster to grow.

I see. We'll consider that option.
(This is off-topic, but how can cross-cluster search help improve search performance?)

But I still hope there's something we can do with the current deployment. Any ideas?

It does reduce the size of the cluster state(s), which is generally good and makes it easier to manage, which is usually the main reason to start having multiple clusters. It also changes how requests are processed, as there is an additional step in gathering the data from the different clusters, but I am not sure how much impact that will have on request processing.

650TB over 55,000 shards comes to an average shard size of around 12GB. It might be worthwhile trying to increase this in order to reduce the overall shard count.
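
As a sketch (index names and the target shard count are placeholders), existing read-only indices could also be shrunk with the shrink API; the source index first needs to be write-blocked and have a copy of every shard allocated to a single node:

POST /logs-2019.02.25/_shrink/logs-2019.02.25-shrunk
{"settings":{"index.number_of_shards":1,"index.number_of_replicas":1}}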

It does reduce the size of the cluster state(s), which is generally good and makes it easier to manage, which is usually the main reason to start having multiple clusters. It also changes how requests are processed, as there is an additional step in gathering the data from the different clusters, but I am not sure how much impact that will have on request processing.

Thank you for your explanation!
We're also seeing some of our monitoring tools slow down with a cluster of this size.

650TB over 55,000 shards comes to an average shard size of around 12GB. It might be worthwhile trying to increase this in order to reduce the overall shard count.

Yep, I'll try.
