Finding which indices contain hits when using an alias in an Elasticsearch query

I have thousands of day-wise (time-based) indices in my cluster. Is there a way to monitor, or find out on the ES server side, which indices return data when a query uses an alias? For example, consider the query below that my customer fires:

GET index_alias/_search?q=Product_Serial_Num:ABCD

The alias index_alias is mapped to all 1200 day-wise indices in the ES 6.8.6 cluster, because the query is used to fetch historical records for that Product_Serial_Num and it is not possible to know beforehand which day-wise index holds the data. When the customer fires the above query, I would like to monitor it on the ES server side and find out which indices actually return data. Is there a way to determine this without modifying anything at the customer end?

The purpose is to figure out which indices are hit the most, e.g. the last n days (30, 60, 90), and then apply a hot-warm architecture accordingly, such as keeping the last 60 days of indices on hot nodes and the rest on warm nodes.

Or if there are better ways to determine, please do share.

Thanks.

Might I suggest that thousands of indices is usually not a good idea. I hope you have lots of heap in your master :)

Why not use index stats to watch your indices and maybe shards, and use that for hot/warm? Why do you care about the shard anyway? Any hot-warm setup will be index-level, and most loads are balanced across shards (though not all).

I can't see how you can get much data on a query if you cannot modify it.

Thanks @Steve_Mushero for the answer. I've edited my question and description to make it more clear.

I'm basically looking for a way to monitor the queries at the ES Server side and be able to determine which indices are being hit.

The reason for 1000s of indices is to preserve historical data but indexing happens only to today's index. Rest all indices are read-only.

I'm basically looking for a way to monitor the queries at the ES Server side and be able to determine which indices are being hit.

I would still think overall index stats would be the best way: just watch queries or results over time to see which indices are being used. I'd have to dig into the stats again for more specifics. We added some top-index features to our tools, though I don't think they are released yet (I'm not with Elastic).
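To make the "watch index stats over time" idea concrete, here is a minimal sketch, assuming you take two snapshots of the `GET index_alias/_stats/search` response some time apart and compare the per-index `query_total` counters. The function names are made up for illustration. One caveat I would check before relying on it: as far as I know, `query_total` counts query-phase executions on shards rather than documents returned, so a search that fans out through an alias to every index may bump all of them equally.

```python
from typing import Dict, List, Tuple


def search_counts(stats: dict) -> Dict[str, int]:
    """Pull query_total per index out of an index stats response."""
    return {
        name: body["total"]["search"]["query_total"]
        for name, body in stats.get("indices", {}).items()
    }


def rank_by_delta(before: dict, after: dict) -> List[Tuple[str, int]]:
    """Rank indices by how much query_total grew between two snapshots.

    Indices missing from the first snapshot are treated as starting at 0.
    Ties are broken by index name so the order is stable.
    """
    b, a = search_counts(before), search_counts(after)
    deltas = {name: a[name] - b.get(name, 0) for name in a}
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))
```

Feeding it a snapshot from the start and end of a day would give a per-index ranking for that day, which could then be bucketed into last 30/60/90 days.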

The reason for 1000s of indices is to preserve historical data but indexing happens only to today's index. Rest all indices are read-only.

Still a lot and usually people run into Master Heap and other issues as this grows.

Yes. I was thinking to monitor the results of the queries. Would appreciate some recommendations here.

Agreed, which is why I want to implement a hot-warm architecture and then split the alias into hot and warm aliases, i.e. the hot alias will hit only hot indices.

Please read this blog post, as having a large number of small indices and shards can be very inefficient and slow. It is not only heap usage that is affected: the cluster state will grow and get slower to update and propagate, which can also cause a lot of problems.


Hey @Christian_Dahlqvist - yes, I'm fully aware of that blog post and have read it. In my case, each of the day-wise indices is ~30 to 35 GB in size with 1 shard and 1 replica. So far, I don't have the issue of too many small shards.

Since I want to implement hot/warm architecture, I'm looking to find the recent "n" indices that should be kept on hot nodes, and to mark the remaining indices as read-only and move them to warm nodes.

There are no indexing issues as of now since indexing only happens to today's index. Rest all indices are RO. However, the search is slow because the customer uses alias and hence it hits all indices (shards). If I'm able to determine the recent "n" indices, then I can map the alias to only those "n" indices and create another alias that will map to warm indices.
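Once "n" is known, the alias split described above could be expressed as a single `POST /_aliases` request. Below is a rough sketch that only builds the request body. It assumes the day-wise index names end in `-YYYY.MM.DD`, and the warm alias name `index_alias_warm` is a placeholder I made up, so adjust both to your setup.

```python
from datetime import datetime
from typing import Dict, List


def index_date(name: str) -> datetime:
    """Parse the date suffix of a day-wise index name (assumed -YYYY.MM.DD)."""
    return datetime.strptime(name.rsplit("-", 1)[1], "%Y.%m.%d")


def build_alias_actions(
    indices: List[str],
    n_hot: int,
    hot_alias: str = "index_alias",
    warm_alias: str = "index_alias_warm",  # hypothetical name
) -> Dict[str, list]:
    """Build a body for POST /_aliases that keeps the n_hot newest indices
    on hot_alias and moves every older index to warm_alias."""
    if n_hot < 0:
        raise ValueError("n_hot must be >= 0")
    newest_first = sorted(indices, key=index_date, reverse=True)
    actions = []
    for name in newest_first[n_hot:]:
        actions.append({"remove": {"index": name, "alias": hot_alias}})
        actions.append({"add": {"index": name, "alias": warm_alias}})
    return {"actions": actions}
```

All actions in one `_aliases` request are applied together, so searches should not see a moment where an older index is in neither alias.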

How many indices are you indexing into in parallel? What is your retention period? How many aliases do you have and how are these aligned with the indices?

I index into only 1 index at a time, which is today's index, e.g. foo_index-2020.09.14. The retention period is 7 years because these are kept for historical purposes.

I have a single alias foo_alias that maps to 1200 day-wise indices covering the last 4 years. The customer fires a query like GET foo_alias/_search?q=fieldname:value that currently hits all indices.

I plan to introduce a hot-warm architecture and change foo_alias to point to the n most recent indices. The remaining indices will be mapped to a new alias, foo_alias_warm. If the customer fires a query GET foo_alias/_search?q=fieldname:value, it will hit only the n recent indices. If the number of hits is 0, then another query GET foo_alias_warm/_search?q=fieldname:value will get fired.
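The two-step lookup in this plan could look roughly like the sketch below. The `search` argument is a stand-in for whatever client call actually runs `GET <alias>/_search?q=...` and returns the parsed JSON; it is an assumption for illustration, not a real client API. `hits.total` is a plain integer in 6.8; the helper also accepts the object form used by later versions.

```python
from typing import Callable, Tuple

# Hypothetical signature: (alias, query_string) -> parsed _search response
SearchFn = Callable[[str, str], dict]


def total_hits(response: dict) -> int:
    """Read hits.total, which is an int in 6.x and an object in 7.x+."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def search_with_fallback(
    search: SearchFn,
    query: str,
    hot_alias: str = "foo_alias",
    warm_alias: str = "foo_alias_warm",
) -> Tuple[str, dict]:
    """Query the hot alias first; query the warm alias only on zero hits.

    Returns the alias that produced the final response together with it.
    """
    response = search(hot_alias, query)
    if total_hits(response) > 0:
        return hot_alias, response
    return warm_alias, search(warm_alias, query)
```

Note that this only falls back when the hot alias returns nothing at all, so a serial number with records in both tiers would get only its recent records back.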

Hi, I'm not sure if I understand your question correctly, but if you just want to find out which indices returned hits, just use a terms aggregation on the "_index" field. Something like:

"aggs": {
  "groupbyindex": {
    "terms": { "field": "_index" }
  }
}

Then, by using "doc_count", you can see which indices are being hit the most.
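For what it's worth, here is a small sketch of building that request body and reading the buckets back. The helper names are mine. One thing to watch: the terms aggregation returns only 10 buckets by default, so with many indices the "size" needs raising; 1200 below is simply the index count mentioned earlier in the thread.

```python
from typing import List, Tuple


def build_body(agg_name: str = "groupbyindex", max_indices: int = 1200) -> dict:
    """Search body that returns no documents, only a per-index hit count."""
    return {
        "size": 0,  # skip the hits themselves, we only want the buckets
        "aggs": {
            agg_name: {"terms": {"field": "_index", "size": max_indices}}
        },
    }


def hits_per_index(
    response: dict, agg_name: str = "groupbyindex"
) -> List[Tuple[str, int]]:
    """Return (index, doc_count) pairs, highest count first."""
    buckets = response["aggregations"][agg_name]["buckets"]
    pairs = [(b["key"], b["doc_count"]) for b in buckets]
    return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))
```

The body would be sent along with the same query the customer runs, e.g. replayed from the server side, to see which day-wise indices actually hold matching documents.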

Thank you @borna_talebi for your inputs. My question is more like: If my customer is firing the query and I want to monitor this on ES server side - which indices it hits the most, then how do I go about it?
