Use case: We have a few indexes which hold huge amounts of data and are still growing. We need to figure out a way to optimize the search time. We are thinking of using ILM to manage the indexes, but we are facing a few roadblocks and are hoping we can get help here.
Is there a way we can be notified when an index is created through ILM? Maybe through a webhook. We need this to keep a list of indexes and when they were created, so that whenever we need to update an existing document we can update it directly in that particular index rather than searching all the indexes. If not, let us know what the best way to achieve this is.
What's the best way to optimize the search for a time filter? We don't want to query all the indexes unnecessarily. We are thinking of keeping track of indexes along with created time and filtering on our end to search on the particular indexes which fall under the time filter.
We are using Elasticsearch 6.8 currently.
Also, if things are possible in the latest version, do let us know.
Can you provide us with some specific numbers? How many indices are there, how big is each index, and how many shards does each index have? There are recommended average shard sizes, and I want to make sure we are in the same ballpark. As far as number 2 goes, Elastic does a great job of knowing which indices contain what data, and you usually don't have to worry about tailoring your search query at the index level. Having shards that are too large (in the hundreds of gigabytes) is where you usually run into trouble. With all that said, it really depends on the data and cluster, so without specifics it is hard to give recommendations. Also, like the bot said, you should upgrade the cluster at your earliest convenience. There have been lots of improvements between 6.8 and 8.6 that you will probably make use of.
We have 2 indexes above 100 GB and 6 indexes in the range of 50 GB to 100 GB. Each of them has 2 shards. How does Elastic figure out which indices have what data?
Elasticsearch hashes a routing value and takes the result modulo the number of primary shards to decide which shard a document should go to.
By default the routing value is the id of the document.
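To illustrate the routing idea above, here is a minimal sketch. It is not Elasticsearch's actual implementation: the function name `pick_shard` is hypothetical, and CRC32 is only a stand-in for the hash Elasticsearch uses internally. Note that this decides the shard within one index, not which index a document lives in.

```python
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a shard number (illustrative only).

    ASSUMPTION: CRC32 is a stand-in hash chosen to keep this sketch
    deterministic and stdlib-only; Elasticsearch uses its own hash.
    """
    if num_primary_shards < 1:
        raise ValueError("num_primary_shards must be at least 1")
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % num_primary_shards


# By default the routing value is the document id.
shard = pick_shard("my-document-id", 2)
print(f"document goes to shard {shard}")
```

The same routing value always lands on the same shard, which is why a lookup by id only needs to visit one shard of the index.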
@dadoonet This should be for creating documents, correct?
I am looking to optimize my searches. I have a time-based query and was thinking of narrowing the search by specifying indexes on my end. I wanted to know whether Elastic uses any method to decide which indexes or shards to look at while searching. If yes, how does it do that?
Elasticsearch has had many improvements in search, performance, resource usage and many other aspects in the last couple of years. It would be very hard to explain everything that changed, but you may read the release notes for every version since 6.8 if you want to know what was changed and added.
But in your example, you do not need to specify the exact index that may hold your data while searching. Elasticsearch is already able to do that, and from version 7.16 this got even better.
Take a look at this blog post, which has an example of some of the improvements made.
You should look into upgrading as soon as possible. First you will need to upgrade to the latest 7.17 release, and then you will be able to upgrade to the latest 8.x version.
@leandrojmp Okay. Thank you
Is there a way to know when ILM creates an index rather than querying the alias details at an interval? We need this to update the old documents.
I do not see the point with ILM in this scenario. It was designed to handle immutable data, which is not what you have.
If you have a timestamp associated with every document that you can access when updating I would use "traditional" time based indices where the time period covered, e.g. a day or a month, is specified as part of the index name.
If you use rollover you have the same problem as with ILM (which uses rollover behind the scenes): you do not know which index the data resides in. Instead, create monthly indices with the year and month in the name, e.g. index-2023.02. If you have the timestamp of a document elsewhere, you know exactly which index to update based on this.
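As a rough sketch of the naming scheme described above, assuming the `index-YYYY.MM` pattern from the example: the function names and the `prefix` parameter are hypothetical. The second helper lists the monthly indices that overlap a time filter, so that a search only targets those.

```python
from datetime import datetime
from typing import List


def index_for(ts: datetime, prefix: str = "index") -> str:
    """Derive the monthly index name for a document timestamp."""
    return f"{prefix}-{ts:%Y.%m}"


def indices_for_range(start: datetime, end: datetime,
                      prefix: str = "index") -> List[str]:
    """List every monthly index name between start and end, inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return names


print(index_for(datetime(2023, 2, 14)))
print(",".join(indices_for_range(datetime(2022, 11, 5), datetime(2023, 2, 1))))
```

The comma-separated output of the second call can be used as the index list in a search request, so no separate record of index creation times is needed.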
For that exact reason, I am trying to figure out a way in which I can keep a mapping on my own end of index name and creation time while using ILM. Is there a way to find this mapping apart from calling ILM/Rollover explain API at a fixed interval?
I am trying to avoid index creation on my own end as this will add an overhead on the system when I am trying to create a document.
I do not understand what the issue is. The way I described is how time-based indices were managed for years before rollover came into the picture. You would have an index template that applied to all new indices matching a specific pattern, and you would create the index name based on the timestamp in the document when writing it. If you know what the original timestamp is for documents you want to update (kept track of outside Elasticsearch), that is all you need to create the correct index name.
If you have an index template that matches the set of time-based indices I am suggesting, you just derive the correct index name for each document based on the stored timestamp and send a bulk request to Elasticsearch. When Elasticsearch sees you are indexing into a new index, it will automatically create it. You should therefore not explicitly need to create any indices from the application.
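The bulk approach described above could look roughly like the sketch below, which only builds the newline-delimited request body and does not send it. The input document shape (`id`, `timestamp`, `source`), the `index` prefix and the function name are assumptions for illustration. The optional `doc_type` is there because 6.x clusters generally expect a type in each action line, while later versions drop it.

```python
import json
from datetime import datetime
from typing import Iterable, Optional


def bulk_body(docs: Iterable[dict], prefix: str = "index",
              doc_type: Optional[str] = None) -> str:
    """Build a bulk request body, routing each doc to its monthly index.

    ASSUMPTION: each input dict has "id", an ISO 8601 "timestamp"
    (without a trailing "Z") and a "source" dict holding the document.
    """
    lines = []
    for doc in docs:
        ts = datetime.fromisoformat(doc["timestamp"])
        meta = {"_index": f"{prefix}-{ts:%Y.%m}", "_id": doc["id"]}
        if doc_type is not None:
            meta["_type"] = doc_type  # only for clusters that still use types
        lines.append(json.dumps({"index": meta}))   # action line
        lines.append(json.dumps(doc["source"]))     # document line
    # The bulk format requires a newline after the last line.
    return "\n".join(lines) + "\n"


body = bulk_body([
    {"id": "1", "timestamp": "2023-02-14T10:00:00", "source": {"msg": "a"}},
    {"id": "2", "timestamp": "2023-03-01T08:30:00", "source": {"msg": "b"}},
])
print(body)
```

The body would be sent to the `_bulk` endpoint. With a matching index template in place, the application never creates an index explicitly.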