Dear community, I have two simple and quite basic questions regarding search performance in Elasticsearch in general:
If I only ever search documents / indices within a time window of at most the latest 24 hours, does it affect search performance if I also keep older data on disk that I never query?
Meaning: if I only search hot data (max. 24 hours), does it matter for search performance that I also have 30 days of data at 300 GB per day sitting on the disks, and that this amount is perhaps still growing?
Background: I have search performance issues. If I theoretically cut the amount of stored data from 60 days to 24 hours, would it have a noticeable, positive impact on search performance (on the condition that I only search the latest 24 hours)?
Thank you very much for any insight in this!
Kind regards
I am using ES version 8.14.3. In approximately one month there will be an update to 8.17.x, and for now I am running a basic licence in this cluster.
Yes, these are syslog documents stored in a daily data stream.
cheers
In newer versions Elasticsearch is quite efficient at ruling out indices that cannot contain any matching data based on the timestamp range, so I would not expect much difference. You can probably test this by manually running a query with a timestamp filter against all indices and then against only the indices that you expect to find matches in.
Queries against many indices may require data to be fetched from disk, so if you have low I/O performance and a lot of indices, the storage performance could make a difference.
Hi Christian, thanks for your insight! Actually, the company I joined a few months ago is using spinning disks instead of SSD / NVMe, so I get low search performance on hot data, and I have to come up with a way to compensate a bit for this unfortunate setup.
Try the test I mentioned. First query just the newest indices by naming them and then run the same query against all indices irrespective of age. Make sure you are using a timestamp range that just matches the data in the first indices. The second query may hit the cache on the previously queried indices but that is fine as you want to see the potential impact of querying the rest of them. Please share here what the difference in latency is and how many indices were involved in each query.
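A minimal Python sketch of how that comparison could be scripted, in case it helps. It uses only the standard library and the regular `_search` endpoint. The base URL, the index names in the usage comment, the `@timestamp` field name and the API key handling are assumptions for illustration, so adjust them to your cluster. The query is only a stand-in with `size: 0`; put your real (slow) query inside the filter to get meaningful numbers.

```python
import json
import time
import urllib.request


def build_search_body(hours=24):
    """Query body that only matches documents from the last `hours` hours."""
    return {
        "size": 0,  # stand-in query: no hits are fetched, only shards are searched
        "track_total_hits": False,
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{hours}h", "lte": "now"}}}
                ]
            }
        },
    }


def run_search(base_url, target, body, api_key=None):
    """POST the body to <base_url>/<target>/_search and return the parsed JSON."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # assumption: the cluster accepts API key authentication
        headers["Authorization"] = f"ApiKey {api_key}"
    request = urllib.request.Request(
        f"{base_url}/{target}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def compare_targets(search, targets, body):
    """Run the same query against each target; collect latency and shard counts."""
    results = {}
    for target in targets:
        started = time.perf_counter()
        response = search(target, body)
        wall_ms = (time.perf_counter() - started) * 1000
        shards = response.get("_shards", {})
        results[target] = {
            "took_ms": response.get("took"),  # server-side time reported by Elasticsearch
            "wall_ms": round(wall_ms, 1),  # client-side time including network
            "shards_total": shards.get("total"),
            "shards_skipped": shards.get("skipped"),
        }
    return results


# Example usage (hypothetical URL and index names, not run here):
# search = lambda target, body: run_search("https://localhost:9200", target, body)
# newest = ".ds-logs-syslog-default-2025.01.02-000042"  # name your newest backing indices
# print(compare_targets(search, [newest, "*"], build_search_body(24)))
```

The `shards_skipped` value in the response should give a hint of how many shards were ruled out up front when querying everything, which is worth posting together with the latencies.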
There is not necessarily any magic solution that fixes slow storage. If you can get a couple of nodes with faster storage, I would recommend a hot-warm architecture though.
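As a rough sketch of what hot-warm could look like on the lifecycle side, here is an ILM policy body built in Python. It assumes the fast nodes carry the `data_hot` role and the spinning-disk nodes the `data_warm` role in `node.roles`; the function name, the priorities and the 60-day retention are illustrative, not taken from your setup. As far as I know, ILM moves an index to the warm tier automatically once it enters the warm phase, so no explicit allocation action is included.

```python
import json


def build_hot_warm_policy(hot_days=1, delete_after_days=60):
    """Return an ILM policy body: roll over daily, move to warm, delete later."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        # Start a new backing index once the current one is hot_days old.
                        "rollover": {"max_age": f"{hot_days}d"},
                        "set_priority": {"priority": 100},
                    }
                },
                "warm": {
                    # min_age counts from rollover, so by this point every document
                    # in the index is at least hot_days old.
                    "min_age": f"{hot_days}d",
                    "actions": {"set_priority": {"priority": 50}},
                },
                "delete": {
                    "min_age": f"{delete_after_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


# The resulting JSON would be sent with PUT _ilm/policy/<your-policy-name>
# and referenced from the index template of the data stream.
policy_json = json.dumps(build_hot_warm_policy(), indent=2)
```

With a policy along these lines, searches over the latest 24 hours would mostly touch indices on the fast nodes, while older data stays on the slow disks.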