I was researching search performance tuning for my cluster and noticed that the Elastic Agent integrations don't sort their indices on @timestamp. In release 7.6, Elastic made a big deal of the performance gains from sorting on date fields. I understand that this wasn't easy to implement at the time: Beats all wrote to the same index, and index sorting has limitations (it can't be used with nested fields), so it would have been pretty much impossible. But now that Integrations and data streams exist, why hasn't sorting on @timestamp been applied at the individual data stream level?
I don't have any real performance numbers to tell whether the trade-off of sorting on @timestamp is worth it, so I'm mostly just curious whether this has even been considered for Integrations.
Hi @BenB196, thanks for bringing this up. We are indeed running quite a few benchmarks around exactly this problem and plan to follow up on data streams so that they get all the benefits.
@jpountz I tried to find a related issue; maybe you can help by linking to a public one?
I don't think there's a public issue about it but I can give some context.
The 7.6 performance improvements that you linked actually do not require indices to be sorted by @timestamp. This is a different optimization that leverages the index in order to skip over non-competitive documents. For instance, if you're looking for the 10 most recent documents sorted by @timestamp, and you just saw 10 documents that all had a @timestamp of 2021-09-15 or later, then you can use the index to exclude all documents that have an older timestamp. This optimization only works when the request has no aggregations. So there is ongoing work on the Kibana team to update Discover to run two separate requests, one to fetch the most recent logs and one to compute the date_histogram of matches, so that we can take advantage of this optimization. That is easy in theory but a bit less so in practice, which is why it is taking some time.
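To make the skipping idea concrete, here is a toy Python sketch. It is not how Lucene implements it: the "blocks" of timestamps and their precomputed maximum are hypothetical stand-ins for the bounds that the real index structure provides, and the function name is made up for illustration. It only shows why whole groups of documents can be skipped once the top 10 (here, top k) is full.

```python
import heapq


def top_k_most_recent(blocks, k):
    """Return (k most recent timestamps, number of documents visited).

    `blocks` is a list of lists of integer timestamps. Each block stands in
    for a group of documents whose max timestamp is known without reading
    the documents (an assumption of this toy model).
    """
    # Precompute per-block upper bounds; in a real index these would
    # already be available as part of the index structure.
    bounds = [max(block) for block in blocks]

    heap = []      # min-heap of the k most recent timestamps seen so far
    visited = 0    # how many individual documents we had to look at

    for block, upper in zip(blocks, bounds):
        # Once k hits are collected, heap[0] is the oldest competitive hit.
        # A block whose newest document is not newer than that cannot
        # contribute anything, so skip it entirely.
        if len(heap) == k and upper <= heap[0]:
            continue
        for ts in block:
            visited += 1
            if len(heap) < k:
                heapq.heappush(heap, ts)
            elif ts > heap[0]:
                heapq.heapreplace(heap, ts)

    return sorted(heap, reverse=True), visited


if __name__ == "__main__":
    blocks = [[100, 101, 102], [1, 2, 3], [50, 51, 52], [103, 104, 105]]
    hits, visited = top_k_most_recent(blocks, k=3)
    print(hits, "visited", visited, "of", sum(len(b) for b in blocks))
```

In this sample the two blocks of older timestamps are never opened. An aggregation such as a date_histogram has to see every matching document, which is why it defeats this kind of skipping.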
Whether we should sort data streams by @timestamp is a good question too. It would certainly give an even greater speedup over the aforementioned optimization. There is a trade-off, though: index sorting makes indexing a bit slower, and many of our users care a lot about their indexing rate. In my opinion the 7.6 optimization (which we have iterated on since the 7.6 release), which doesn't require index sorting, is probably a good default, since it should give good search performance in Discover in the future without requiring users to add index-time overhead.
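For anyone who wants to measure the trade-off on their own data, here is a minimal sketch of what opting in could look like. It only builds the body of an index template as a Python dict; `index.sort.field` and `index.sort.order` are the standard index sorting settings, but the function name and the `logs-myapp-*` pattern are hypothetical, and this is untested against a real cluster. It also leaves out things a real template would likely need, such as priority and mappings, and it does not cover how to wire this into an Integration's managed template.

```python
import json


def sorted_data_stream_template(pattern):
    """Build an index template body that sorts backing indices by @timestamp.

    `pattern` is a hypothetical data stream pattern, e.g. "logs-myapp-*".
    Index sorting can only be set when an index is created, so for a data
    stream it would apply to backing indices created after the change.
    """
    return {
        "index_patterns": [pattern],
        "data_stream": {},
        "template": {
            "settings": {
                "index": {
                    # Most recent documents first within each segment.
                    "sort.field": "@timestamp",
                    "sort.order": "desc",
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(sorted_data_stream_template("logs-myapp-*"), indent=2))
```

Comparing indexing throughput with and without these two settings on a representative data stream would show how much index-time overhead the sort adds in practice.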