Hello,
We have an index that gets ~15M documents a day, each with an epoch-milliseconds timestamp.
Date aggregations seem to work fine, but the bigger the index, the longer the query takes.
One option would be to split the index per day, which would give us more flexibility and much better performance.
Another idea was to "round" the millisecond timestamp to the minute, so that instead of the exact millisecond, each document is indexed with the millisecond value of the start of its minute.
My question is: would that improve performance?
Considering that the field would have far fewer distinct values (up to 60,000 times fewer), would it make the query faster?
I failed to find any example like that, nor recommendation for such an action.
Regards,
Shushu
If your use case allows you to round dates to minute resolution at index time, then this would definitely be a good thing to do, as it would reduce the index size, as you said. The saving comes partly from having fewer terms in the inverted index, and partly because the GCD compression used in doc values and the compression of the source field would both be more efficient.
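To make the rounding and the GCD point concrete, here is a minimal Python sketch. The function names and sample timestamps are made up for illustration, and the GCD step is only a simplified picture of the idea, not Lucene's actual doc values encoder.

```python
from functools import reduce
from math import gcd

MS_PER_MINUTE = 60_000


def round_down_to_minute(ts_ms: int) -> int:
    """Truncate an epoch-milliseconds timestamp to the start of its minute."""
    return ts_ms - (ts_ms % MS_PER_MINUTE)


def common_divisor(values):
    """Greatest common divisor of all values (simplified view of GCD compression)."""
    return reduce(gcd, values)


# Hypothetical sample timestamps (epoch milliseconds).
raw = [1462060812345, 1462060875001, 1462061000999]
rounded = [round_down_to_minute(ts) for ts in raw]

# Every rounded value is a multiple of 60000, so a GCD-based encoder can
# store value // 60000 instead of the full value, which needs fewer bits.
divisor = common_divisor(rounded)
compact = [ts // divisor for ts in rounded]
```

In practice you would apply the rounding in whatever code builds the document before it is sent for indexing.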
In terms of query performance, I am not sure you would see a lot of improvement. In 2.x, numeric fields (date fields are actually indexed as long fields) use trie encoding to enable faster range querying. This indexes each value as multiple terms at different resolutions (e.g. in a base-10 trie encoding, 124 could be indexed as 124, 120 and 100). This means that at query time the number of terms we need to search can be minimised by using these different levels of resolution. Rounding your values to the nearest minute will mean you have fewer terms at the lowest levels (hence the reduction in inverted index size), but on most ranges only a few of these terms will be used anyway. You may see an increase in query performance for small ranges, where the proportion of these low-level terms is high compared with the total number of terms used for the query, but I would have thought that for long ranges you would not see a significant performance increase, since the proportion of these low-level terms used would be small.
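As a rough illustration of the base-10 example above, the following Python sketch generates the terms for a value at each resolution level. It is a toy model under my own assumptions: the real encoding works on bits with a configurable precision step, and the function names here are hypothetical.

```python
def trie_terms(value: int, base: int = 10, levels: int = 3):
    """Return (level, term) pairs: the value truncated at each resolution.

    Level 0 is the exact value; each higher level drops one more base-10 digit.
    The level is kept alongside the term so that, say, 120 at level 0 is not
    confused with 120 at level 1.
    """
    return [(lvl, value - value % base**lvl) for lvl in range(levels)]


def distinct_terms_per_level(values, base: int = 10, levels: int = 3):
    """Count how many distinct terms a set of values produces at each level."""
    seen = [set() for _ in range(levels)]
    for v in values:
        for lvl, term in trie_terms(v, base, levels):
            seen[lvl].add(term)
    return [len(s) for s in seen]


terms_for_124 = trie_terms(124)
counts = distinct_terms_per_level(range(1000, 1100))
```

Here `terms_for_124` holds the terms 124, 120 and 100, and `counts` shows that 100 consecutive values produce 100 terms at the lowest level but only 10 and 1 at the higher levels. Rounding at index time mostly shrinks that lowest level, which a long range query barely touches.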
So the upshot of this is that you should do this if you can because you should see a good reduction in index size, but you may not see any change in query performance.
Thanks !
That helps, though it means I would rather not spend time on this, since my main goal was to improve query performance.
It is cool to know that Elasticsearch has these kinds of capabilities built in.