I'm testing some scenarios of daily vs monthly indices and noticed something that might be counter intuitive,
I have 1 index with 69,000,000 documents 8.4GB and another monthly index (same data, but across many days in one index) of about 2,000,000,000 documents 220GB. I'm testing a query that has 3 aggregations (sums mainly) and I found that there is no real difference in query times when querying the daily index or the monthly index. My query is the same for both and queries a full day.
Does this make sense? I was under the impression that ES will perform better when indices are smaller since it will have a smaller data set to filter from.
My configuration is: 1 node, 32GB memory (16GB is allocated for ES), 5 shards.
I thought that this could mean that ES is mostly busy in the aggregations rather than in filtering out the relevant documents, but if I just perform the time range filter on both indices, I get very similar results (502ms on the large index vs 508ms on the small one)
Elasticsearch doesn't have to look at the documents to figure out if your query selects it. The times are indexed and it reads all the documents that match from the index.
Now that you have both indexes you may want to try stuff like adding other filters to query and see if it still performs well. Both ought to perform fairly similarly no matter what you add.
If you want to see what it would be like for Elasticsearch to have to visit every document to determine if it is a hit or not you can rewrite your date range filter as a script. Scripts don't get to use the index, instead they use column-wise storage of the documents like the aggregations do.
So you're saying that an index of 100,000 documents will perform the same as 100,000,000,000 documents for the same range filter? and aggregations as well as long as I'm aggregating the same amount of documents on both? If so then why bother with daily indices at all? When would daily indices have an impact?
Daily indices are mostly a good idea because deleting by a range query is very inefficient but deleting a whole index is very, very fast.
Secondarily they allow the request cache to kick in more efficiently, though this will only make a big difference in 5.0. In 5.0 Elasticsearch can rewrite requests before the request cache, allowing it to efficiently cache large aggregations across multiple indexes that include the current time. Like if you wanted to aggregate all the results over the last 7 days, 5 of those days always have the same data (the first and last don't). The rewrite lets those stay cached.
Another thing that matters: there is a hard limit of 2^31 documents per shard. You just can't add more than that many to a shard.
The problem with daily indices is that have too many shards (and indices) can make maintaining the system difficult. Indicies have memory overhead both in terms of cluster state and writes and readers. Having daily indices with multi-year retention policies can yield fairly slow cluster state updates. We've been working separately to speed up that cluster state maintenance stuff, but the memory overhead won't vanish no matter what we do.
So 5.0 has two things: rollover and shrink. These were designed to allow you to replace daily indices with something more fluid like "start a new index every week or if the index gets to be 20GB or if the number of documents gets to be 50 billion".
If you look at the architecture of an Elasticsearch/Lucene index, you will notice it is an inverted index. Inverted means, not the documents are indexed, but the terms of the documents. The terms are in a dictionary and the term positions are recorded in a posting list, with constant lookup time once they are loaded into main memory. So if you have have few terms and millions or billions of docs, the search algorithm does not slow down in it's runtime complexity.
Aggregations add a substantial amount of time depending on the type of aggregation. Typically they do not iterate over the documents, that would be too slow - they estimate cardinality for example - but it's hard to comment without seeing your aggregation. This is totally different algorithm from query.
Another factor is the CPU power of the node. You have chosen single node / 5 shards, maybe with 10 or 20 segments each. For most CPUs, this is no challenge, it correlates with the maximum CPU core count. For concurrent execution, the Java VM spawns threads, and that thread count must be available on the node, or search subtasks will get queued. By dividing the work over available threads, search time of a node is in reality bound by the CPU multi tasking power. You can test this by creating e.g. 1000 shards on a node for a single index - beside using more memory, search time will get longer because resources for query executions will exceed maximum ES thread pool size, which is adapted to the available CPU core count.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.