We are trying to optimize a query that fetches the top 50 newest documents from a large data set: 60,000,000 documents (after filtering) across 92 indices, 7,377 shards, and 48 nodes. While it runs, every node in the cluster performs really badly: CPU is idle, load is very high, I/O wait is very high, indexing latency roughly doubles, and indexing throughput drops.
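For reference, the query is essentially just a descending sort on a date field, roughly like this sketch using the Python client (the `timestamp` field and the `docs-*` index pattern are placeholders for our real names):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch the 50 newest documents across all matching daily indices.
resp = es.search(
    index="docs-*",  # matches all 92 indices
    body={
        "size": 50,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {"match_all": {}},  # plus our actual filters
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["timestamp"])
```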
If you know that you want to sort for the most recent documents, and that the most recent documents most likely occurred within the last day or week, then maybe you should first filter for that recent window (a day or a week).
I believe it will make the sort more efficient, as fewer documents will have to be sorted.
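Something along these lines, reusing the client from the first post (a sketch; the `timestamp` field name is an assumption):

```python
# Filter down to the last 7 days first, then sort only what is left.
resp = es.search(
    index="docs-*",
    body={
        "size": 50,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": "now-7d/d"}}}
                ]
            }
        },
    },
)
```

If a week turns up fewer than 50 hits, you can always fall back to a wider window.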
Also, maybe you don't need to hit all the indices. If the data is time-based, maybe you should only look at the most recent indices...
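With daily indices that could look something like this (a sketch assuming a hypothetical `docs-YYYY.MM.DD` naming scheme):

```python
from datetime import date, timedelta

# Build the index names for the last 7 days only.
days = [date.today() - timedelta(days=n) for n in range(7)]
recent = ",".join(f"docs-{d:%Y.%m.%d}" for d in days)

resp = es.search(
    index=recent,
    ignore_unavailable=True,  # skip days with no index
    body={
        "size": 50,
        "sort": [{"timestamp": {"order": "desc"}}],
    },
)
```

That way only a handful of shards are touched instead of all 7,377.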
Since we have daily indices we could filter the data by index, but that increases the complexity of the service we use for searches.
We know about index-time sorting, but we are using nested fields, so it won't work for us.
Does Elasticsearch have some sorted data structure that can be saved to disk and used at query time? Something like an index in SQL. In that case we could request the top 50 documents from a data set of 60M and it would perform fine, since each shard would only return its top 50 (presorted) documents, right?
No, the date field is not nested. I think I misunderstood the warning in the documentation. It says: "An error will be thrown if index sorting is activated on an index that contains nested fields." So I thought we could not use index sorting if any field in the index is nested.
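If that is the case, index sorting on the date field is exactly the presorted on-disk structure I was asking about. A sketch of what we would try (index sorting can only be set at index creation time; the index and field names are placeholders):

```python
# Index sorting must be configured when the index is created.
es.indices.create(
    index="docs-2024.06.10",
    body={
        "settings": {
            "index.sort.field": "timestamp",
            "index.sort.order": "desc",
        },
        "mappings": {
            "properties": {
                "timestamp": {"type": "date"},
            }
        },
    },
)

# When the search sort matches the index sort, each shard can stop
# after collecting its top 50 instead of visiting every document.
resp = es.search(
    index="docs-*",
    body={
        "size": 50,
        "sort": [{"timestamp": {"order": "desc"}}],
        "track_total_hits": False,  # allow early termination
    },
)
```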