Performance hit due to sort over date field

mats990 · May 24, 2019, 1:40pm

Hi,

We are trying to optimize query that should fetch top 50 newest documents from a large data set. That data set has 60,000,000 documents (filtered) across 92 indices, 7377 shards and 48 nodes. At that time all nodes in cluster perform really bad. CPU is idle, Load is very high, IO wait is very high, indexing latency increases almost double and indexing drops.

Query looks like this:

{
  "query": {
    "bool": {
      "filter": [
       //some filtering
      ],
      "must_not": [
       //more filtering
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "sort": [
    {
      "date_field": "desc"
    }
  ],
  "_source": [
    //list of fields
  ],
  "size": 10,
  "from": 0
}

Any hints that would help us optimize the query?

dadoonet · May 24, 2019, 3:00pm

If you know by any chance that you want to sort the most recent documents and that the most recent documents most likely happened over the last week or day, then you should may be first filter for the most recent event (a day or a week).
It will make the sort more efficient I believe as less documents will have to be sorted.

Also may be you don't need to reach all indices. If those are time based data, may be you should only look at the most recent indices...

But more than that, you should consider index time sorting with https://www.elastic.co/guide/en/elasticsearch/reference/7.1/index-modules-index-sorting.html. That should improve a lot the sort.

mats990 · May 25, 2019, 6:41am

Since we have daily indices we could filter the data by indices but it increases a complexity of a service that we use for searches.
We know about index time sorting but we are using nested fields so it won't work for us
Does Elasticsearch have some sorted data structure that could have been saved to disk which it will use in query time? Something like index in SQL. In that case we could request TOP 50 documents from data set of 60M and it will work fine since it will return only top 50 (presorted) documents from each shard, right?

dadoonet · May 25, 2019, 7:18am

This looks like to me the index sorted feature.
More on this at: https://www.elastic.co/fr/blog/index-sorting-elasticsearch-6-0

About nested data, I don't see the point. Do you mean that your date field is in nested documents?

mats990 · May 25, 2019, 9:52am

No, date is not nested, I think I misunderstood the warning in the documentation. It says: An error will be thrown if index sorting is activated on an index that contains nested fields. so I thought we cannot use sorting if any of the fields is nested in the index

system · June 22, 2019, 9:52am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Optimizing response time on always sorted queries Elasticsearch	2	469	July 5, 2017
Query performance Elasticsearch	1	301	July 6, 2017
Query Optimization Elasticsearch	2	437	November 4, 2020
Unbelievable performance improvement with new Index Sorting feature Elasticsearch	6	5745	April 24, 2018
Automatic skipping of indexes / shards for date-based indexing and index sorting Elasticsearch	3	300	November 22, 2022

Performance hit due to sort over date field

Related topics