Performance hit due to sort over date field

Hi,

We are trying to optimize a query that should fetch the top 50 newest documents from a large data set. The data set has 60,000,000 documents (filtered) across 92 indices, 7377 shards and 48 nodes. While the query runs, all nodes in the cluster perform really badly: CPU is idle, load is very high, IO wait is very high, indexing latency almost doubles and the indexing rate drops.

The query looks like this:

{
  "query": {
    "bool": {
      "filter": [
       //some filtering
      ],
      "must_not": [
       //more filtering
      ],
      "disable_coord": false,
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "sort": [
    {
      "date_field": "desc"
    }
  ],
  "_source": [
    //list of fields
  ],
  "size": 10,
  "from": 0
}

Any hints that would help us optimize the query?

If you know that the documents you want are the most recent ones, and that they most likely arrived within the last day or week, then you could first filter on that time range (a day or a week).
I believe that will make the sort more efficient, as fewer documents will have to be sorted.
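
A minimal sketch of that idea, assuming the request body is built as a Python dict before it is sent. The helper name `with_recent_window` and the default window `now-7d/d` are illustrative choices, not part of the original query; `date_field` is the sort field from the query above.

```python
import copy

def with_recent_window(body, field="date_field", since="now-7d/d"):
    """Return a copy of `body` with a range filter on `field` appended to bool.filter."""
    narrowed = copy.deepcopy(body)  # leave the caller's dict untouched
    bool_query = narrowed.setdefault("query", {}).setdefault("bool", {})
    # Elasticsearch date math: seven days back from now, rounded down to the day.
    bool_query.setdefault("filter", []).append({"range": {field: {"gte": since}}})
    return narrowed

# Stripped-down stand-in for the original request (the elided filters are left empty).
original = {
    "query": {"bool": {"filter": [], "must_not": []}},
    "sort": [{"date_field": "desc"}],
    "size": 10,
}
narrowed = with_recent_window(original)
```

If the narrowed search returns fewer hits than needed, the calling service can retry with a wider window (for example `now-30d/d`).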

Also, maybe you don't need to query all indices. If this is time-based data, maybe you should only look at the most recent indices...
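
As a rough sketch of what that could look like, assuming one index per day. The `logs-` prefix and the `%Y.%m.%d` suffix are hypothetical; substitute whatever naming scheme the cluster really uses.

```python
from datetime import date, timedelta

def recent_daily_indices(today, days, prefix="logs-", fmt="%Y.%m.%d"):
    """List the names of the `days` most recent daily indices, newest first."""
    return [
        prefix + (today - timedelta(days=offset)).strftime(fmt)
        for offset in range(days)
    ]

# Comma-separated index list for the search request path.
targets = ",".join(recent_daily_indices(date(2019, 6, 3), 3))
```

The resulting string goes in the request path in place of a wildcard. Adding `ignore_unavailable=true` to the search request avoids an error if one of the computed indices does not exist.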

But more than that, you should consider index-time sorting: https://www.elastic.co/guide/en/elasticsearch/reference/7.1/index-modules-index-sorting.html. That should improve the sort a lot.
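
A hedged sketch of the two request bodies involved, built as Python dicts. The function names are made up for illustration, and the typeless `mappings` shape assumes a 7.x cluster. The index sort has to be set when the index is created, so existing indices would need to be reindexed. Note also that the linked documentation says an error is thrown if index sorting is activated on an index that contains nested fields.

```python
def sorted_index_body(field="date_field", order="desc"):
    """Create-index body whose segments are kept sorted by `field`."""
    return {
        "settings": {"index": {"sort.field": field, "sort.order": order}},
        "mappings": {"properties": {field: {"type": "date"}}},
    }

def top_n_search_body(size, field="date_field", order="desc"):
    """Search body whose sort matches the index sort."""
    return {
        "size": size,
        "sort": [{field: order}],
        # Not counting all matches lets each shard stop once it has `size` hits.
        "track_total_hits": False,
    }

create_body = sorted_index_body()
search_body = top_n_search_body(50)
```

The early termination only applies when the search sort is the same as the index sort (or a prefix of it), which is why both helpers take the same `field` and `order`.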

Since we have daily indices, we could filter the data by index, but that increases the complexity of the service we use for searches.
We know about index-time sorting, but we are using nested fields, so it won't work for us.
Does Elasticsearch have some sorted data structure saved to disk that it can use at query time? Something like an index in SQL. In that case we could request the top 50 documents from a data set of 60M and it would work fine, since it would return only the top 50 (presorted) documents from each shard, right?
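
To make the per-shard picture concrete, here is a toy model in Python. It is not Elasticsearch code: it only assumes that each shard hands back its own hits already sorted newest first, and that a coordinating step merges those lists and keeps the overall top `size`. The integers stand in for `date_field` values.

```python
import heapq
from itertools import islice

def merge_shard_tops(per_shard_hits, size):
    """Merge per-shard lists (each sorted descending) and keep the global top `size`."""
    merged = heapq.merge(*per_shard_hits, reverse=True)  # lazy k-way merge
    return list(islice(merged, size))

# Three hypothetical shards, each returning its own top hits in descending order.
shards = [[9, 5, 1], [8, 7, 2], [6, 3]]
top = merge_shard_tops(shards, 4)
```

With presorted data each shard only needs to produce its first `size` entries, so the merge touches at most `size` items per shard instead of every matching document.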

This looks to me like the index sorting feature.
More on this at: https://www.elastic.co/fr/blog/index-sorting-elasticsearch-6-0

About nested data, I don't see the problem. Do you mean that your date field is in nested documents?

No, the date is not nested. I think I misunderstood the warning in the documentation. It says: "An error will be thrown if index sorting is activated on an index that contains nested fields." So I thought we could not use index sorting if any field in the index is nested.
