Index Sorting and terminate_after combination

I have an index in ES 6.6 used to search regular content that should be scored differently based on the user. Today we have more than 50 fields that can be used to calculate this score.

In order to speed up the query, I decided to use early termination, as described here.

I created a field based on a combination of all individual fields used for scoring and sorted the index with this value.
However, I can't sort the results of the query based on this single field (since the sort order will depend on each user), so I started using the terminate_after parameter in the query.
After doing some evaluations I noticed that terminating the query after 1000 elements was good enough to have a set of 10 results with the optimal score for each user.

This seemed to work well and I got the speedup that I wanted. However, I later realized that this only works if each shard in the index has only one segment. I call _forcemerge with max_num_segments=1 during index creation, but I can't do that for index updates during normal traffic.

After updating the index, new segments are created and they are completely ignored by the query if we reach enough matches for the terminate_after parameter in the original segment.
This happens even if the field that defines the index sorting would place the document in the new segment as the top result.

My questions now:

  • Is that a bug in ES, or is this a mix of features that isn't really meant to work together?

  • Is there an alternative that keeps the benefits of early termination but still lets me re-sort the results by different fields at a later stage?

I put together a script that can be executed in Kibana to reproduce the issue:

PUT the-index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    },
    "sort": {
      "field": "combined_score",
      "order": "desc"
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": { "type": "text" },
        "score_field_1": { "type": "integer" },
        "score_field_2": { "type": "integer" },
        "score_field_3": { "type": "integer" },
        "combined_score": { "type": "integer" }
      }
    }
  }
}

PUT /the-index/doc/id1
{
  "content": "content 1",
  "score_field_1": 10,
  "score_field_2": 40,
  "score_field_3": 100,
  "combined_score": 150
}

PUT /the-index/doc/id2
{
  "content": "content 2",
  "score_field_1": 90,
  "score_field_2": 40,
  "score_field_3": 10,
  "combined_score": 140
}

PUT /the-index/doc/id3
{
  "content": "content 3",
  "score_field_1": 80,
  "score_field_2": 40,
  "score_field_3": 10,
  "combined_score": 130
}

PUT /the-index/doc/id4
{
  "content": "content 4",
  "score_field_1": 70,
  "score_field_2": 40,
  "score_field_3": 10,
  "combined_score": 120
}

PUT /the-index/doc/id5
{
  "content": "content 5",
  "score_field_1": 60,
  "score_field_2": 40,
  "score_field_3": 10,
  "combined_score": 110
}

GET /the-index/_segments

POST /the-index/_forcemerge?max_num_segments=1

# returns the correct result: id1 is the top hit
GET /the-index/_search
{
  "terminate_after": 2,
  "size": 1,
  "query": {
    "function_score": {
      "query": {
        "match": {
          "content": "content"
        }
      },
      "field_value_factor": {
        "field": "score_field_3"
      }
    }
  }
}

# index id1 again, this will create a new segment
PUT /the-index/doc/id1
{
  "content": "content 1",
  "score_field_1": 10,
  "score_field_2": 40,
  "score_field_3": 100,
  "combined_score": 150
}

# returns an incorrect result: id2 is now the top hit
GET /the-index/_search
{
  "terminate_after": 2,
  "size": 1,
  "query": {
    "function_score": {
      "query": {
        "match": {
          "content": "content"
        }
      },
      "field_value_factor": {
        "field": "score_field_3"
      }
    }
  }
}


The query phase visits each segment independently, but terminate_after is a global count per shard, so there's no guarantee that you'll see a sorted stream even if the index is sorted. As you already noticed, it works if you force merge all shards to 1 segment, but that's just a side effect of the fact that the index is sorted. It's not a bug, since terminate_after does not change the way documents are collected.
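To make the counting behaviour concrete, here is a toy Python model of it. This is only a sketch and not how Lucene is implemented: the helpers collect, top_hit and make_doc are hypothetical names, the per-user score is reduced to score_field_3 alone (the real function_score also multiplies in the match score), and I assume the original segment is visited before the new one. The documents are the ones from the script in the question.

```python
# Toy model only: this is NOT Lucene's collector, just the counting behaviour
# described in the reply. Each segment is a list of docs already sorted by
# combined_score (descending), mirroring the index sort.

def collect(segments, terminate_after):
    """Visit segments in order; stop once terminate_after docs are collected."""
    collected = []
    for segment in segments:
        for doc in segment:
            if len(collected) >= terminate_after:
                return collected  # remaining segments are never visited
            collected.append(doc)
    return collected


def top_hit(docs, score_field):
    """Return the id of the collected doc with the highest per-user score."""
    return max(docs, key=lambda doc: doc[score_field])["id"]


def make_doc(doc_id, score_field_3, combined_score):
    return {
        "id": doc_id,
        "score_field_3": score_field_3,
        "combined_score": combined_score,
    }


id1 = make_doc("id1", 100, 150)
id2 = make_doc("id2", 10, 140)
id3 = make_doc("id3", 10, 130)
id4 = make_doc("id4", 10, 120)
id5 = make_doc("id5", 10, 110)

# After _forcemerge: one segment, globally sorted by combined_score.
merged = [[id1, id2, id3, id4, id5]]

# After re-indexing id1: the old copy is deleted from the first segment and
# the new copy lives alone in a second segment.
after_update = [[id2, id3, id4, id5], [id1]]

before = top_hit(collect(merged, 2), "score_field_3")
after = top_hit(collect(after_update, 2), "score_field_3")
print(before, after)  # id1 id2
```

With a single segment the first two collected documents are the two best by combined_score, so id1 is found. With two segments the budget of 2 is used up in the first segment and id1 is never looked at.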
Today the only way to achieve what you want would be to retrieve an extended top N sorted by the index sort criteria and to sort/prune this result client-side using the value of a field extracted from the _source or from the doc values. This is not ideal, so I think it's worth opening a feature request on GitHub to discuss a full solution that would not require extra logic in the client. Could you open one?
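A minimal sketch of that client-side approach in Python, under a few assumptions: build_request and rerank are hypothetical helper names, the per-user score is simplified to a single field read from _source, and the request body follows the early-termination pattern from the index sorting docs (sort by the index sort field with track_total_hits disabled). The sample hits only mimic the shape of a search response, so no cluster is needed to run it.

```python
def build_request(top_n):
    """Search body that fetches an extended top N in index-sort order."""
    return {
        "size": top_n,
        # Same field and order as the index sort, so each segment can stop early.
        "sort": [{"combined_score": "desc"}],
        "track_total_hits": False,
        "query": {"match": {"content": "content"}},
    }


def rerank(hits, score_field, size):
    """Re-sort the extended top N by a per-user field and keep the best `size`."""
    ranked = sorted(
        hits,
        key=lambda hit: hit["_source"][score_field],
        reverse=True,
    )
    return ranked[:size]


# Stand-in for response["hits"]["hits"], using values from the example index.
sample_hits = [
    {"_id": "id1", "_source": {"score_field_1": 10, "score_field_3": 100}},
    {"_id": "id2", "_source": {"score_field_1": 90, "score_field_3": 10}},
    {"_id": "id3", "_source": {"score_field_1": 80, "score_field_3": 10}},
]

request_body = build_request(top_n=1000)
best_for_field_1 = rerank(sample_hits, "score_field_1", size=1)
best_for_field_3 = rerank(sample_hits, "score_field_3", size=1)
print(best_for_field_1[0]["_id"], best_for_field_3[0]["_id"])  # id2 id1
```

Because the query is sorted by the index sort field, the extended top N is correct across all segments, and only the final per-user ordering happens in the client. The value 1000 is taken from the evaluation mentioned in the question.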