How to avoid the calculation for "hits.total" in search?

The result of a search request includes a "total" value. How can I stop Elasticsearch from calculating "hits.total"?
The reason is that we have a native script filter that is very heavy, so we need to avoid unnecessary calculation.

Result sample:
{
  "took": 79,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9581454,
    "hits": [
      {
        "_index": "test",
        "_type": "object",
        "_id": "001",
        "_score": 0.9581454,
        "fields": {
          "object_name": [
            "nested group object"
          ]
        }
      }
    ]
  }
}

Query sample:
GET /test/object/_search?pretty=true
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "object_name": "Nested"
          }
        }
      ],
      "filter": {
        "script": {
          "script": "nativefilter",
          "lang": "native",
          "params": {
            "user": "user1",
            "field": "testfield"
          }
        }
      }
    }
  },
  "fields": ["object_name"]
}

I don't think it's possible and I don't believe it would change your response time.

The native script filter is quite heavy in my implementation, so in most cases it would increase the response time if there is no optimization to avoid the calculation.

Sure. It will increase the time, but to what degree? 1 ms more?

But maybe you could share your native script so we could help to optimize it?

Thanks, David, for the follow-up. Assume each native script filter invocation costs 5 ms; then hits.total has a considerable cost.

You get the total hits essentially for free, because you need to know which documents matched, calculate their scores and then order the results by score. Even if you set "size":1, all matching documents still need to be scored to determine the "best" document to return.

Thanks, Yannick.

However, according to my understanding and testing, total hits is not free.

Please confirm this with the following sample query. To count the hits, every single item has to go through the native script filter; if the filter costs 5 ms per item, the cost is high if we let Elasticsearch calculate hits.total.

GET /test/object/_search?pretty=true
{
  "from":0,
  "size":1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "object_name": "object"
          }
        }
      ],
      "filter": {
        "script": {
          "script": "nativefilter",
          "lang": "native",
          "params": {
            "user":"user1",
            "field": "testfield"
          }
        }
      }
    }
  },
  "fields": ["object_name"]
}

What @ywelsch says is correct. Regardless of whether you use hits.total, you still need to run the native script on each document to determine whether it matches the query, just to get the top result. Elasticsearch does not give you the first document that matches your query; it attempts to give you the best document matching your query. To find the best matching document, you need to know all the documents that match the query, so keeping a running total of how many documents match as you go along does not affect performance significantly. As @ywelsch said, you effectively get this for free.

If you are trying to improve the performance of your query, I would instead try to optimise your native script so its cost per document is reduced.
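
To illustrate why the running count adds so little, here is a minimal Python sketch of top-k collection. It is a simplified in-memory model, not Elasticsearch's actual collector: `top_k_with_total`, `matches` (standing in for the native script filter) and `score` are hypothetical names, and `docs` stands for the candidates that already matched the match clause.

```python
import heapq


def top_k_with_total(docs, matches, score, k):
    """Return (best k matching docs, total matches, number of filter calls).

    Hypothetical model: `matches` plays the role of the expensive script
    filter and `score` the role of the query score.
    """
    heap = []          # min-heap holding the current top k as (score, -position, doc)
    total = 0
    filter_calls = 0
    for pos, doc in enumerate(docs):
        filter_calls += 1              # the filter runs once per candidate
        if not matches(doc):
            continue
        total += 1                     # the running total: one increment per match
        item = (score(doc), -pos, doc)  # -pos makes earlier docs win score ties
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    best = [doc for _, _, doc in sorted(heap, reverse=True)]
    return best, total, filter_calls
```

In this model the filter is called once per candidate whether or not `total` is tracked, because any later candidate could still turn out to be the best match; dropping the counter would save only the integer increment.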

You can use terminate_after to execute the script fewer times. This might
not give you the most relevant results in general, instead giving you the
most relevant results before it terminated. It isn't always what you want,
but it might be ok for you.
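
A rough model of that trade-off, as a Python sketch. It assumes a single shard and that collection stops once `terminate_after` matching documents have been collected in index order; `search_with_terminate_after`, `matches` and `score` are hypothetical stand-ins, not Elasticsearch APIs.

```python
def search_with_terminate_after(docs, matches, score, size, terminate_after):
    """Return (page of results, number of filter calls) for a single-shard model."""
    collected = []
    filter_calls = 0
    for doc in docs:                       # index order, not score order
        filter_calls += 1
        if matches(doc):
            collected.append(doc)
            if len(collected) >= terminate_after:
                break                      # later documents are never examined
    # Only the documents collected before termination compete for the page.
    collected.sort(key=score, reverse=True)
    return collected[:size], filter_calls
```

With three matching documents scored 0.2, 0.4 and 0.9 in index order, `size=1` and `terminate_after=2`, the sketch returns the 0.4 document after two filter calls and never sees the 0.9 document.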

Thanks, Nik. I have tried the terminate_after parameter. I assume the document set sent to the filter clause has already been ordered by score; is this correct? If not, how could we achieve this?

Thanks, Colin, for your detailed comments.

As I understand your comments, the current procedure is:

  1. Get the document set matching the match (or other query) clause
  2. Send the document set to the native script filter for filtering
  3. Return the results based on the score of the query clause

However, I want a procedure that swaps steps #2 and #3 above:

  1. Get the document set matching the match (or other query) clause
  2. Order the result set based on the score of the query clause
  3. Send the ordered document set to the native script filter, and stop filtering once the results fill the page size

Can Elasticsearch do this? If not, could this be added to the feature backlog for Elasticsearch 5.0?
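
The procedure requested above can be sketched in Python as follows. This only models the proposal on an in-memory list; it does not describe anything Elasticsearch does, and `filter_after_ranking`, `matches` and `score` are hypothetical names.

```python
def filter_after_ranking(docs, matches, score, size):
    """Rank first, then filter lazily until the page is full.

    Returns (page of results, number of filter calls).
    """
    ranked = sorted(docs, key=score, reverse=True)   # step 2: order by score
    page = []
    filter_calls = 0
    for doc in ranked:                               # step 3: filter in rank order
        if len(page) >= size:
            break                                    # page is full, stop filtering
        filter_calls += 1
        if matches(doc):
            page.append(doc)
    return page, filter_calls
```

The saving comes from stopping early, which also means this approach cannot report a total hit count: documents ranked below the last page entry are never tested against the filter.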

No, it isn't. It is ordered by whatever order Lucene hits the documents in. I can't think of a good way to make it ordered by score either.

The query rescorer will allow you to use your native script only on the top
n documents returned by the Lucene scorer:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-rescore.html

Ivan
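
A simplified Python model of the windowed rescoring idea, to show why the script runs at most `window_size` times. It assumes a single shard, a multiply score mode with both weights equal to 1, and that hits outside the window keep their position after the window; `rescore_top_window` and `script_score` are hypothetical names, not the rescorer's real implementation.

```python
def rescore_top_window(ranked, script_score, window_size):
    """Rescore only the first `window_size` hits.

    ranked -- list of (doc, score) pairs already sorted best first.
    Returns (new ranking, number of script calls).
    """
    script_calls = 0
    window = []
    for doc, original in ranked[:window_size]:
        script_calls += 1                  # the script only sees the window
        window.append((doc, original * script_score(doc)))
    # Re-sort the window by the combined score; ties keep their previous order.
    window.sort(key=lambda pair: pair[1], reverse=True)
    return window + ranked[window_size:], script_calls
```

Note that in this model a script that returns 0 pushes a hit to the bottom of the window with a score of 0 but does not remove it from the results, so rescoring is not a substitute for filtering.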

Hit send too early. Probably not what you need, but from my understanding,
native scripts are only for scoring, which is post filtering (but not post
post filtering, which is the phase where the query rescorer works).

Yeah, it is close-ish but isn't quite right.

I wonder if post_filter could do the job here. It isn't really built for this but it might do. It is worth investigating.

What the OP wants is terminate_after during the post_filter stage. Not
supported AFAIK, but interesting nonetheless. And using a native script as
the filter as well.

Hi Ivan, you are correct: terminate_after primarily (maybe only? could you please confirm) affects the query, not post_filter.
However, I find that size correctly controls post_filter.

GET /test/object/_search?pretty=true
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "object_name": "object two"
          }
        }
      ]
    }
  },
  "post_filter": {
    "script": {
      "script": "nativefilter",
      "lang": "native",
      "params": {
        "user": "user1",
        "field": "testfield"
      }
    }
  },
  "fields": [
    "object_name"
  ]
}

Hi Nik, it seems post_filter can work. In 2.3, global filter is the same as post_filter, right?

Yes.

Hi Ivan, rescore might not help me; the results that should be excluded are still returned, with "_score": 0.

GET /test/object/_search?pretty=true
{
  "size": 5,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "object_name": "object two"
          }
        }
      ]
    }
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "score_mode": "multiply",
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "script": "nativefilterrescore",
              "lang": "native",
              "params": {
                "user": "user3",
                "field": "testfield"
              }
            }
          }
        }
      }
    }
  },
  "fields": [
    "object_name"
  ]
}

Result sample:
{
  "took": 85,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 14,
    "max_score": 1.691042,
    "hits": [
      {
        "_index": "test",
        "_type": "object",
        "_id": "00000000000000000003",
        "_score": 0.10011096,
        "fields": {
          "object_name": [
            "object three"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "object",
        "_id": "00000000000000000012",
        "_score": 0,
        "fields": {
          "object_name": [
            "multiple user group object two"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "object",
        "_id": "00000000000000000001",
        "_score": 0,
        "fields": {
          "object_name": [
            "object one"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "object",
        "_id": "00000000000000000002",
        "_score": 0,
        "fields": {
          "object_name": [
            "object two"
          ]
        }
      },
      {
        "_index": "test",
        "_type": "object",
        "_id": "00000000000000000004",
        "_score": 0,
        "fields": {
          "object_name": [
            "object four"
          ]
        }
      }
    ]
  }
}