How to filter docs by text field length?

There is an ElasticSearch 7.3.2 instance, with index containing "message" field of text type.

I need to get all docs with message length longer than 100000 chars (1million):

curl -XPOST -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['message'].toString().length() > 10000,
            "lang": "painless"
          }
        }
      }
    }
  },
  "script_fields": {
    "A": { "script": { "lang": "painless", "source": "params._source.message.toString().length()" } },
    "B": { "script": { "lang": "painless", "source": "doc['message'].toString().length()" } },
    "C": { "script": { "lang": "painless", "source": "doc['message'].length" } },
    "D": { "script": { "lang": "painless", "source": "doc['message'].size()" } }
  }
}
' "127.0.0.1:9200/myindex/_search?pretty=true"

Result:

"fields" : {
  "A" : [ 2155780 ],
  "B" : [ 13206 ],
  "D" : [ 1514 ],
  "C" : [ 1514 ]
}

Why all they are different?
What should be used in query filter?
I need value from "A", but I cannot use it in query filter because it uses _source var that's unknown in filter script.

Using scripted fields is quite the performance killer in this example. The most performant solution to this would be to use an ingest pipeline with a script processor that stores the length of the string in a dedicated field.

The reason for the different values is that some data gets retrieved from the inverted index (where it already might have been shortened when there are long fields) where as the first one uses the JSON source.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.