How to find number of characters in a text field

nestor1 · July 20, 2022, 6:23pm

I have an application that writes many documents to Elasticsearch. After a year or two, I figured out that some of these documents are scams because they contain very long strings (more than 10000 characters). For example, one document looks like this:

Message: "bnfgbjkcywcbyftetzodbpgipcdoxgedjxqbfmcjiwlkceyehnwpwhlcfpbivaflaphvlplgeqirctmdyyoasqhhgfopvktgeupughwrteqadrlcmeauxktggoopycijrwenoesdtewvkgsdhafptepxqfidgdpjozvqafbkkshoiokaosqypwxpmttgzntpbdnk...[ up to 10000 characters ]"

The Message field is mapped as text with a keyword sub-field. When a document like this is indexed, only the text field gets a value, because the keyword sub-field has ignore_above set to 256 by default.

I wanted to find documents whose Message value is longer than 1000 characters, but I can't apply the script below because Message.keyword has no value for those documents, only Message (as text) does.

GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "doc['Message'].value.length() > 1000",
            "lang": "painless"
          }
        }
      }
    }
  }
}

This throws an error saying that text fields are not optimized for operations that require per-document field data, like aggregations and sorting.

If I use doc['Message.keyword'].value.length() > 1000 instead, I get an error because those documents don't have a keyword value (the string is longer than 256 characters).

Any idea how else I can find the documents that contain only one long string (longer than N characters)? I would like to reindex into a new index without those documents and apply a proper mapping, so that in the future such documents will be rejected by Elastic.

Thanks!

stu · July 20, 2022, 6:59pm

Any idea how else can I check these documents that contain only 1 long string (greater than N number of characters)

Hi @nestor1,
Because there's no keyword field, you'll have to hit the source.

While the filter context does not have source access, the runtime field context has access to source via params['_source'].

You can use a boolean runtime field along with a term query to perform the logic you want.

GET my-index/_search
{
  "runtime_mappings": {
    "tooLong": {
      "type": "boolean",
      "script": {
        "source": """
        def msg = params['_source'].get('Message');
        if (msg instanceof List) {
          emit(msg.size() == 1 && msg[0].length() >= params.maxSize);
        } else if (msg instanceof String) {
          emit(msg.length() >= params.maxSize);
        } else {
          emit(false);
        }
        """,
        "params": {"maxSize": 1000}
      }
    }
  },
  "query": {
    "term": {
      "tooLong": true
    }
  }
}
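
To sanity-check the branching before running it against a cluster, the same logic can be mirrored locally. This is only a rough Python sketch, not the Painless script itself: the function name is_too_long and the sample documents are made up for illustration, and Python's len() counts code points while Java's String.length() counts UTF-16 code units, so results can differ for text outside the Basic Multilingual Plane.

```python
from typing import Any, Mapping


def is_too_long(source: Mapping[str, Any], field: str = "Message", max_size: int = 1000) -> bool:
    """Local mirror of the runtime field's branching (hypothetical helper)."""
    msg = source.get(field)
    if isinstance(msg, list):
        # Array in _source: flag only when it holds exactly one long string.
        # The isinstance check on msg[0] is an extra guard not in the script above.
        return len(msg) == 1 and isinstance(msg[0], str) and len(msg[0]) >= max_size
    if isinstance(msg, str):
        return len(msg) >= max_size
    # Missing field or a non-string value: never flagged.
    return False


# Made-up _source samples covering each branch.
samples = [
    {"Message": "x" * 10000},        # single long string
    {"Message": ["y" * 1000]},       # one-element array at the threshold
    {"Message": "short"},            # short string
    {"Message": ["a" * 2000, "b"]},  # array with more than one value
    {"Other": "z" * 5000},           # field missing
]
flags = [is_too_long(doc) for doc in samples]
print(flags)
```

Note that the comparison is >=, so a string of exactly maxSize characters is flagged too; switch to > if you need "strictly longer than".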

nestor1 · July 20, 2022, 7:56pm

Thank you, sir, this works wonderfully. I didn't even know about the runtime field context feature. Thanks a lot!
