Analyzer conditional token filter with regular expression

Hello, in an analyzer conditional token filter, I use a Painless script with a regular expression. If a token contains only letters and hyphens, the compound word in the token should be split; otherwise it should be left as is.

But my script below doesn't work. If I remove the conditional token filter, I get several words. What's wrong? Thanks.

GET /test/_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "condition",
      "filter": ["word_delimiter_graph"],
      "script": {
        "lang": "painless",
        "source": "token.toString() ==~ /^[A-Za-z-]+$/"
      }
    }
  ],
  "explain": true,
  "text": "WORD-IS-SPLITTED"
}

In the response, the word is not split:

"tokenfilters" : [
      {
        "name" : "__anonymous__condition",
        "tokens" : [
          {
            "token" : "WORD-IS-SPLITTED",
            "start_offset" : 0,
            "end_offset" : 16,
            "type" : "word",
            "position" : 0,
            "bytes" : "[57 4f 52 44 2d 49 53 2d 53 50 4c 49 54 54 45 44]",
            "keyword" : false,
            "positionLength" : 1,
            "termFrequency" : 1
          }
        ]
      }
    ]

Hi @Wonder_Garance

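Replace token.toString() with token.getTerm().toString(). In the condition script, token is an object describing the token (term, position, offsets, type), not the term text itself, so the regular expression has to be matched against the term: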

GET /test/_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "condition",
      "filter": ["word_delimiter_graph"],
      "script": {
        "lang": "painless",
        "source": "token.getTerm().toString() ==~ /^[A-Za-z-]+$/"
      }
    }
  ],
  "text": "WORD-IS-SPLITTED"
}
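
To illustrate why the original script probably never matched, here is a minimal plain-Java sketch (Painless is close to Java here). The `Token` class is a hypothetical stand-in for the script's `token` variable, not the real Elasticsearch class; the sketch assumes the real object, like this one, does not override `toString()`. The `matches()` call mirrors what the Painless `==~` operator does (a match against the whole string).

```java
import java.util.regex.Pattern;

public class TokenRegexSketch {
    // Hypothetical stand-in for the script's `token` variable (assumption):
    // it wraps the term and does NOT override toString().
    public static final class Token {
        private final CharSequence term;

        public Token(CharSequence term) {
            this.term = term;
        }

        public CharSequence getTerm() {
            return term;
        }
    }

    // Same pattern as in the script: letters and hyphens only.
    static final Pattern LETTERS_AND_HYPHENS = Pattern.compile("^[A-Za-z-]+$");

    // Mirrors `token.toString() ==~ /.../`: tests the object's default
    // string form (class name + "@" + hash), not the term.
    public static boolean matchesObject(Token token) {
        return LETTERS_AND_HYPHENS.matcher(token.toString()).matches();
    }

    // Mirrors `token.getTerm().toString() ==~ /.../`: tests the term text.
    public static boolean matchesTerm(Token token) {
        return LETTERS_AND_HYPHENS.matcher(token.getTerm().toString()).matches();
    }

    public static void main(String[] args) {
        Token token = new Token("WORD-IS-SPLITTED");
        System.out.println(matchesObject(token)); // false
        System.out.println(matchesTerm(token));   // true
    }
}
```

Since `==~` already matches the whole string, the `^` and `$` anchors in the pattern are harmless but not required.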

Hi @RabBit_BR
It works, thank you!
