Create an analyzer to tokenize non-alphanumeric characters


(Josh Harrison) #1

I want a custom analyzer that can take a string like "((hello world!))" and give me a token list of:
["(", "(", "hello", "world", "!", ")", ")"]
That is to say, I basically want the "letter" tokenizer, but I want to keep the non-letter characters and emit each of them as a single-character token.

Is this feasible?


(Alexander Reelsen) #2

Hey there,

have you considered using the pattern tokenizer to define your own regex for tokenization?

--Alex


(Josh Harrison) #3

Ah, ok - this isn't well documented, but to write a pattern that matches the tokens themselves (rather than the delimiters, which is the default) you have to set the "group" value!
I've put together the following:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(\\W|\\w+)",
          "group": 1,
          "flags": "UNICODE_CASE"
        }
      }
    }
  }
}
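
For reference, an _analyze request along these lines (assuming the index is named my_index; the request-body form of _analyze may differ slightly between ES versions) can be used to eyeball the resulting tokens:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "((hello world!))"
}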

I'd ideally like to use the UNICODE_CHARACTER_CLASS Java regex flag, but I get an error when I include it. Without it, text in Chinese, Japanese, etc. is treated as non-letters, so each character becomes its own single-character token. Is there any way to handle this without resorting to a CJK-specific analyzer?


(Nik Everett) #4

If you have an idea for improving the documentation for the tokenizer, could you open an issue? I'd certainly be happy to review it.

Both UNICODE_CHAR_CLASS and UNICODE_CHARACTER_CLASS should work. What error are you seeing?


(Josh Harrison) #5

When I attempt to PUT

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter":["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(\\W|\\w+)",
          "group": 1, 
          "flags":"UNICODE_CASE|UNICODE_CHARACTER_CLASS"
        }
      }
    }
  }
}

I get:

{
  "error": {
    "root_cause": [
      {
        "type": "index_creation_exception",
        "reason": "failed to create index"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Unknown regex flag [UNICODE_CHARACTER_CLASS]"
  },
  "status": 400
}

This is on ES 2.3.5


(Nik Everett) #6

Looks like only UNICODE_CHAR_CLASS is supported in 2.3: https://github.com/elastic/elasticsearch/blob/2.3/core/src/main/java/org/elasticsearch/common/regex/Regex.java#L153

And that was fixed in 5.0, which hasn't yet had a production release.
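
So on 2.3 the flag needs to be spelled UNICODE_CHAR_CLASS in the tokenizer settings - an untested sketch based on the settings above:

  "tokenizer": {
    "my_tokenizer": {
      "type": "pattern",
      "pattern": "(\\W|\\w+)",
      "group": 1,
      "flags": "UNICODE_CASE|UNICODE_CHAR_CLASS"
    }
  }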


(Josh Harrison) #7

Great, ok - we'll take another pass at this once 5.x is out and we've moved to it!

