Create an analyzer to tokenize non-alphanumeric characters


(Josh Harrison) #1

I want a custom analyzer that can take a string like "((hello world!))" and give me a token list of:
["(", "(", "hello", "world", "!", ")", ")"]
That is to say, I basically want the "letter" tokenizer, but I want to keep the non-letter characters and emit each of them as a single-character token.

Is this feasible?


(Alexander Reelsen) #2

Hey there,

have you considered using the pattern tokenizer to define your own regex for tokenization?

--Alex


(Josh Harrison) #3

Ah, ok - this isn't well documented, but to write a pattern that matches the tokens themselves (rather than the delimiters, which is the default) you have to set the "group" value!
I've put together the following:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(\\W|\\w+)",
          "group": 1,
          "flags": "UNICODE_CASE"
        }
      }
    }
  }
}
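
For reference, an _analyze request along these lines (assuming the index is named my_index; the request-body form of _analyze may differ slightly between ES versions) can be used to eyeball the resulting tokens:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "((hello world!))"
}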

I'd ideally like to use the UNICODE_CHARACTER_CLASS Java regex flag, but I get an error when I include it. Without it, text in Chinese, Japanese, etc. is treated as non-letters, so each character becomes its own single-character token. Is there any way to handle this without resorting to a CJK-specific analyzer?


(Nik Everett) #4

If you have an idea for improving the documentation for the tokenizer, could you open an issue? I'd certainly be happy to review it.

Both UNICODE_CHAR_CLASS and UNICODE_CHARACTER_CLASS should work. What error are you seeing?


(Josh Harrison) #5

When I attempt to PUT

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter":["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(\\W|\\w+)",
          "group": 1, 
          "flags":"UNICODE_CASE|UNICODE_CHARACTER_CLASS"
        }
      }
    }
  }
}

I get:

{
  "error": {
    "root_cause": [
      {
        "type": "index_creation_exception",
        "reason": "failed to create index"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Unknown regex flag [UNICODE_CHARACTER_CLASS]"
  },
  "status": 400
}

This is on ES 2.3.5


(Nik Everett) #6

Looks like only UNICODE_CHAR_CLASS is supported in 2.3: https://github.com/elastic/elasticsearch/blob/2.3/core/src/main/java/org/elasticsearch/common/regex/Regex.java#L153

And that was fixed in 5.0, which hasn't yet had a production release.
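
So on 2.3 the flag needs to be spelled UNICODE_CHAR_CLASS in the tokenizer settings - an untested sketch based on the settings above:

  "tokenizer": {
    "my_tokenizer": {
      "type": "pattern",
      "pattern": "(\\W|\\w+)",
      "group": 1,
      "flags": "UNICODE_CASE|UNICODE_CHAR_CLASS"
    }
  }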


(Josh Harrison) #7

Great, ok - we'll take another pass at this once 5.x is out and we've moved to it!

