Create an analyzer to tokenize non-alphanumeric characters

I want a custom analyzer that can take a string like "((hello world!))" and give me a token list of:
["(", "(", "hello", "world", "!", ")", ")"]
That is to say, I basically want the "letter" tokenizer, but I want to keep the non-letter characters and tokenize each of them as a single-character token.

Is this feasible?

Hey there,

Have you considered using the pattern tokenizer to define your own regex for tokenization?

--Alex

Ah, ok - this isn't well documented, but to write a pattern that matches the tokens themselves (rather than the delimiters, which is the default), you have to set the "group" value!
I've put together the following:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(\\W|\\w+)",
          "group": 1,
          "flags": "UNICODE_CASE"
        }
      }
    }
  }
}
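
To sanity-check the tokens, I GET _analyze (request-body form) against the index that has these settings, with a body along these lines:

{
  "analyzer": "my_analyzer",
  "text": "((hello world!))"
}

which should come back with the parentheses, words, and exclamation mark as separate tokens. One caveat: the space between the words also matches \W, so I'd expect it to come back as its own single-character token; if that matters, a pattern like ([^\w\s]|\w+) should leave the whitespace out.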

I'd ideally like to use the UNICODE_CHARACTER_CLASS Java regex flag, but I get an error when I include it. Without it, values in Chinese, Japanese, etc. are treated as non-letters, so every character ends up as its own single-character token. Is there any way to handle this without using the CJK analyzer?
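
For example (the text here is just an arbitrary mix I'm testing with), an _analyze body like

{
  "analyzer": "my_analyzer",
  "text": "こんにちは world"
}

splits the Japanese part into one token per character, because without UNICODE_CHARACTER_CLASS the \w class only covers ASCII word characters.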

If you have an idea for improving the documentation for the tokenizer, could you open an issue? I'd certainly be happy to review it.

Both UNICODE_CHAR_CLASS and UNICODE_CHARACTER_CLASS should work. What error are you seeing?

When I attempt to PUT

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter":["lowercase"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(\\W|\\w+)",
          "group": 1, 
          "flags":"UNICODE_CASE|UNICODE_CHARACTER_CLASS"
        }
      }
    }
  }
}

I get:

{
  "error": {
    "root_cause": [
      {
        "type": "index_creation_exception",
        "reason": "failed to create index"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "Unknown regex flag [UNICODE_CHARACTER_CLASS]"
  },
  "status": 400
}

This is on ES 2.3.5.

Looks like only UNICODE_CHAR_CLASS is supported in 2.3: https://github.com/elastic/elasticsearch/blob/2.3/core/src/main/java/org/elasticsearch/common/regex/Regex.java#L153
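
So on 2.3.x the spelling to put in the flags string is UNICODE_CHAR_CLASS. Something like this should be accepted (I haven't run it against 2.3.5 myself, but it should map to the same underlying Pattern flag):

"tokenizer": {
  "my_tokenizer": {
    "type": "pattern",
    "pattern": "(\\W|\\w+)",
    "group": 1,
    "flags": "UNICODE_CASE|UNICODE_CHAR_CLASS"
  }
}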

And that was fixed in 5.0, which hasn't had a production release yet.

Great, ok - we'll take another pass at this once 5.x is out and we've moved to it!