Hello everyone,
I have a few questions about the behavior of the N-Gram Tokenizer.
I am relatively new to this forum, so please forgive me if I'm asking this question in the wrong way.
I created a custom tokenizer to use with my custom analyzer.
The tokenizer looks as follows:
"titleNgramTokenizer": {
  "type": "ngram",
  "min_gram": 2,
  "max_gram": 2,
  "token_chars": [ ]
}
The analyzer looks as follows:
"titleNgramAnalyzer": {
  "tokenizer": "titleNgramTokenizer",
  "filter": [
    "lowercase"
  ]
}
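For anyone who wants to reproduce what I am seeing, a request body along these lines can be sent to the _analyze API (POST _analyze) with the same tokenizer settings defined inline. This is only a minimal sketch: the sample text "a+b-c_d" is made up for illustration, and I am deliberately not listing expected tokens here.

```json
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 2,
    "token_chars": []
  },
  "filter": ["lowercase"],
  "text": "a+b-c_d"
}
```

The response lists each 2-gram with its offsets, which makes it easy to check whether a token containing + shows up.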
Generally everything works fine. But...
My questions:
- I would expect all characters to be treated as part of a token, as described in the documentation:
Character classes that should be included in a token. [...] Defaults to [] (keep all characters).
However, all characters are treated as part of a token except the + character. The documentation of the N-Gram Tokenizer specifies under "custom_token_chars":
Custom characters that should be treated as part of a token. For example, setting this to +-_ will make the tokenizer treat the plus, minus and underscore sign as part of a token.
So I assume this behavior is known and implemented on purpose.
What is the reason for this implementation? Why is the + character treated differently? And why is it not included in any character class, such as "symbol"?
- In my implementation, the - (minus) character and the _ character are treated as part of a token, but the + character is not. From the part of the documentation quoted above (custom_token_chars), I would assume that none of the three characters +-_ would be treated as part of a token.
Any ideas on why only the + character is not treated as part of a token? Or am I reading too much into this part of the documentation?
- My solution to this problem would be to define the + character as a custom token char, as described in the documentation quoted above. Are there any dangers in doing this (i.e. in treating the + character as part of a token)? I am not using regex or query_string queries.
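The workaround I have in mind would look roughly like the index settings below. This is an untested sketch that rests on one assumption: as far as I understand, "custom_token_chars" only takes effect when "custom" is listed in "token_chars", so the other character classes are listed explicitly instead of leaving the array empty. The explicit "type": "custom" on the analyzer is also my addition.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "titleNgramTokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "token_chars": ["letter", "digit", "whitespace", "punctuation", "symbol", "custom"],
          "custom_token_chars": "+"
        }
      },
      "analyzer": {
        "titleNgramAnalyzer": {
          "type": "custom",
          "tokenizer": "titleNgramTokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

One thing I am unsure about is whether listing the classes explicitly drops characters that the empty default would have kept, since any character outside the listed classes would no longer be part of a token.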
Thanks in advance for your help!