Why don't N-gram tokenizers treat + (plus) characters as part of a token?

Hello everyone,

I have a few questions about the behavior of the N-Gram Tokenizer.
I am relatively new to this forum, so please forgive me if I'm asking this question the wrong way.

I created my own custom tokenizer to use with my custom analyzer.

The tokenizer looks as follows:

"titleNgramTokenizer": {
  "type": "ngram",
  "min_gram": 2,
  "max_gram": 2,
  "token_chars": []
}

The analyzer looks as follows:

"titleNgramAnalyzer": {
  "tokenizer": "titleNgramTokenizer",
  "filter": [
    "lowercase"
  ]
}

Generally everything works fine. But...

My questions:

  1. I would expect that all characters are treated as part of a token as it is described in the documentation.

Character classes that should be included in a token. [...] Defaults to [] (keep all characters).

However, all characters are treated as part of a token except the + character. The documentation of the N-Gram Tokenizer specifies under "custom_token_chars":

Custom characters that should be treated as part of a token. For example, setting this to +-_ will make the tokenizer treat the plus, minus and underscore sign as part of a token.

So I assume this behavior is known and implemented on purpose.
What is the reason for this implementation? Why is the + character treated differently? And why is it not included in any character class, such as "symbol"?

  2. In my implementation, the - (minus) character and the _ (underscore) character are treated as part of a token, but the + character is not. From the documentation quoted above (custom_token_chars), I would assume that none of the three characters +-_ is treated as part of a token by default.
    Any ideas why only the + character is not treated as part of a token? Or am I reading too much into this part of the documentation?

  3. My solution to this problem would be to define the + character as a custom token char, as described in the documentation quoted above. Are there any dangers in doing this (i.e. in treating the + character as part of a token)? I am not using regex or query_string queries.


Thanks in advance for your help!

Hi Phillip,
I think you may have misinterpreted the docs. Plus symbols are kept by default. The analyze API is useful for debugging this:

POST test/_analyze
{
  "analyzer": "titleNgramAnalyzer",
  "text": ["foo+bar"]
}

The docs for the token_chars param say:

Defaults to [] (keep all characters).

This means it keeps pluses, minuses and everything else.
When token_chars is a non-empty array, it controls which characters are kept (everything else is thrown away and splits the text). For example, the value ["letter"] would keep only letters, not numbers. The value "custom" means you list the characters you want in the "custom_token_chars" parameter; the example in the docs arbitrarily picks plus, minus and underscore. That particular choice might make more sense if your token_chars was set to ["digit", "custom"]: that way you'd get the numbers along with any negative/positive signs.
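To make the token_chars rule concrete, here is a rough Python model of it. This is only a sketch of the behaviour described above, not Elasticsearch's or Lucene's actual implementation: the names char_class and ngrams are made up for illustration, the character classes are approximated with Python's own Unicode checks, and the lowercase filter from the analyzer is left out.

```python
import unicodedata


def char_class(ch):
    """Approximate the token_chars class of a character (hypothetical helper)."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "whitespace"
    category = unicodedata.category(ch)
    if category.startswith("P"):
        return "punctuation"
    if category.startswith("S"):
        return "symbol"
    return "other"


def ngrams(text, min_gram, max_gram, token_chars=(), custom_token_chars=""):
    """Simplified model of an n-gram tokenizer with token_chars filtering."""

    def keep(ch):
        # An empty token_chars list means "keep all characters".
        if not token_chars:
            return True
        if "custom" in token_chars and ch in custom_token_chars:
            return True
        return char_class(ch) in token_chars

    # Split the text into runs of kept characters; dropped characters
    # act as boundaries between runs.
    runs, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))

    # Emit every gram of length min_gram..max_gram within each run.
    grams = []
    for run in runs:
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    grams.append(run[start:start + size])
    return grams


# Empty token_chars: the plus sign stays inside the grams.
print(ngrams("foo+bar", 2, 2))
# Only letters: the plus sign is dropped and splits the text.
print(ngrams("foo+bar", 2, 2, token_chars=["letter"]))
# Digits plus custom sign characters.
print(ngrams("a+1-2", 2, 2, token_chars=["digit", "custom"], custom_token_chars="+-"))
```

In this model the first call yields grams such as "o+" and "+b", while the second call never produces a gram containing the plus sign. Comparing that with the real output of the _analyze request above is a quick way to check whether your index actually uses the settings you think it does.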

