Pattern analyzer does not respect max_token_length


(Eric Miller) #1

I'm testing with ES 2.3.5. My tests show that the pattern analyzer does not respect the max_token_length value.

First I try the standard analyzer.

POST /_analyze
{
  "analyzer": "standard",
  "text": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
}

In the result I see the default 255 char max_token_length value applied.

{
  "tokens": [
    {
      "token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy",
      "start_offset": 255,
      "end_offset": 295,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

I try the same thing with the pattern analyzer.

POST /_analyze
{
  "analyzer": "pattern",
  "text": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
}

The result is a single 295-character token, which exceeds the 255-character limit.

{
  "tokens": [
    {
      "token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy",
      "start_offset": 0,
      "end_offset": 295,
      "type": "word",
      "position": 0
    }
  ]
}

I see the same failure when I explicitly set max_token_length, and when I define my own pattern analyzers and tokenizers with a max_token_length value.
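For reference, this is the kind of custom analyzer definition I mean (the index, tokenizer, and analyzer names here are just examples); the max_token_length setting is likewise ignored:

PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+",
          "max_token_length": 255
        }
      },
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  }
}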

The big problem I'm trying to solve is that some of my not_analyzed data has values longer than 32,766 bytes. This triggers a Lucene error because the terms exceed Lucene's maximum term length (as discussed in the Stack Overflow question "UTF-8 Encoding Longer Than Max"). I still want some search and sort capability on these long field values, even if it is not perfect.
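One workaround I'm considering (not from the pattern analyzer itself) is a keyword tokenizer combined with a truncate token filter, so the single indexed term stays under Lucene's limit while keeping prefix-based search and sort. The names and length below are just examples; the length should be chosen conservatively, since the Lucene limit is in bytes and multi-byte UTF-8 characters count more than one:

PUT /long_values
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_long": {
          "type": "truncate",
          "length": 8000
        }
      },
      "analyzer": {
        "truncated_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["truncate_long"]
        }
      }
    }
  }
}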


(Nik Everett) #2

Looks like a bug to me. I filed

The setting is documented as defaulting to 255, but we're not splitting there. I took a quick look at the code and didn't see anything about a maximum token length either, though I certainly could have missed it.


(system) #3