Pattern analyzer does not respect max_token_length


(Eric Miller) #1

I'm testing with ES 2.3.5. My tests show that the pattern analyzer does not respect the max_token_length value.

First I try the standard analyzer.

POST /_analyze
{
  "analyzer": "standard",
  "text": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
}

In the result I see the default 255 char max_token_length value applied.

{
  "tokens": [
    {
      "token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "start_offset": 0,
      "end_offset": 255,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy",
      "start_offset": 255,
      "end_offset": 295,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

I try the same thing with the pattern analyzer.

POST /_analyze
{
  "analyzer": "pattern",
  "text": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
}

The result is a single 295-character token, which exceeds the 255-character limit.

{
  "tokens": [
    {
      "token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy",
      "start_offset": 0,
      "end_offset": 295,
      "type": "word",
      "position": 0
    }
  ]
}

I see the same failure when I explicitly set max_token_length, and when I define my own pattern analyzers and tokenizers with a max_token_length value.
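For reference, this is the kind of custom analyzer definition I mean (the index, tokenizer, and analyzer names here are just examples); the max_token_length setting is likewise ignored:

PUT /test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_pattern_tokenizer": {
          "type": "pattern",
          "pattern": "\\W+",
          "max_token_length": 255
        }
      },
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "custom",
          "tokenizer": "my_pattern_tokenizer"
        }
      }
    }
  }
}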

The big problem I'm trying to solve is that some of my not_analyzed data has values longer than 32,766 bytes. This triggers a Lucene error because the terms exceed Lucene's maximum term length (as discussed in the Stack Overflow question "UTF-8 Encoding Longer Than Max"). I still want some search and sort capability on these long field values, even if it is not perfect.
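One workaround I'm considering (not from the pattern analyzer itself) is a keyword tokenizer combined with a truncate token filter, so the single indexed term stays under Lucene's limit while keeping prefix-based search and sort. The names and length below are just examples; the length should be chosen conservatively, since the Lucene limit is in bytes and multi-byte UTF-8 characters count more than one:

PUT /long_values
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_long": {
          "type": "truncate",
          "length": 8000
        }
      },
      "analyzer": {
        "truncated_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["truncate_long"]
        }
      }
    }
  }
}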


(Nik Everett) #2

Looks like a bug to me. I filed

The setting is documented as defaulting to 255, but we're not splitting there. I took a quick look at the code and didn't see anything about a maximum token length either, though I certainly could have missed it.


(system) #3