Custom analyzer with standard tokenizer is splitting long tokens instead of discarding

mikeb7986 · March 16, 2016, 3:46pm

I have a custom mapping that, according to the standard tokenizer documentation, should be discarding this text: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-standard-tokenizer.html

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.

Here's a sample request to the _analyze API:

POST http://localhost:9200/_analyze
{
    "tokenizer": "standard",
    "filters": ["standard", "lowercase"],
    "text": "JTNDZGl2JTIwY2xhc3MlM0QlMjJ3cGJfdmlkZW9fd2lkZ2V0JTIwd3BiX2NvbnRlbnRfZWxlbWVudCUyMiUzRSUwQSUzQ2RpdiUyMGNsYXNzJTNEJTIyd3BiX3dyYXBwZXIlMjIlM0UlM0NkaXYlMjBjbGFzcyUzRCUyMndwYl92aWRlb193cmFwcGVyJTIyJTNFJTNDaWZyYW1lJTIwd2lkdGglM0QlMjI2MjUlMjIlMjBoZWlnaHQlM0QlMjIzNTIlMjIlMjBzcmMlM0QlMjJodHRwJTNBJTJGJTJGd3d3LnlvdXR1YmUuY29tJTJGZW1iZWQlMkZ4Nld6eVVnYlQ1QSUzRmZlYXR1cmUlM0RvZW1iZWQlMjIlMjBmcmFtZWJvcmRlciUzRCUyMjAlMjIlMjBhbGxvd2Z1bGxzY3JlZW4lM0UlM0MlMkZpZnJhbWUlM0UlM0MlMkZkaXYlM0UlMEElM0MlMkZkaXYlM0UlMjAlM0MlMkZkaXYlM0U"
}

This shows that the text actually gets split; however, the documentation of the standard tokenizer indicates it should be discarded. It seems to be doing what the standard analyzer documentation says happens to long tokens: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-standard-analyzer.html

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
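To make the difference between the two documented behaviours concrete, here is a minimal Python sketch. It is not Elasticsearch or Lucene code, and the function names are made up for illustration; it only mimics "discard" versus "split at max_token_length intervals" for a single token:

```python
def discard_long_token(token, max_token_length=255):
    """Behaviour described in the tokenizer docs: drop a token that is too long."""
    return [token] if len(token) <= max_token_length else []


def split_long_token(token, max_token_length=255):
    """Behaviour described in the analyzer docs: cut the token into
    consecutive chunks of at most max_token_length characters."""
    return [
        token[i:i + max_token_length]
        for i in range(0, len(token), max_token_length)
    ]


long_token = "x" * 600  # stand-in for the long encoded string in the request

print(len(discard_long_token(long_token)))            # 0 tokens survive
print([len(t) for t in split_long_token(long_token)])  # [255, 255, 90]
```

With splitting, the long string still ends up in the index as several 255-character tokens plus a shorter remainder, which matches what the _analyze output shows.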

johtani · March 18, 2016, 10:00am

Hi @mikeb7986

You are right. Thanks for reporting!
We have just updated the documentation; see https://github.com/elastic/elasticsearch/commit/dc21ab75768ac9259ba8bf72d2d878e4e476de5a

The behaviour of max_token_length changed in ES 1.4.

FYI: if an index created before ES 1.4 is used on ES 1.5+, its old segments still have the old behaviour.

mikeb7986 · March 18, 2016, 3:38pm

Well, that makes more sense. @johtani, is there a way to replicate the older functionality with the current version, some combination of filters perhaps?

I see that I can specify an older version of Lucene on the tokenizer/analyzer, but I wonder what other ramifications that has? Really what we want is to discard really large tokens. Any suggestions? Thanks!

johtani · March 19, 2016, 3:15pm

You can use the limit token filter.

See : https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-limit-token-count-tokenfilter.html

Example :

POST http://localhost:9200/_analyze
{
    "tokenizer": "standard",
    "filters": ["limit", "lowercase"],
    "text": "JTNDZGl2JTIwY2xhc3MlM0QlMjJ3cGJfdmlkZW9fd2lkZ2V0JTIwd3BiX2NvbnRlbnRfZWxlbWVudCUyMiUzRSUwQSUzQ2RpdiUyMGNsYXNzJTNEJTIyd3BiX3dyYXBwZXIlMjIlM0UlM0NkaXYlMjBjbGFzcyUzRCUyMndwYl92aWRlb193cmFwcGVyJTIyJTNFJTNDaWZyYW1lJTIwd2lkdGglM0QlMjI2MjUlMjIlMjBoZWlnaHQlM0QlMjIzNTIlMjIlMjBzcmMlM0QlMjJodHRwJTNBJTJGJTJGd3d3LnlvdXR1YmUuY29tJTJGZW1iZWQlMkZ4Nld6eVVnYlQ1QSUzRmZlYXR1cmUlM0RvZW1iZWQlMjIlMjBmcmFtZWJvcmRlciUzRCUyMjAlMjIlMjBhbGxvd2Z1bGxzY3JlZW4lM0UlM0MlMkZpZnJhbWUlM0UlM0MlMkZkaXYlM0UlMEElM0MlMkZkaXYlM0UlMjAlM0MlMkZkaXYlM0U"
}
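One caveat worth checking: the limit filter caps the number of tokens (max_token_count, default 1) rather than looking at token length. If the goal is to drop tokens by length, the length token filter (with min and max settings) may be the closer fit. The Python sketch below is a rough illustration, not a tested configuration. The names long_token_tokenizer, drop_long_tokens and discard_long are hypothetical, and the value 10000 for max_token_length is an assumption that should be verified against the Elasticsearch version in use:

```python
import json


def limit_filter(tokens, max_token_count=1):
    """Mimics a token-count cap: keep only the first max_token_count tokens."""
    return tokens[:max_token_count]


def length_filter(tokens, min_len=0, max_len=255):
    """Mimics a length-based filter: keep tokens whose length is in range."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = ["x" * 600, "hello", "world"]
print([len(t) for t in limit_filter(tokens)])   # [600]: the long token is kept
print([len(t) for t in length_filter(tokens)])  # [5, 5]: the long token is dropped

# Hypothetical index settings (all custom names are made up; values are assumptions).
# The tokenizer limit is raised so long tokens reach the filter in one piece.
settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "long_token_tokenizer": {
                    "type": "standard",
                    "max_token_length": 10000,  # assumed to be accepted
                }
            },
            "filter": {
                "drop_long_tokens": {"type": "length", "max": 255}
            },
            "analyzer": {
                "discard_long": {
                    "type": "custom",
                    "tokenizer": "long_token_tokenizer",
                    "filter": ["drop_long_tokens", "lowercase"],
                }
            },
        }
    }
}
print(json.dumps(settings, indent=2))
```

If the tokenizer has already split a long token into 255-character chunks, a length filter only sees the chunks, and a short final chunk would still pass. That is why the sketch raises the tokenizer's max_token_length above the filter's max.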
