Custom analyzer with standard tokenizer is splitting long tokens instead of discarding


(Mike Barker) #1

I have a custom mapping with an analyzer that, based on the documentation, should be discarding overly long tokens. See https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-standard-tokenizer.html :

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.

Here's a sample request to the _analyze method:

POST http://localhost:9200/_analyze
{
    "tokenizer": "standard",
    "filters": ["standard", "lowercase"],
    "text": "JTNDZGl2JTIwY2xhc3MlM0QlMjJ3cGJfdmlkZW9fd2lkZ2V0JTIwd3BiX2NvbnRlbnRfZWxlbWVudCUyMiUzRSUwQSUzQ2RpdiUyMGNsYXNzJTNEJTIyd3BiX3dyYXBwZXIlMjIlM0UlM0NkaXYlMjBjbGFzcyUzRCUyMndwYl92aWRlb193cmFwcGVyJTIyJTNFJTNDaWZyYW1lJTIwd2lkdGglM0QlMjI2MjUlMjIlMjBoZWlnaHQlM0QlMjIzNTIlMjIlMjBzcmMlM0QlMjJodHRwJTNBJTJGJTJGd3d3LnlvdXR1YmUuY29tJTJGZW1iZWQlMkZ4Nld6eVVnYlQ1QSUzRmZlYXR1cmUlM0RvZW1iZWQlMjIlMjBmcmFtZWJvcmRlciUzRCUyMjAlMjIlMjBhbGxvd2Z1bGxzY3JlZW4lM0UlM0MlMkZpZnJhbWUlM0UlM0MlMkZkaXYlM0UlMEElM0MlMkZkaXYlM0UlMjAlM0MlMkZkaXYlM0U"
}

The response shows that the text actually gets split, whereas the standard tokenizer documentation says it should be discarded. The behaviour matches what the standard analyzer documentation describes for long tokens, see https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-standard-analyzer.html :

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
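
To make the difference between the two documented behaviours concrete, here is a minimal Python sketch. It is only a toy model: the function names `tokenize_split` and `tokenize_discard` are hypothetical, tokens are found by splitting on whitespace, and none of this is the actual Lucene or Elasticsearch implementation.

```python
def tokenize_split(text, max_token_length=255):
    """Toy model of 'split': cut each over-long word into chunks
    of at most max_token_length characters."""
    tokens = []
    for word in text.split():
        for start in range(0, len(word), max_token_length):
            tokens.append(word[start:start + max_token_length])
    return tokens


def tokenize_discard(text, max_token_length=255):
    """Toy model of 'discard': drop any word longer than max_token_length."""
    return [word for word in text.split() if len(word) <= max_token_length]


# One short word followed by a 600-character word (assumed sample input).
sample = "short " + "x" * 600

split_tokens = tokenize_split(sample)
discard_tokens = tokenize_discard(sample)

print([len(t) for t in split_tokens])  # [5, 255, 255, 90]
print(discard_tokens)                  # ['short']
```

With splitting, the long word survives as three chunks; with discarding, it disappears entirely, which is the behaviour I was expecting.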


(Jun Ohtani) #2

Hi @mikeb7986

You are right. Thanks for reporting!
We have just updated the documentation. See https://github.com/elastic/elasticsearch/commit/dc21ab75768ac9259ba8bf72d2d878e4e476de5a

The behaviour of max_token_length changed in ES 1.4.

FYI: If an index created before ES 1.4 is used on ES 1.5+, its old segments still have the old behaviour.


(Mike Barker) #3

Well, that makes more sense. @johtani, is there a way to replicate the older functionality with the current version, perhaps with some combination of filters?

I see that I can specify an older version of Lucene on the tokenizer/analyzer, but I wonder what other ramifications that would have. What we really want is to discard very large tokens. Any suggestions? Thanks!


(Jun Ohtani) #4

You can use the limit token filter.

See: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-limit-token-count-tokenfilter.html

Example:

POST http://localhost:9200/_analyze
{
    "tokenizer": "standard",
    "filters": ["limit", "lowercase"],
    "text": "JTNDZGl2JTIwY2xhc3MlM0QlMjJ3cGJfdmlkZW9fd2lkZ2V0JTIwd3BiX2NvbnRlbnRfZWxlbWVudCUyMiUzRSUwQSUzQ2RpdiUyMGNsYXNzJTNEJTIyd3BiX3dyYXBwZXIlMjIlM0UlM0NkaXYlMjBjbGFzcyUzRCUyMndwYl92aWRlb193cmFwcGVyJTIyJTNFJTNDaWZyYW1lJTIwd2lkdGglM0QlMjI2MjUlMjIlMjBoZWlnaHQlM0QlMjIzNTIlMjIlMjBzcmMlM0QlMjJodHRwJTNBJTJGJTJGd3d3LnlvdXR1YmUuY29tJTJGZW1iZWQlMkZ4Nld6eVVnYlQ1QSUzRmZlYXR1cmUlM0RvZW1iZWQlMjIlMjBmcmFtZWJvcmRlciUzRCUyMjAlMjIlMjBhbGxvd2Z1bGxzY3JlZW4lM0UlM0MlMkZpZnJhbWUlM0UlM0MlMkZkaXYlM0UlMEElM0MlMkZkaXYlM0UlMjAlM0MlMkZkaXYlM0U"
}
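
For anyone comparing options, here is a rough Python sketch of what two token filters do to a token stream. As documented, the limit filter caps the number of tokens (max_token_count, default 1), while the separate length filter drops tokens by character length (min/max). The functions `limit_filter` and `length_filter` below are hypothetical stand-ins for illustration, not Elasticsearch code, and the input chunks are an assumed example of a 600-character token already split at 255.

```python
def limit_filter(tokens, max_token_count=1):
    """Toy model of the limit token count filter: keep only the
    first max_token_count tokens."""
    return tokens[:max_token_count]


def length_filter(tokens, min_length=0, max_length=255):
    """Toy model of the length token filter: keep tokens whose
    length lies between min_length and max_length inclusive."""
    return [t for t in tokens if min_length <= len(t) <= max_length]


# Assumed stream: a 600-character token already split into 255-char chunks,
# surrounded by two ordinary words.
chunks = ["a" * 255, "b" * 255, "c" * 90]
tokens = ["hello"] + chunks + ["world"]

limited = limit_filter(tokens)
length_filtered = length_filter(tokens, max_length=100)

print(limited)                            # ['hello']
print([len(t) for t in length_filtered])  # [5, 90, 5]
```

Note that in this model a length-based filter still lets the short tail chunk through, because the split happens before the filter runs. Whether either filter fits depends on how max_token_length is set on the tokenizer.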
