How to limit token length?

animageofmine · March 25, 2017, 3:49pm

Whether the field is analyzed or not, we want to limit the length of a single token in ES. Is there a way to enforce this?

E.g.

"Quick brown fox SomeTextWithoutSpacesExceedingLimit" to "Quick brown fox SomeTextTruncatedWithConfiguredLimit".
"SomeTextWithoutSpacesExceedingLimit" to "SomeTextTruncatedWithConfiguredLimit"

Here we want to truncate "SomeTextWithoutSpacesExceedingLimit" to some configurable limit.

Thank you.

nugusbayevkk · March 25, 2017, 7:30pm

Hi, animageofmine

I don't understand what task you are solving.
But you can use the pattern tokenizer or the ngram tokenizer.

Here is an example. You can use a Java regular expression to define where tokens are split. This pattern splits your text into tokens that each begin with an uppercase letter: "pattern": "(?=\p{Upper})"

curl -XPUT localhost:9200/test_token_upper -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?=\\p{Upper})"
        }
      }
    }
  }
}'

With _analyze you can test how this pattern behaves:

curl -XPOST localhost:9200/test_token_upper/_analyze?pretty -d '{
"analyzer": "my_analyzer",
"text": "AaBbZzzzzz"
}'

nugusbayevkk · March 25, 2017, 7:31pm

As a result you will get:

~$ curl -XPOST localhost:9200/test_token_upper/_analyze?pretty -d '{
"analyzer": "my_analyzer",
"text": "AaBbZzzzzz"
}'
{
  "tokens" : [
    {
      "token" : "Aa",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Bb",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Zzzzzz",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "word",
      "position" : 2
    }
  ]
}

nugusbayevkk · March 25, 2017, 7:46pm

Or you can use the standard tokenizer, which has a max_token_length parameter. A token longer than that length is split at max_token_length intervals:

      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }

animageofmine · March 27, 2017, 6:16pm

@nugusbayevkk

Thank you. Somehow my original message was altered because I used "<" and ">" in my examples. I have just fixed it.

max_token_length seems to be the closest to what we want. However, it splits a token once it reaches the maximum token length. We want to truncate the token instead, since we don't care about the rest of it.

Use case:
Sometimes customers send in random gibberish text that does not make any sense, e.g. "asdfasdf....." of maybe 1 MB, or text in an unsupported language. We simply want to truncate such values instead of analyzing them.
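One option for the truncation described here, offered as a sketch rather than an answer confirmed in this thread, is Elasticsearch's built-in truncate token filter, whose length parameter caps each token at a fixed number of characters. The index name test_token_truncate, the names my_truncate and my_analyzer, and the limit of 10 are placeholders chosen for illustration:

```shell
# Create an index whose analyzer cuts every token down to 10 characters.
# Index, filter and analyzer names are hypothetical; adjust "length" as needed.
curl -XPUT 'localhost:9200/test_token_truncate' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_truncate"]
        }
      },
      "filter": {
        "my_truncate": {
          "type": "truncate",
          "length": 10
        }
      }
    }
  }
}'

# Inspect the tokens produced for the example text from the question.
curl -XPOST 'localhost:9200/test_token_truncate/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "analyzer": "my_analyzer",
  "text": "Quick brown fox SomeTextWithoutSpacesExceedingLimit"
}'
```

Note that the standard tokenizer still splits anything longer than its own max_token_length (255 by default) before the filter runs, so a very long run of characters would probably yield several truncated tokens rather than one. If over-long tokens should be dropped rather than shortened, the length token filter with a max setting is the analogous filter to try.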

