How to limit token length?

Whether a field is analyzed or not, we want to limit the length of a single token in ES. Is there a way to enforce this?

E.g.

  1. "Quick brown fox SomeTextWithoutSpacesExceedingLimit" to "Quick brown fox SomeTextTruncatedWithConfiguredLimit".
  2. "SomeTextWithoutSpacesExceedingLimit" to "SomeTextTruncatedWithConfiguredLimit"

Here we want to truncate "SomeTextWithoutSpacesExceedingLimit" to some configurable limit.

Thank you.

Hi, animageofmine

I don't fully understand what task you are solving, but you can use the pattern tokenizer or the ngram tokenizer:
https://www.elastic.co/guide/en/elasticsearch/reference/5.2/analysis-pattern-tokenizer.html#analysis-pattern-tokenizer

Here is an example. You can use a Java regular expression to define where your tokens are split. This pattern splits the text so that each token begins with an uppercase letter: "pattern": "(?=\p{Upper})"

curl -XPUT localhost:9200/test_token_upper -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "(?=\\p{Upper})"
        }
      }
    }
  }
}'

You can test how this pattern behaves with _analyze:

curl -XPOST localhost:9200/test_token_upper/_analyze?pretty -d '{
"analyzer": "my_analyzer",
"text": "AaBbZzzzzz"
}'

This returns:
{
  "tokens" : [
    {
      "token" : "Aa",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Bb",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Zzzzzz",
      "start_offset" : 4,
      "end_offset" : 10,
      "type" : "word",
      "position" : 2
    }
  ]
}

Alternatively, you can use the standard tokenizer, which has a max_token_length parameter. A token that exceeds this length is split at max_token_length intervals:

"tokenizer": {
  "my_tokenizer": {
    "type": "standard",
    "max_token_length": 5
  }
}

@nugusbayevkk

Thank you. Somehow my message was altered because I used "<" and ">" in my examples. I just fixed it.

max_token_length seems to be the closest to what we want. However, it splits the token once it reaches the maximum token length. We want to truncate the token instead, since we don't care about the rest of it.

Use Case:
Sometimes customers send in random gibberish text that does not make any sense,
e.g. "asdfasdf....." of maybe 1 MB, or some text in an unsupported language. We simply want to truncate such values instead of analyzing them.
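
One possible way to get truncation rather than splitting is the built-in truncate token filter, which cuts every token down to a configured length. The following is only a minimal sketch: the index name test_token_truncate, the names my_truncating_analyzer and my_truncate, and the limit of 10 are assumptions chosen for illustration, not something from this thread.

```shell
# Hypothetical index; the analyzer and filter names are illustrative only.
curl -XPUT 'localhost:9200/test_token_truncate' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_truncating_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_truncate"]
        }
      },
      "filter": {
        "my_truncate": {
          "type": "truncate",
          "length": 10
        }
      }
    }
  }
}'

# Check how the analyzer treats an over-long token.
curl -XPOST 'localhost:9200/test_token_truncate/_analyze?pretty' -H 'Content-Type: application/json' -d '{
  "analyzer": "my_truncating_analyzer",
  "text": "Quick brown fox SomeTextWithoutSpacesExceedingLimit"
}'
```

With these settings the short tokens should pass through unchanged and the last token should come back as "SomeTextWi". Note that the filter runs after the tokenizer, so the standard tokenizer still splits any run longer than its own max_token_length (255 by default) before truncation. A very long gibberish value would therefore probably produce several truncated tokens rather than a single one.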
