Truncate token filter fails on some strings

Benjamin_Gathmann · February 16, 2017, 9:33am

I sometimes have very long strings that I don't want to analyze completely.
So I tested the truncate filter, but somehow it fails on some strings.

I use ES 2.4.

First, here is my custom analyzer:

"analysis": {
  "analyzer": {
    "analyzer_keyword": {
      "filter": ["lowercase", "customTruncateFilter"],
      "tokenizer": "keyword"
    }
  },
  "filter": {
    "customTruncateFilter": {
      "type": "truncate",
      "length": 150
    }
  }
}
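
To reason about what this analyzer emits, the chain can be approximated outside Elasticsearch. The following is only a rough Python sketch, not Elasticsearch code. The helper `analyze_keyword_truncate` and the sample input are hypothetical. It assumes the keyword tokenizer emits the whole input as a single token, the lowercase filter lowercases it, and the truncate filter keeps the first 150 characters. Python slices by code point, so results may differ from Lucene for characters outside the Basic Multilingual Plane.

```python
def analyze_keyword_truncate(text: str, length: int = 150) -> list:
    """Approximate: keyword tokenizer -> lowercase -> truncate(length)."""
    token = text            # keyword tokenizer: the whole input is one token
    token = token.lower()   # lowercase filter
    token = token[:length]  # truncate filter: keep the first `length` characters
    return [token]


# Made-up input, shaped like the long request lines discussed in this post.
sample = "GET /Example_" + "x" * 200 + "iii.php HTTP/1.1"
tokens = analyze_keyword_truncate(sample)

print(len(tokens[0]))      # length of the single emitted token
print("iii" in tokens[0])  # whether the tail of the input survived truncation
```

Under these assumptions, anything past character 150 never reaches the index, and the indexed token is lowercase, so a search term would need to be lowercased the same way to match.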

The following string is truncated correctly:

GET /advinion_chartsxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzziiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii.php HTTP/1.1

That is, I get a match for "xxxx", but not for "iiii".

The following string (length=255) is not:

Data Ascii: lGdj5WdmhCbhZXZ';function _0I0(data){var _1O0lOI="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";var o1,o2,o3,h1,h2,h3,h4,bits,i=0,enc='';do{h1=_1O0lOI.indexOf(data.charAt(i++));h2=_1O0lOI.indexOf(data.charAt(i++));h3=_1O0l

That is, if I search for "1O0l", the document still matches on this string.

I suspect it may be related to special characters in the string? (By the way, scripting is off, which is the default.)

What is also weird is that without the "truncate" filter, my cluster is smaller (68 vs. 70 GB).
I expected the cluster to be smaller, not larger, when truncating long strings.

Benjamin_Gathmann · February 16, 2017, 9:36am

Sorry, I just noticed that "1O0l" appears several times in the mentioned string, not just at the end.

So indeed, the token filter works.
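
A quick way to check this kind of thing is to list every offset at which the search term occurs and see which ones fall inside the first 150 characters. The snippet below is a small sketch under that assumption. `occurrences` is a hypothetical helper, and `prefix` is only the beginning of the second string from the first post, which is enough to show an early hit.

```python
def occurrences(text: str, needle: str) -> list:
    """Return every start offset of `needle` in `text`, ignoring case."""
    text, needle = text.lower(), needle.lower()
    hits = []
    start = text.find(needle)
    while start != -1:
        hits.append(start)
        start = text.find(needle, start + 1)
    return hits


LIMIT = 150  # the "length" configured for the truncate filter

# Beginning of the second example string (the rest is not needed here).
prefix = "Data Ascii: lGdj5WdmhCbhZXZ';function _0I0(data){var _1O0lOI="
needle = "1O0l"

hits = occurrences(prefix, needle)
# An occurrence survives truncation only if it ends within the first LIMIT chars.
survives = [h for h in hits if h + len(needle) <= LIMIT]
print(hits, survives)
```

Because the term already occurs well before the cutoff (inside `_1O0lOI`), a match on the truncated token is expected.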

But concerning my other question: why can a cluster with the truncate filter applied be bigger than one without it?

system · March 16, 2017, 9:36am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
