Truncate token filter fails on some strings


(Benjamin Gathmann) #1

I sometimes have very long strings that I don't want to analyze completely.
So I tested the truncate filter, but somehow it fails on some strings.

I use ES 2.4.

First, here is my custom analyzer:

"analysis": {
       "analyzer": {
         "analyzer_keyword": {
           "filter": ["lowercase","customTruncateFilter"],
           "tokenizer": "keyword"
         }
       },
	"filter": {
		"customTruncateFilter": {
			"type":"truncate",
			"length": 150
		}				
	}
     }

The following string is truncated correctly:

GET /advinion_chartsxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzziiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii.php HTTP/1.1

That is, I get a match for "xxxx", but not for "iiii".

The following string (length = 255) is not:

Data Ascii: lGdj5WdmhCbhZXZ';function _0I0(data){var _1O0lOI="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";var o1,o2,o3,h1,h2,h3,h4,bits,i=0,enc='';do{h1=_1O0lOI.indexOf(data.charAt(i++));h2=_1O0lOI.indexOf(data.charAt(i++));h3=_1O0l

That is, if I search for "1O0l", which appears at the end of the string, the document still matches.

I suspect it may be related to special characters in the string. (By the way, scripting is off, which is the default.)

What is also weird is that without the "truncate" filter, my cluster is smaller (68 GB vs. 70 GB).
I expected it to be smaller when truncating long strings.


(Benjamin Gathmann) #2

Sorry, I just noticed that "1O0l" appears several times in the mentioned string, not just at the end.

So indeed, the token filter works. :)
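For anyone who wants to check this without a cluster, here is a rough Python sketch that only imitates what the analyzer does (keyword tokenizer, then lowercase, then truncate to 150). The function name `analyze_keyword` is made up for illustration, and plain string slicing is just an approximation of the truncate filter that should hold for ASCII text like this sample.

```python
def analyze_keyword(text: str, length: int = 150) -> str:
    """Approximate: keyword tokenizer -> lowercase -> truncate(length)."""
    # The keyword tokenizer emits the whole input as one single token.
    token = text
    # The lowercase filter comes first in the analyzer's filter list.
    token = token.lower()
    # The truncate filter keeps only the first `length` characters.
    return token[:length]


# The second sample string from post #1, copied verbatim.
sample = r"""Data Ascii: lGdj5WdmhCbhZXZ';function _0I0(data){var _1O0lOI="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";var o1,o2,o3,h1,h2,h3,h4,bits,i=0,enc='';do{h1=_1O0lOI.indexOf(data.charAt(i++));h2=_1O0lOI.indexOf(data.charAt(i++));h3=_1O0l"""

token = analyze_keyword(sample)
needle = "1O0l".lower()  # the indexed token is lowercased, so compare lowercased

first_hit = sample.lower().find(needle)

print("token length after truncate:", len(token))
print("first position of search term:", first_hit)
print("search term survives truncation:", needle in token)
```

The first occurrence of the search term falls well inside the first 150 characters, so the truncated token still contains it, which is why the document keeps matching.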

But regarding my other question: why can a cluster with the truncate filter applied be bigger than one without it?