Keyword/Ignore_above seems to be ignored with Custom Analyzer

I've seen a historical post (closed) without resolution on this - apologies if I'm whipping a dead horse.

tl;dr: implementing a custom analyzer leads to ingestion-pipeline errors of the form "Document contains at least one immense term in ......"

Not applying the custom analyzer to the field does not generate this error. Assigning ignore_above to this field is not an option because its "type" is "text".

I recently had to set up a custom analyzer (truncated a bit for this post):

"settings": {
  "mapping": {
    "total_fields": {
      "limit": "2000"
    },
    "ignore_malformed": "true"
  },
  "analysis": {
    "filter": {
      "hyphen_delimiter": {
        "split_on_numerics": "false",
        "generate_word_parts": "true",
        "preserve_original": "true",
        "catenate_words": "false",
        "generate_number_parts": "true",
        "catenate_all": "false",
        "split_on_case_change": "false",
        "type": "word_delimiter",
        "catenate_numbers": "false",
        "stem_english_possessive": "false"
      }
    },
    "analyzer": {
      "slwhitespace": {
        "filter": [
          "lowercase",
          "asciifolding",
          "hyphen_delimiter"
        ],
        "type": "custom",
        "tokenizer": "punctuation"
      }
    },
    "tokenizer": {
      "punctuation": {
        "pattern": "[ ~`>=<;:,'&%#!\"\\?\\+\\*\\|\\{\\}\\[\\]\\(\\)\\\\\\^\\$\r\n\t]",
        "type": "pattern"
      }
    }
  }

When the applied mapping references the custom analyzer:

"documentcontent": {
  "properties": {
    "content": {
      "type": "text",
      "norms": false,
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      },
      "analyzer": "slwhitespace"
    }
  }
}

A test document containing an exceptionally long string results in the following error:
{ServerError: 400Type: illegal_argument_exception Reason: "Document contains at least one immense term in field="documentcontent.content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[49, 50, 103, 101, 114, 115, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120]...', original message: bytes can be at most 32766 in length; got 1853700" CausedBy: "Type: max_bytes_length_exceeded_exception Reason: "bytes can be at most 32766 in length; got 1853700""}

When no analyzer is specified:

"documentcontent": {
  "properties": {
    "content": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

No error is generated; the field is ignored as expected.

Any recommendations or insight into this behavior would be greatly appreciated.

-N

Note: running an older version of ES, 5.4.1.

The error comes from Lucene when storing the outputs of the Analyzer in the index.
It's a sensible safeguard against massive terms.
I'd recommend adjusting your Analyzer definition to apply a length filter to avoid issues with rogue content in documents.
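
As a rough sketch of that suggestion, a "length" token filter could be appended to the analyzer's filter chain. The filter name "drop_immense_terms" and the "max" value of 8191 are assumptions for illustration, not values from this thread. The "length" filter counts characters while Lucene's 32766 limit is measured in UTF-8 bytes, so the cap is kept well below 32766. The fragment would need to be merged into the existing analysis settings, where "hyphen_delimiter" and "punctuation" are already defined.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_immense_terms": {
          "type": "length",
          "max": 8191
        }
      },
      "analyzer": {
        "slwhitespace": {
          "type": "custom",
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "asciifolding",
            "hyphen_delimiter",
            "drop_immense_terms"
          ]
        }
      }
    }
  }
}
```

Placing the filter last means every term that reaches the index has already been checked, including anything the earlier filters produced. Note that ignore_above on the "keyword" sub-field only applies to that sub-field; the error is raised for the analyzed documentcontent.content field itself, so the cap has to live in the analyzer.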
