IllegalArgumentException: Document contains at least one immense term in field=“abc”.(whose UTF8 encoding is longer than the max length 32766)

dimalini · August 14, 2017, 2:48am

I have a field whose mapping looks like this:

    new TextProperty
    {
        Name = "allContent",
        Analyzer = "contentanalyzer",
        IndexOptions = IndexOptions.Offsets,
        Norms = false
    }

The analyzer I am using is a custom one that basically tokenizes with a regex. _source is enabled in my case.

During indexing, I get the following exception: java.lang.IllegalArgumentException: Document contains at least one immense term in field="allContent" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[48, 120, 52, 100, 53, 97, 57, 48, 48, 48...]...', original message: bytes can be at most 32766 in length; got 627762
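
The byte list in the message is the UTF-8 prefix of the offending term, so decoding it can hint at what kind of content produced the oversized token. A minimal sketch in Python (not part of the original post; the variable names are illustrative):

```python
# Prefix bytes copied verbatim from the exception message above.
prefix = bytes([48, 120, 52, 100, 53, 97, 57, 48, 48, 48])
print(prefix.decode("utf-8"))  # 0x4d5a9000

MAX_TERM_BYTES = 32766   # limit quoted in the exception
reported_bytes = 627762  # size quoted in the exception ("got 627762")
print(reported_bytes > MAX_TERM_BYTES)  # True
```

The decoded prefix suggests the term is one long hexadecimal string, which a regex-based analyzer could plausibly emit as a single token if its pattern places no bound on token length.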

I see that the ignore_above option could have been used if this field were of type keyword, but the field here is of type text.

What would be the solution here?

TIA ~Divya
[https://stackoverflow.com/questions/45653094/illegalargumentexception-document-contains-at-least-one-immense-term-in-field]

dimalini · August 14, 2017, 5:28am

Basically, it would be helpful if there were some way to say "ignore any term that is longer than the supported byte length".
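
One possibility, offered as a sketch and not as a confirmed fix from this thread: Elasticsearch has a `length` token filter that drops tokens outside a `min`/`max` range, and it can be attached to a custom analyzer on a text field. The filter counts characters, not bytes, so the cap below assumes the worst case of 4 UTF-8 bytes per character. The filter name `drop_immense_terms` and the `pattern` tokenizer are assumptions, since the thread does not show how `contentanalyzer` is defined.

```python
import json

MAX_TERM_BYTES = 32766        # limit quoted in the exception
MAX_UTF8_BYTES_PER_CHAR = 4   # worst case for a single code point
max_chars = MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR  # 8191

# Index settings body; only the "length" filter is the point of the sketch.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                # Hypothetical filter name; "length" is the built-in filter type.
                "drop_immense_terms": {"type": "length", "min": 0, "max": max_chars}
            },
            "analyzer": {
                "contentanalyzer": {
                    "type": "custom",
                    # Placeholder: the real tokenizer of the poster's analyzer is unknown.
                    "tokenizer": "pattern",
                    "filter": ["drop_immense_terms"],
                }
            },
        }
    }
}

print(json.dumps(index_settings, indent=2))
```

Tokens longer than `max_chars` would then be dropped at analysis time instead of reaching the indexer, while _source would still hold the full original text.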

rkalhans · August 14, 2017, 5:53am

Hello Divya,

One way to avoid this is to write the regex so that it runs only a bounded
number of iterations. This is particularly useful when working with unknown
content, since unbounded regex matches can lead to catastrophic failures
such as out-of-memory errors. In practice, a regex should not be allowed to
run an unlimited number of times, especially in production.

Check this link about limiting the matches to a limited length.

Here is an example.

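Regex: ((class) {1,3060}(\w{1,10})[ ]{1,3060})\{
This matches content such as class MyClass { and captures the keyword and the class name in separate groups. Note that \w{1,10} caps the class name at 10 characters, so a longer name will not match.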
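
To illustrate the bounded quantifiers, here is a small sketch using Python's `re` module (the thread itself does not name a regex engine, so this is only an approximation of how the analyzer's engine would behave). It also shows that `\w{1,10}` caps the class name at 10 characters, so an 11-character name such as MyClassName is not matched.

```python
import re

# Pattern from the post: every quantifier has an explicit upper bound.
BOUNDED = re.compile(r"((class) {1,3060}(\w{1,10})[ ]{1,3060})\{")

m = BOUNDED.search("class MyClass {")
print(m.group(2), m.group(3))  # class MyClass

# An 11-character name exceeds \w{1,10}, so nothing matches.
print(BOUNDED.search("class MyClassName {"))  # None

# A bounded quantifier also caps the length of each individual match.
chunks = re.findall(r"\w{1,10}", "a" * 25)
print([len(c) for c in chunks])  # [10, 10, 5]
```

The last call is the part most relevant to the exception: with an upper bound on the quantifier, no single match can grow past that bound, however long the input run is.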

You can check the explanation of the regex here: https://regex101.com/
H2H
~Rohit

system · September 11, 2017, 5:53am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
