Encoding is longer than the max length 32766

We’re running into a peculiar issue when updating indexes with content for
the document.

"document contains at least one immense term in (whose utf8 encoding is
longer than the max length 32766), all of which were skipped. please
correct the analyzer to not produce such terms"

I’m hoping that there’s a simple fix or setting that can resolve this.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/01a22ff3-056d-4b54-8b28-a17f95d91f4b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This is actually a change in Lucene -- previously, the long term was
silently dropped; now it raises an exception. See the Lucene ticket
LUCENE-5710 ("DefaultIndexingChain swallows useful information from
MaxBytesLengthExceededException") in the ASF JIRA.

You might want to add a length token filter to your analyzer (see the
analysis-length-tokenfilter page in the Elasticsearch reference).
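A minimal sketch of such an analyzer (the names "max_term_length" and "safe_analyzer" are made up for illustration; this assumes the index-settings syntax current at the time of this thread). Note that the length filter counts characters while Lucene's 32766 limit is in UTF-8 bytes, so a conservative max such as 8191 (32766 / 4, the worst case for 4-byte UTF-8 characters) leaves headroom:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "max_term_length": {
          "type": "length",
          "max": 8191
        }
      },
      "analyzer": {
        "safe_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "max_term_length"]
        }
      }
    }
  }
}
```

Any field mapped to use safe_analyzer would then silently drop over-long tokens instead of failing the whole indexing request.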

All in all, it hints at some strange data, because such an "immense" term
probably shouldn't be in the index in the first place.

Karel

On Thursday, May 29, 2014 10:47:37 PM UTC+2, Jeff Dupont wrote:


For not_analyzed fields, is there a way of capturing the old behavior?
From what I can tell, you need to specify a tokenizer in order to have a
token filter.

On Tuesday, June 3, 2014 12:18:37 PM UTC-4, Karel Minařík wrote:


+1 on this question.

If the error is generated because of a not_analyzed field, how is it
possible to instruct ES to drop these values instead of failing the request?

On Tuesday, July 1, 2014 10:22:54 PM UTC+3, Andrew Mehler wrote:


How does this MAX_LENGTH restriction impact a custom_all field where we
may be copying data from different fields using some analyzer?
Is the MAX_LENGTH restriction also applicable to such a custom_all field,
which in turn would imply that in such a case the cumulative length is
what matters?
amish

On Thursday, October 30, 2014 3:43:26 AM UTC-7, Rotem wrote:


The max length restriction is per token, so it's unlikely you'll see it
unless you use not_analyzed fields. You can work around it by setting the
ignore_above option on the string type. That will just throw away the
over-long token.

Nik
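For the not_analyzed case, a mapping along these lines is a minimal sketch (the type and field names "doc" and "raw_field" are hypothetical; this assumes the pre-2.x string-type mapping syntax used at the time). Note that ignore_above counts characters, not bytes, so 10922 (32766 / 3, the worst case for 3-byte UTF-8 characters) is a commonly suggested conservative value:

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "raw_field": {
          "type": "string",
          "index": "not_analyzed",
          "ignore_above": 10922
        }
      }
    }
  }
}
```

Values longer than ignore_above are simply not indexed for that field (the document's _source is untouched), which effectively restores the old silent-drop behavior instead of failing the request.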
