Hi,
While indexing various comments from Facebook, I sometimes get exceptions:
IllegalArgumentException: Document contains at least one immense term...
Is it possible to sanitize text before indexing it into Elasticsearch so that it doesn't throw these exceptions? Is there perhaps a token filter that removes overly long Unicode terms?
For details about the failing documents, see my (unanswered) Stack Overflow question: http://stackoverflow.com/questions/28941570/remove-long-unicode-terms-from-string-in-java
(I fear breaking another Elasticsearch-based mailing-list crawler, so I'd better not paste the failing document text here.)
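One way to sanitize before indexing is to drop any term whose UTF-8 encoding exceeds Lucene's hard per-term limit of 32766 bytes (the limit quoted in the exception). Below is a minimal sketch in Java; the class name, the whitespace-based splitting, and the helper signature are all made up for illustration — a real analyzer tokenizes quite differently, so treat this only as a pre-indexing safety net:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class TermSanitizer {
    // Lucene rejects any single term whose UTF-8 encoding exceeds 32766 bytes.
    static final int MAX_UTF8_BYTES = 32766;

    // Drop whitespace-separated terms whose UTF-8 byte length exceeds maxBytes.
    // Note: the byte length matters, not the char count - astral characters
    // such as emoji take 4 bytes each in UTF-8.
    static String sanitize(String text, int maxBytes) {
        List<String> kept = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (term.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
                kept.add(term);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Ordinary terms pass through unchanged.
        String ok = "short terms only";
        assert sanitize(ok, MAX_UTF8_BYTES).equals(ok);

        // Three U+1F600 emoji are only 6 Java chars but 12 UTF-8 bytes,
        // so with a tiny 8-byte limit the middle term is dropped.
        String emoji = new String(Character.toChars(0x1F600)).repeat(3);
        assert sanitize("a " + emoji + " b", 8).equals("a b");
    }
}
```

Run with `java -ea TermSanitizer` so the assertions are checked. This won't help when the analyzer itself glues sub-tokens back into one immense term, but it guarantees no single whitespace-delimited input term can trip the limit.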
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/93a5ed0d-6486-48b4-a228-1aff47d14ce0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
See Length token filter | Elasticsearch Guide | Elastic
--
Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Lucene.NET committer and PMC member
On Thu, Mar 12, 2015 at 10:52 AM, Bernhard Berger <
bernhardberger3456@gmail.com> wrote:
On 12.03.15 10:03, Itamar Syn-Hershko wrote:
See Length token filter | Elasticsearch Guide [8.11] | Elastic
Unfortunately, the length token filter also doesn't filter out these immense terms. See my example at Elasticsearch error · GitHub: I created a length filter for terms longer than 5000 (characters? bytes?) but still get the exception when using the icu_normalizer:

IllegalArgumentException: Document contains at least one immense term in field="message" (whose UTF8 encoding is longer than the max length 32766),

(The length of this message value is 3728 bytes, UTF-8 encoded.)
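For reference, an analyzer with a length filter along the lines described above can be declared like this (a sketch only: the index name comments, the analyzer name sanitized, and the filter name max_term_length are made up, and the icu_normalizer filter assumes the ICU analysis plugin is installed; as reported above, this setup did not prevent the immense-term error):

```
PUT /comments
{
  "settings": {
    "analysis": {
      "filter": {
        "max_term_length": {
          "type": "length",
          "min": 0,
          "max": 5000
        }
      },
      "analyzer": {
        "sanitized": {
          "tokenizer": "standard",
          "filter": ["icu_normalizer", "max_term_length"]
        }
      }
    }
  }
}
```

Note that the length filter's max is counted in characters, while Lucene's 32766 limit is counted in UTF-8 bytes, so the two limits do not measure the same thing.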