Error "String length exceeds the maximum length (5000000)" when transferring a large document to the attachment pipeline

I'm using Elasticsearch 8.9.1
Using Python, I send an 8 MB .xlsx document to an Elasticsearch index via the attachment pipeline, but I get the error "String length (5046272) exceeds the maximum length (5000000)".
Documents of around 1 MB, for example, do not cause an error.
How can I increase the maximum string length or otherwise avoid this error?
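For reference, here is a minimal sketch of roughly what my indexing call looks like; the index name, pipeline name, and file name are placeholders, and it assumes a pipeline containing an attachment processor that reads the base64-encoded file from the data field.

```python
import base64

from elasticsearch import Elasticsearch

# Placeholder connection details; adjust for your cluster and credentials.
es = Elasticsearch("http://localhost:9200")

# The attachment processor expects the file content base64-encoded in a field
# (assumed here to be "data", read by a pipeline assumed to be named "attachment").
with open("report.xlsx", "rb") as f:
    payload = base64.b64encode(f.read()).decode("ascii")

# Indexing the 8 MB file this way is what triggers the string-length error.
es.index(index="documents", id="report-1", pipeline="attachment", document={"data": payload})
```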

Welcome!

Could you share the full stack trace please?

That's one of the reasons I prefer doing the extraction on the local machine before sending the data to Elasticsearch. This is what FSCrawler does.
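As a rough sketch of that idea in Python (assuming the third-party tika package, which runs Apache Tika locally; FSCrawler itself is a separate Java tool that automates this kind of workflow):

```python
from elasticsearch import Elasticsearch
from tika import parser  # third-party "tika" package, runs Apache Tika locally

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

# Extract the text on the local machine instead of inside the ingest pipeline.
parsed = parser.from_file("report.xlsx")
content = parsed.get("content") or ""

# Index the already-extracted text directly; no attachment pipeline involved.
es.index(index="documents", id="report-1", document={"content": content})
```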

But the error "String length (5046272) exceeds the maximum length (5000000)" appears.

The default limit of 5 MB has been increased to 20 MB. See:

It's part of Jackson 2.15.1, but AFAICS 2.15.1 is not yet part of 8.10.4...

Although it is supposed to be 2 GB...

I added a comment here:

The full error message is below. I don't know how to get the full stack trace.

{'root_cause': [{'type': 'document_parsing_exception', 'reason': "[-1:21] failed to parse field [attachment.content] of type [text] in document with id 'X5mWECsLu1Sp_VdsN7Bcng=='. Could not parse field value preview,"}], 'type': 'document_parsing_exception', 'reason': "[-1:21] failed to parse field [attachment.content] of type [text] in document with id 'X5mWECsLu1Sp_VdsN7Bcng=='. Could not parse field value preview,", 'caused_by': {'type': 'stream_constraints_exception', 'reason': 'String length (5046272) exceeds the maximum length (5000000)'}}
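That error body is what the 8.x Python client attaches to the exception it raises; a minimal sketch of how I read it (index, pipeline, and file names are placeholders), with the server-side stack trace itself presumably ending up in the Elasticsearch logs:

```python
import base64

from elasticsearch import ApiError, Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection details

with open("big_report.xlsx", "rb") as f:  # placeholder file name
    payload = base64.b64encode(f.read()).decode("ascii")

try:
    es.index(index="documents", pipeline="attachment", document={"data": payload})
except ApiError as e:
    print(e.meta.status)    # HTTP status of the failed request, e.g. 400
    print(e.body["error"])  # root_cause plus the nested caused_by chain shown above
```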

To avoid the problem described above, I'm using Elasticsearch 8.7.1. That version handles 5 MB documents successfully, but 44 MB documents cause the error below.

{'root_cause': [{'type': 'parse_exception', 'reason': 'Error parsing document in field [data]'}], 'type': 'parse_exception', 'reason': 'Error parsing document in field [data]', 'caused_by': {'type': 'tika_exception', 'reason': 'Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$2@5c3c5810', 'caused_by': {'type': 'record_format_exception', 'reason': 'Tried to allocate an array of length 108,463,468, but the maximum length for this record type is 100,000,000.\nIf the file is not corrupt and not large, please open an issue on bugzilla to request \nincreasing the maximum allowable size for this record type.\nYou can set a higher override value with IOUtils.setByteArrayMaxOverride()'}}}

Since I need to index very large MS Word/Excel documents, I would like to solve this problem too.

You can't do that with the ingest attachment plugin IMO. But the good news is that we just released this in 8.11: Content extraction | Enterprise Search documentation [8.11] | Elastic

I have not tested it yet, and I don't know whether you can use this service directly.

If it still does not work for you, maybe give FSCrawler a try.

It should be in the Elasticsearch logs.
