FSCrawler 2.10 REST service upload returns an error for files larger than 20 MB

Hi,
I am using FSCrawler 2.10 with Elasticsearch 8.9. I am trying to upload a 20 MB .msg file using FSCrawler's REST service, but it gives an error. Please note that I am able to upload smaller files without any issues.

Following is the debug log:

```
21:15:31,623 DEBUG [f.p.e.c.f.r.RestApi] Sending document [2021-05-29-1243-04-0000 abcd.msg] to elasticsearch.
21:15:31,743 ERROR [f.p.e.c.f.r.RestApi] Error parsing tags
com.fasterxml.jackson.core.exc.StreamConstraintsException: String length (20051112) exceeds the maximum length (20000000)
        at com.fasterxml.jackson.core.StreamReadConstraints.validateStringLength(StreamReadConstraints.java:324) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.util.ReadConstrainedTextBuffer.validateStringLength(ReadConstrainedTextBuffer.java:27) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.util.TextBuffer.finishCurrentSegment(TextBuffer.java:939) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2584) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2560) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:335) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer._deserializeContainerNoRecursion(JsonNodeDeserializer.java:572) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:100) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:25) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4867) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:3199) ~[jackson-databind-2.15.2.jar:2.15.2]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.getMergedJsonDoc(DocumentApi.java:269) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:207) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.3.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.0.jar:4.0.0]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.0.jar:4.0.0]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.0.jar:4.0.0]
        at java.lang.Thread.run(Thread.java:835) [?:?]
```

What setting should I update, and in which file, to increase the maximum allowed length (20000000)?

Thanks for the help!!

What are your FSCrawler job settings? Are you sending the binary content to Elasticsearch from FSCrawler?

Yes, I am using a WebClient in .NET to upload the file's memory stream as a byte array to FSCrawler's REST API.

```csharp
// buffer, read, mimeParts, afterFile, _footer, client, and FsCrawlerRestServiceUrl
// are defined elsewhere in our client code.
using (MemoryStream memoryStream = new MemoryStream())
{
    foreach (MimePart mimePart in mimeParts)
    {
        memoryStream.Write(mimePart.Header, 0, mimePart.Header.Length);
        while ((read = mimePart.Data.Read(buffer, 0, buffer.Length)) > 0)
        {
            memoryStream.Write(buffer, 0, read);
        }
        mimePart.Data.Dispose();
        memoryStream.Write(afterFile, 0, afterFile.Length);
    }
    memoryStream.Write(_footer, 0, _footer.Length);

    // Passing the FSCrawler REST service URL [ http://10.10.10.10:8080/fscrawler/_document ]
    // along with the file's MemoryStream as a byte array
    byte[] responseBytes = client.UploadData(FsCrawlerRestServiceUrl + "?debug=true", memoryStream.ToArray());

    string responseString = Encoding.UTF8.GetString(responseBytes);

    return new UploadResponse(HttpStatusCode.OK, responseString);
}
```

And below are the FSCrawler settings. Please note that I have .msg in the excludes list, as I am uploading the .msg files using the REST API.

I have not changed any other settings apart from this.
The same email gets uploaded if I use the FSCrawler 2.9 snapshot with Elasticsearch 7.8.

The settings for both versions have the same configuration.

Thanks

Maybe set the bulk size to 1 to see if it's OK.
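
For reference, a minimal sketch of that change in the job's _settings.yaml, assuming the standard FSCrawler settings layout and the job name from the logs:

```yaml
name: "jobs_dataproduction"
elasticsearch:
  # Send one document per bulk request instead of batching (the default is 100)
  bulk_size: 1
```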

You say you are using 2.9 snapshot. Is that true? If so, please use 2.10-snapshot.

Could you share the fscrawler logs from the start?

Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.

Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting. This makes it more likely that your question will receive a useful answer.

Thanks for the reply!

I changed the bulk size to 1, but it still doesn't work.
And I am using FSCrawler 2.10. I had just mentioned that if I use the FSCrawler 2.9 snapshot with Elasticsearch 7.8 the document gets indexed, but it is not indexed with FSCrawler 2.10 and Elasticsearch 8.9:

| # | FSCrawler | Elasticsearch version | Doc indexed |
|---|-----------|-----------------------|-------------|
| 1 | 2.10 | 8.9 | No |
| 2 | 2.9 snapshot | 7.8 | Yes |

Following is the complete log:

```
E:\FsCrawlerJobs2.10\Jobs\jobs_dataproduction\Services>set JAVA_HOME=C:\Program Files\Java\jdk-12.0.2
E:\FsCrawlerJobs2.10\Jobs\jobs_dataproduction\Services>set FS_JAVA_OPTS=-Xmx5g -Xms5g
E:\FsCrawlerJobs2.10\Jobs\jobs_dataproduction\Services>E:\fscrawler-2.10\bin\fscrawler.bat --config_dir \\10-SOLR01\FsCrawlerJobs2.10\Jobs jobs_dataproduction --rest --debug --trace --loop 0
21:37:24,594 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [4.8gb/5gb=97.85%], RAM [20gb/47.9gb=41.7%], Swap [18.1gb/85gb=21.3%].
21:37:25,593 DEBUG [f.p.e.c.f.c.FsCrawlerCli] Starting job [jobs_dataproduction]...
21:37:26,597 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
21:37:26,867 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/E:/fscrawler-2.10/lib/log4j-slf4j-impl-2.20.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
21:37:28,181 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 8.9.2 and 8 as the major version number
21:37:28,182 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.9.2
21:37:28,195 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service started
21:37:28,208 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version
21:37:28,282 DEBUG [f.p.e.c.f.c.ElasticsearchClient] get version returns 8.9.2 and 8 as the major version number
21:37:28,283 INFO  [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch Client connected to a node running version 8.9.2
21:37:28,288 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service started
21:37:28,361 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [jobs_dataproduction]
21:37:28,405 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://10.10.40.155:9200/jobs_dataproduction: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction/vb4JTrheSqGSLpGxExMyNg] already exists","index_uuid":"vb4JTrheSqGSLpGxExMyNg","index":"jobs_dataproduction"}],"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction/vb4JTrheSqGSLpGxExMyNg] already exists","index_uuid":"vb4JTrheSqGSLpGxExMyNg","index":"jobs_dataproduction"},"status":400}
21:37:28,406 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [jobs_dataproduction]: HTTP 400 Bad Request
21:37:28,448 DEBUG [f.p.e.c.f.c.ElasticsearchClient] create index [jobs_dataproduction_folder]
21:37:28,457 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running PUT http://10.10.40.155:9200/jobs_dataproduction_folder: {"error":{"root_cause":[{"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction_folder/3vvAN_MBRjSgsngcv6xtVA] already exists","index_uuid":"3vvAN_MBRjSgsngcv6xtVA","index":"jobs_dataproduction_folder"}],"type":"resource_already_exists_exception","reason":"index [jobs_dataproduction_folder/3vvAN_MBRjSgsngcv6xtVA] already exists","index_uuid":"3vvAN_MBRjSgsngcv6xtVA","index":"jobs_dataproduction_folder"},"status":400}
21:37:28,471 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Response for create index [jobs_dataproduction_folder]: HTTP 400 Bad Request
21:37:28,508 DEBUG [f.p.e.c.f.FsParserNoop] Fs crawler is going to sleep for 15m
21:37:29,006 WARN  [o.g.j.s.w.WadlFeature] JAXBContext implementation could not be found. WADL feature is disabled.
21:37:29,176 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.ServerStatusApi will be ignored.
21:37:29,177 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.UploadApi will be ignored.
21:37:29,181 WARN  [o.g.j.i.i.Providers] A provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi registered in SERVER runtime does not implement any provider interfaces applicable in the SERVER runtime. Due to constraint configuration problems the provider fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi will be ignored.
21:37:29,807 INFO  [f.p.e.c.f.r.RestServer] FS crawler Rest service started on [http://10.10.40.105:8680/fscrawler]
21:42:16,866 DEBUG [f.p.e.c.f.r.RestApi] uploadToDocumentService(true, null, null, jobs_dataproduction, ...)
21:42:16,893 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated so we need to configure Tesseract in case we have specific settings.
21:42:16,897 DEBUG [f.p.e.c.f.t.TikaInstance] Tesseract Language set to [eng].
21:42:16,931 DEBUG [f.p.e.c.f.t.TikaInstance] OCR is activated.
21:42:16,964 DEBUG [f.p.e.c.f.t.TikaInstance] But Tesseract is not installed so we won't run OCR.
21:42:16,967 INFO  [f.p.e.c.f.t.TikaInstance] OCR is disabled.
21:42:19,204 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_0FF90102
21:42:19,206 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_3001001F
21:42:19,217 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __properties_version1.0
21:42:19,220 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_0FF90102
21:42:19,221 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __substg1.0_3001001F
21:42:19,223 WARN  [o.a.p.h.d.AttachmentChunks] Currently unsupported attachment chunk property will be ignored. __properties_version1.0
21:42:20,226 DEBUG [f.p.e.c.f.r.RestApi] Sending document [2021-05-29-1243-04-0000 abcd.msg] to elasticsearch.
21:42:20,351 ERROR [f.p.e.c.f.r.RestApi] Error parsing tags
com.fasterxml.jackson.core.exc.StreamConstraintsException: String length (20051112) exceeds the maximum length (20000000)
        at com.fasterxml.jackson.core.StreamReadConstraints.validateStringLength(StreamReadConstraints.java:324) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.util.ReadConstrainedTextBuffer.validateStringLength(ReadConstrainedTextBuffer.java:27) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.util.TextBuffer.finishCurrentSegment(TextBuffer.java:939) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2584) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2560) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:335) ~[jackson-core-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer._deserializeContainerNoRecursion(JsonNodeDeserializer.java:572) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:100) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:25) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:323) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4867) ~[jackson-databind-2.15.2.jar:2.15.2]
        at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:3199) ~[jackson-databind-2.15.2.jar:2.15.2]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.getMergedJsonDoc(DocumentApi.java:269) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:207) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.3.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.3.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.3.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.0.jar:4.0.0]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.0.jar:4.0.0]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.0.jar:4.0.0]
        at java.lang.Thread.run(Thread.java:835) [?:?]
```

The error is about the size of the request string:

`String length (20051112) exceeds the maximum length (20000000)`

Is there any default request size configuration in Elasticsearch?


It's not related to Elasticsearch, as the problem happens, I think, before the document is sent to Elasticsearch.

The line which is failing is this one:

`final Doc mergedDoc = this.getMergedJsonDoc(doc, tags);`

Could you add ?simulate=true to your REST request and share the output as a gist.github.com file?
How exactly are you calling the REST API? Are you using tags? Could you try without the tags?
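
For reference, here is a sketch of what such a call could look like with curl. The URL and file names are placeholders, and it assumes the multipart file and tags form fields that the FSCrawler REST endpoint accepts:

```
curl -F "file=@document.msg" \
     -F "tags=@tags.json" \
     "http://127.0.0.1:8080/fscrawler/_document?simulate=true"
```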

Thanks again!

The problem is indeed with tags; I was able to index the email without tags.
I am adding the email's attachments to the external tag.

Adding external tags for files whose attachments are smaller than 20 MB works, and those emails get indexed. But this email has an attachment larger than 20 MB.

Also, as I mentioned, I am able to index the same file with tags on the older Elasticsearch 7.8 with the FSCrawler 2.9 snapshot. Then why is it not indexed with tags on the new FSCrawler 2.10 and the latest Elasticsearch 8.9?

Do we have any size restrictions for tags?
The error suggests it is a size restriction:

`String length (20051112) exceeds the maximum length (20000000)`

I am really sorry, but I am not able to share the output or the email file in question because of privacy and security concerns. I will check with the client whether I am allowed to share the email file.

So this looks like a regression (a bug) in FSCrawler, or a new limit that has been added in the Jackson project. I need to check this. Could you open an issue in the FSCrawler project?

Sir,

I have created bug #1709. Please let me know in case I need to edit it and add more details.

Thanks


Hi David,

On further investigating, I have had the following observation.

Please check: this issue is because of jackson-core's StreamReadConstraints.java, where it validates the string length; the default maximum is set to 20000000.

StreamReadConstraints.java was introduced in version 2.15 of jackson-core, which is used in FSCrawler 2.10.

FSCrawler 2.9 uses jackson-core version 2.13, which does not contain the StreamReadConstraints.java file. This is why the file was getting indexed using the FSCrawler 2.9 snapshot and Elasticsearch 7.8.

Can you please confirm whether this is going to be resolved, or whether a configuration option will be provided to increase the length?

Thank you for your findings. That helps a lot. Could you add this to the issue itself?

We could probably try to find a way to increase this default value, if this setting can be set when we create the Jackson mapper.
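
For illustration, a minimal sketch of how that constraint can be raised when the mapper is created, assuming jackson-core 2.15; the 50_000_000 value is an arbitrary example, not a recommended limit:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.StreamReadConstraints;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.json.JsonMapper;

public class MapperFactory {
    public static ObjectMapper buildMapper() {
        // Raise the per-string limit above Jackson 2.15's 20_000_000 default
        StreamReadConstraints constraints = StreamReadConstraints.builder()
                .maxStringLength(50_000_000)
                .build();
        JsonFactory factory = JsonFactory.builder()
                .streamReadConstraints(constraints)
                .build();
        return JsonMapper.builder(factory).build();
    }
}
```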

But that raises an "alarm" for me... Why would a single document have such a size? I fear that something is wrong somewhere in the FSCrawler code. store_source is set to false and indexed_chars is not set, so we normally extract just 100000 characters... I'm wondering what can produce such a huge document. My feeling is that raw_metadata set to true is generating too much data.

It would be great to continue this discussion in the issue. But as a step forward, could you put FSCrawler in trace mode with the --trace option and share the logs you get in a gist?

Yes, raw_metadata is set to true, but the string size problem is not because of the document's raw metadata; it is because of the tags I am adding, specifically external. I have created issue #1709, where I have mentioned:

I am trying to upload an email file with a PDF attachment larger than 20 MB using a .NET WebClient and the FSCrawler REST service. The attachment is added to the external tag, which contains the filename, the content type, and the data (the base64 data of the file).

The upload works for smaller attachments.

The exception occurs when the base64 data of the file is longer than 20000000 characters. Earlier I was able to index documents with external tags longer than this, as there was no such validation in jackson-core.
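
To illustrate, here is a hypothetical shape for such a tags payload; the field names and values are illustrative, not the actual client's structure:

```json
{
  "external": {
    "attachments": [
      {
        "filename": "attachment.pdf",
        "content_type": "application/pdf",
        "data": "JVBERi0xLjQK... base64 of a 20+ MB file, i.e. more than 20,000,000 characters ..."
      }
    ]
  }
}
```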

Thanks and Regards

You can check the gist for the output here.

Please note that this test file was successfully indexed using FSCrawler 2.9 and Elasticsearch 7.8.
Click here to download the JSON of the indexed document.

Click here for the test file.
