Default value of http.max_content_length does not restrict ingestion of large documents

I am using FSCrawler to ingest rich documents (PDF/PPT/DOCX etc.) into Elasticsearch. While validating how large a file I can ingest, I tried ingesting TXT files of 1GB and 2GB, and to my surprise the ingestion went fine even though the Elasticsearch configuration parameter http.max_content_length was set to 100MB. Doesn't this parameter restrict the size of content being ingested into Elasticsearch?
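For reference, the setting in question (the name as I know it is http.max_content_length, which caps the size of a single HTTP request body) lives in elasticsearch.yml; its default matches the 100MB value mentioned above:

http.max_content_length: 100mb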

I would appreciate any input on this.

Thanks
Sachin

How big is the JSON document generated by FSCrawler?
By default it does not extract all characters, only the first 10000 (IIRC).

I am extracting 100% of the document - I have indexed_chars set to 100%.
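For context, here is roughly what that looks like in my job's _settings.json (a sketch only - the job name, path, and node address are placeholders, and the exact elasticsearch section depends on the FSCrawler version):

{
  "name": "my_job",
  "fs": {
    "url": "/path/to/docs",
    "indexed_chars": "100%"
  },
  "elasticsearch": {
    "nodes": [ { "host": "127.0.0.1", "port": 9200, "scheme": "HTTP" } ]
  }
}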

When you do a GET for this document, does it send you back a 1GB JSON document?

When I fetch the document, it shows me the content, but I think it may not be the complete document because I can see that some text is cut off towards the end.

Since all lines in this text file are the same (the content was created through a batch program), it is hard to tell how much content is cut off.

FYI - this text file was created for testing using the commands below:

echo "This is just a sample line appended to create a big file.. " > dummy1G.txt
for /L %i in (1,1,10) do type dummy.txt >> dummy1G.txt

But is the JSON you are getting back valid JSON? I guess it is.

So most likely the 100% limit does not work. I'd use -1 instead.
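That is, in the fs section of _settings.json, something like (only the indexed_chars line matters here):

"fs": {
  "indexed_chars": -1
}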


I believe there was something wrong with my _settings.json file, so I created a new one, and now the ingestion of the 1GB TXT file is failing with an "Out of Memory" error - below is the error:

09:33:20,664 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [1119734816] characters of text for [dummy1G.txt]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Unknown Source) ~[?:1.8.0_162]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source) ~[?:1.8.0_162]
        at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:1.8.0_162]
        at java.lang.StringBuffer.append(Unknown Source) ~[?:1.8.0_162]
        at java.io.StringWriter.write(Unknown Source) ~[?:1.8.0_162]
        at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:104) ~[tika-parsers-1.16.jar:1.16]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[tika-core-1.16.jar:1.16]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:124) ~[fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:91) [fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.indexFile(FsCrawlerImpl.java:671) [fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.addFilesRecursively(FsCrawlerImpl.java:460) [fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:344) [fscrawler-2.4.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_162]

I tried allocating more JVM heap to FSCrawler by setting the environment variable FS_JAVA_OPTS to 3GB, but the ingestion still fails with an "Out of Memory" error:

set FS_JAVA_OPTS="-Xmx3072m"
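(One aside, offered as an assumption rather than a confirmed cause: in cmd.exe the quotes in set VAR="value" become part of the variable's value. If FSCrawler's launcher passes FS_JAVA_OPTS to the JVM verbatim, the quoted form may not apply -Xmx as intended, so the unquoted form is safer:)

rem quotes are not needed in cmd.exe and would otherwise become part of the value
set FS_JAVA_OPTS=-Xmx3072m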

When I drilled deeper by monitoring the heap usage via the VisualVM tool, there was an interesting observation.

VisualVM confirms that by setting the above environment variable, the JVM max heap is allocated accordingly.

But if I look at the maximum heap actually consumed while ingestion was in progress, it didn't go beyond 2GB - not sure why (interestingly, the max JVM heap on this graph shows 2GB whereas I configured 3GB).

I performed another test with a 4GB max JVM allocation, but the result was the same - consumption didn't go beyond ~2GB (interestingly, the max JVM heap on this graph shows 3GB whereas I configured 4GB).

Any idea what could be wrong here?

That's super interesting. I sadly don't know.
What if you try with a file smaller than 1GB?

I just know that the way FSCrawler is implemented today may not be the best way.
For example, if you are using the store_content option, it will load the full file into memory, which is bad.

Normally only the extracted data should be kept in memory, I guess.

I am testing different file sizes in conjunction with the FSCrawler JVM config - I will share the results in a day or two.

I did some testing for sizing purposes and below are my findings. In summary, ingestion of a single 10MB file works well, but anything beyond that does not.

| Test Case # | File Size | FSCrawler Java Heap | Result | Comments |
| --- | --- | --- | --- | --- |
| 1 | 500MB | 512MB | Failed | Failed with "Out of Memory" |
| 2 | 500MB | 1GB | Failed | Failed with "Out of Memory" |
| 3 | 500MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 4 | 500MB | 2GB | Failed | Failed with "Out of Memory" |
| 5 | 250MB | 512MB | Failed | Failed with "Out of Memory" |
| 6 | 250MB | 1GB | Failed | Failed with "Out of Memory" |
| 7 | 250MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 8 | 250MB | 2GB | Failed | Failed with "Out of Memory" |
| 9 | 100MB | 512MB | Failed | Failed with "Out of Memory" |
| 10 | 100MB | 1GB | Failed | Failed with "Out of Memory" |
| 11 | 100MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 12 | 100MB | 2GB | Success | File ingestion goes through, but it takes a couple of minutes before the content shows up in the Kibana Discover tab, because Kibana is pulling in all the content of this big file. Searching within this big file also takes a long time and will most likely time out after 30 secs (30,000 ms); searching within other documents in the same index is not slow. |
| 13 | 80MB | 512MB | Failed | Failed with "Out of Memory" |
| 14 | 80MB | 1GB | Failed | Failed with "Out of Memory" |
| 15 | 80MB | 1.5GB | Success | Same as test case 12 |
| 16 | 80MB | 2GB | Success | Same as test case 12 |
| 17 | 50MB | 512MB | Failed | Failed with "Out of Memory" |
| 18 | 50MB | 1GB | Success | Viewing through the Kibana UI (via Discover) is still time consuming - it took a minute before anything showed up |
| 19 | 25MB | 512MB | Success | Slowness observed when searching and when pulling content via the Discover tab in Kibana (about 30-40 secs) |
| 20 | 10MB | 512MB | Success | Slight slowness still observed; searching content in the file took about 15-20 secs |
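A side note on the 30-second timeouts mentioned in the comments above: 30 seconds matches Kibana's default request timeout (elasticsearch.requestTimeout in kibana.yml, in milliseconds). Raising it, for example as below, only treats the symptom of querying such large documents, not the underlying size problem:

elasticsearch.requestTimeout: 120000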

That's an awesome report. Thanks a lot for doing it!

Would you mind opening an issue in the FSCrawler project with those details?
And if possible, could you share the files you used for the test, or a way to recreate those files (i.e. a Linux script)?

Thanks a lot!

The issue has been created - Ingestion of more than 10MB single file ingestion fails · Issue #566 · dadoonet/fscrawler · GitHub

To simulate this test at your end, just create a dummy file. I created mine using the Windows batch commands below:

echo "This is just a sample line appended to create a big file.. " > dummy.txt
for /L %i in (1,1,14) do type dummy.txt >> dummy.txt
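For the Linux equivalent asked about above, a minimal bash sketch (an assumption, not the commands actually used) that builds a ~1GB file from the same repeated line would be:

yes "This is just a sample line appended to create a big file.. " | head -c 1G > dummy1G.txt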

