I am using FSCrawler to ingest rich documents (PDF/PPT/DOCX etc.) into Elasticsearch. While validating how large a file I can ingest, I tried TXT files of 1GB and 2GB and, to my surprise, ingestion went fine even though the Elasticsearch setting http.max_content_length was set to 100MB. Doesn't this parameter restrict the size of the content being ingested into Elasticsearch?
I'd appreciate any input on this.
I believe there was something wrong with my _settings.json file, so I created a new one, and now the ingestion of the 1GB TXT file fails with an "Out of Memory" error - below is the error:
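For reference, this is how I understand the limit could be checked directly against Elasticsearch, outside FSCrawler (the index/type names and the generated file below are just placeholders I made up for the test):

```sh
# build a ~200MB JSON document: {"content":"xxx...x"}
printf '{"content":"' > big.json
yes x | tr -d '\n' | head -c 200000000 >> big.json
printf '"}' >> big.json

# with http.max_content_length at its 100mb default, I'd expect Elasticsearch
# to reject a request body this large (HTTP 413) rather than index it
curl -s -o /dev/null -w '%{http_code}\n' \
     -H 'Content-Type: application/json' \
     -XPUT 'http://localhost:9200/test_index/doc/1' --data-binary @big.json
```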
09:33:20,664 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [1119734816] characters of text for [dummy1G.txt]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Unknown Source) ~[?:1.8.0_162]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source) ~[?:1.8.0_162]
at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:1.8.0_162]
at java.lang.StringBuffer.append(Unknown Source) ~[?:1.8.0_162]
at java.io.StringWriter.write(Unknown Source) ~[?:1.8.0_162]
at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:104) ~[tika-parsers-1.16.jar:1.16]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[tika-core-1.16.jar:1.16]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:124) ~[fscrawler-2.4.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:91) [fscrawler-2.4.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.indexFile(FsCrawlerImpl.java:671) [fscrawler-2.4.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.addFilesRecursively(FsCrawlerImpl.java:460) [fscrawler-2.4.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:344) [fscrawler-2.4.jar:?]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_162]
I tried allocating more heap to the FSCrawler JVM by setting the environment variable FS_JAVA_OPTS to 3GB, but ingestion still fails with an "Out of Memory" error:
set FS_JAVA_OPTS="-Xmx3072m"
When I drilled deeper by monitoring heap usage with the VisualVM tool, I made an interesting observation.
According to VisualVM, setting the environment variable above does change the JVM max heap allocation.
However, looking at the maximum heap consumed while ingestion was in progress, it never went beyond 2GB - I'm not sure why (interestingly, the max JVM heap on this graph shows 2GB, whereas I configured 3GB).
I performed another test with a 4GB max JVM allocation, but the result was the same - consumption didn't go beyond ~2GB (and here the max JVM heap on the graph shows 3GB, whereas I configured 4GB).
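One way to cross-check what the FSCrawler JVM actually got, independent of the VisualVM graph, is to ask the JVM itself with the standard JDK 8 tools (replace <pid> with the FSCrawler process id shown by jps):

```sh
jps -l                          # list running Java processes with their main class/jar
jinfo -flag MaxHeapSize <pid>   # prints -XX:MaxHeapSize=<bytes> for that JVM
jcmd <pid> VM.flags             # alternative: dump all flags the JVM is running with
```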
That's super interesting. I sadly don't know.
What if you try with a smaller file than 1GB?
I just know that the way FSCrawler is implemented today is maybe not the best way.
For example, if you are using the store_content option, it loads the full file into memory, which is bad.
Normally only the extracted data should be kept in memory, I guess.
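For example, a rough sketch of the relevant part of a job's _settings.json (the option names are the ones I recall from the FSCrawler 2.x docs - double-check them against your version, and treat the values as placeholders):

```json
{
  "name": "my_job",
  "fs": {
    "url": "/path/to/documents",
    "indexed_chars": "100000",
    "store_source": false
  },
  "elasticsearch": {
    "nodes": [ { "host": "127.0.0.1", "port": 9200 } ]
  }
}
```

Limiting indexed_chars should keep only the first part of the extracted text in memory and in the index, which I'd expect to avoid the StringWriter blow-up in the stack trace above.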
I did some testing for sizing purposes; my findings are below. In summary, ingesting a single file of up to 10MB works well, but anything beyond that does not.
| Test Case # | File Size | FSCrawler Java Heap (-Xmx) | Result | Comments |
|---|---|---|---|---|
| 1 | 500MB | 512MB | Failed | Failed with "Out of Memory" |
| 2 | 500MB | 1GB | Failed | Failed with "Out of Memory" |
| 3 | 500MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 4 | 500MB | 2GB | Failed | Failed with "Out of Memory" |
| 5 | 250MB | 512MB | Failed | Failed with "Out of Memory" |
| 6 | 250MB | 1GB | Failed | Failed with "Out of Memory" |
| 7 | 250MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 8 | 250MB | 2GB | Failed | Failed with "Out of Memory" |
| 9 | 100MB | 512MB | Failed | Failed with "Out of Memory" |
| 10 | 100MB | 1GB | Failed | Failed with "Out of Memory" |
| 11 | 100MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 12 | 100MB | 2GB | Success | Ingestion succeeds, but it takes a couple of minutes before the content appears in Kibana's Discover tab, because the full content of this large file is pulled in. Searching within this file is slow and usually times out after 30s (30,000 ms); searches against other documents in the same index are not affected. |
| 13 | 80MB | 512MB | Failed | Failed with "Out of Memory" |
| 14 | 80MB | 1GB | Failed | Failed with "Out of Memory" |
| 15 | 80MB | 1.5GB | Success | Same behaviour as test case 12. |
| 16 | 80MB | 2GB | Success | Same behaviour as test case 12. |
| 17 | 50MB | 512MB | Failed | Failed with "Out of Memory" |
| 18 | 50MB | 1GB | Success | Viewing the content through Kibana's Discover tab is still slow - it took about a minute before anything showed up. |
| 19 | 25MB | 512MB | Success | Some slowness when searching and when pulling content via Kibana's Discover tab (about 30-40s). |
| 20 | 10MB | 512MB | Success | Slight slowness still observed; searching content in the file took about 15-20s. |
That's an awesome report. Thanks a lot for doing it!
Would you mind opening an issue in FSCrawler project with those details?
And if possible, could you share the files you used for the test, or a way to recreate those files (i.e. a Linux script)?
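If you no longer have them, something along these lines should be enough to regenerate plain-text files of those sizes on Linux (GNU head accepts the M/G size suffixes; file names and content are just placeholders):

```sh
# generate dummy TXT files of the sizes used in the table above
for size in 10M 25M 50M 80M 100M 250M 500M 1G; do
  yes "The quick brown fox jumps over the lazy dog." | head -c "$size" > "dummy_${size}.txt"
done
```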