Default value of http.max_content_length does not restrict ingestion of large documents

I am using FSCrawler to ingest rich documents (PDF/PPT/DOCX etc.) into Elasticsearch. While validating how large a file I can ingest, I tried ingesting TXT files of 1GB and 2GB, and to my surprise the ingestion went fine even though the Elasticsearch configuration parameter http.max_content_length was set to 100MB. Doesn't this parameter restrict the size of content being ingested into Elasticsearch?
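For reference, the setting in question (the name as I know it is http.max_content_length, which caps the size of a single HTTP request body) lives in elasticsearch.yml; its default matches the 100MB value mentioned above:

http.max_content_length: 100mb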

I would appreciate any input on this.

Thanks
Sachin

How big is the JSON document generated by FSCrawler?
By default it does not extract all characters, only the first 10000 (IIRC).

I am extracting 100% of the document - I have indexed_chars set to 100%.
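For context, here is roughly what that looks like in my job's _settings.json (a sketch only - the job name, path, and node address are placeholders, and the exact elasticsearch section depends on the FSCrawler version):

{
  "name": "my_job",
  "fs": {
    "url": "/path/to/docs",
    "indexed_chars": "100%"
  },
  "elasticsearch": {
    "nodes": [ { "host": "127.0.0.1", "port": 9200, "scheme": "HTTP" } ]
  }
}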

When you do a GET for this document, does it send you back a 1GB JSON document?

When I fetch the document, it shows me the content, but I think it may not be the complete document because I can see that some text is cut off towards the end.

Since all lines in this text file are the same (the content was created through a batch program), it is hard to tell how much content is cut off.

FYI - this text file was created for testing using the commands below:

echo "This is just a sample line appended to create a big file.. " > dummy1G.txt
for /L %i in (1,1,10) do type dummy.txt >> dummy1G.txt

But is the JSON you are getting back valid JSON? I guess it is.

So most likely the 100% limit does not work. I'd use -1 instead.
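That is, in the fs section of _settings.json, something like (only the indexed_chars line matters here):

"fs": {
  "indexed_chars": -1
}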


I believe there was something wrong with my _settings.json file, so I created a new one, and now the ingestion of the 1GB TXT file is failing with an "Out of Memory" error - below is the error:

09:33:20,664 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [1119734816] characters of text for [dummy1G.txt]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Unknown Source) ~[?:1.8.0_162]
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(Unknown Source) ~[?:1.8.0_162]
        at java.lang.AbstractStringBuilder.append(Unknown Source) ~[?:1.8.0_162]
        at java.lang.StringBuffer.append(Unknown Source) ~[?:1.8.0_162]
        at java.io.StringWriter.write(Unknown Source) ~[?:1.8.0_162]
        at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:93) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:136) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:279) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:104) ~[tika-parsers-1.16.jar:1.16]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) ~[tika-core-1.16.jar:1.16]
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135) ~[tika-core-1.16.jar:1.16]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:124) ~[fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:91) [fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.indexFile(FsCrawlerImpl.java:671) [fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.addFilesRecursively(FsCrawlerImpl.java:460) [fscrawler-2.4.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl$FSParser.run(FsCrawlerImpl.java:344) [fscrawler-2.4.jar:?]
        at java.lang.Thread.run(Unknown Source) [?:1.8.0_162]

I tried allocating more JVM heap to FSCrawler by setting the environment variable FS_JAVA_OPTS to 3GB, but the ingestion still fails with an "Out of Memory" error:

set FS_JAVA_OPTS="-Xmx3072m"
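(One aside, offered as an assumption rather than a confirmed cause: in cmd.exe the quotes in set VAR="value" become part of the variable's value. If FSCrawler's launcher passes FS_JAVA_OPTS to the JVM verbatim, the quoted form may not apply -Xmx as intended, so the unquoted form is safer:)

rem quotes are not needed in cmd.exe and would otherwise become part of the value
set FS_JAVA_OPTS=-Xmx3072m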

When I drilled deeper by monitoring the heap usage via the VisualVM tool, there was an interesting observation.

VisualVM confirms that by setting the above environment variable, the JVM max heap is allocated accordingly.

But if I look at the maximum heap actually consumed while ingestion was in progress, it didn't go beyond 2GB - not sure why (interestingly, the max JVM heap on this graph shows 2GB whereas I configured 3GB).

I performed another test with a 4GB max JVM allocation, but the result was the same - consumption didn't go beyond ~2GB (interestingly, the max JVM heap on this graph shows 3GB whereas I configured 4GB).

Any idea what could be wrong here?

That's super interesting. I sadly don't know.
What if you try with a file smaller than 1GB?

I just know that the way FSCrawler is implemented today may not be the best way.
For example, if you are using the store_content option, it will load the full file into memory, which is bad.

Normally only the extracted data should be kept in memory, I guess.

I am testing different file sizes in conjunction with the FSCrawler JVM config - I will share the results in a day or two.

I did some testing for sizing purposes and below are my findings. In summary, ingestion of a single 10MB file works well, but anything beyond that does not.

| Test Case # | File Size | FSCrawler Java Heap | Result | Comments |
| --- | --- | --- | --- | --- |
| 1 | 500MB | 512MB | Failed | Failed with "Out of Memory" |
| 2 | 500MB | 1GB | Failed | Failed with "Out of Memory" |
| 3 | 500MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 4 | 500MB | 2GB | Failed | Failed with "Out of Memory" |
| 5 | 250MB | 512MB | Failed | Failed with "Out of Memory" |
| 6 | 250MB | 1GB | Failed | Failed with "Out of Memory" |
| 7 | 250MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 8 | 250MB | 2GB | Failed | Failed with "Out of Memory" |
| 9 | 100MB | 512MB | Failed | Failed with "Out of Memory" |
| 10 | 100MB | 1GB | Failed | Failed with "Out of Memory" |
| 11 | 100MB | 1.5GB | Failed | Failed with "Out of Memory" |
| 12 | 100MB | 2GB | Success | File ingestion goes through, but it takes a couple of minutes before the content shows up in the Kibana Discover tab, because Kibana is pulling in all the content of this big file. Searching within this big file also takes a long time and will most likely time out after 30 secs (30,000 ms); searching within other documents in the same index is not slow. |
| 13 | 80MB | 512MB | Failed | Failed with "Out of Memory" |
| 14 | 80MB | 1GB | Failed | Failed with "Out of Memory" |
| 15 | 80MB | 1.5GB | Success | Same as test case 12 |
| 16 | 80MB | 2GB | Success | Same as test case 12 |
| 17 | 50MB | 512MB | Failed | Failed with "Out of Memory" |
| 18 | 50MB | 1GB | Success | Viewing through the Kibana UI (via Discover) is still time consuming - it took a minute before anything showed up |
| 19 | 25MB | 512MB | Success | Slowness observed when searching and when pulling content via the Discover tab in Kibana (about 30-40 secs) |
| 20 | 10MB | 512MB | Success | Slight slowness still observed; searching content in the file took about 15-20 secs |
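A side note on the 30-second timeouts mentioned in the comments above: 30 seconds matches Kibana's default request timeout (elasticsearch.requestTimeout in kibana.yml, in milliseconds). Raising it, for example as below, only treats the symptom of querying such large documents, not the underlying size problem:

elasticsearch.requestTimeout: 120000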

That's an awesome report. Thanks a lot for doing it!

Would you mind opening an issue in the FSCrawler project with those details?
And if possible, could you share the files you used for the test, or a way to recreate those files (i.e. a Linux script)?

Thanks a lot!

The issue has been created - Ingestion of more than 10MB single file ingestion fails · Issue #566 · dadoonet/fscrawler · GitHub

To simulate this test at your end, just create a dummy file. I created mine using the Windows batch commands below:

echo "This is just a sample line appended to create a big file.. " > dummy.txt
for /L %i in (1,1,14) do type dummy.txt >> dummy.txt
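For the Linux equivalent asked about above, a minimal bash sketch (an assumption, not the commands actually used) that builds a ~1GB file from the same repeated line would be:

yes "This is just a sample line appended to create a big file.. " | head -c 1G > dummy1G.txt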

