FSCrawler - Indexing a mix of big and small files - HTTP 413 Request Entity Too Large error

I have split a large text file into multiple files of 50 MB each. When bulk_size is set to 1 in the FSCrawler _settings.json, each file is indexed into Elasticsearch without any error. If the bulk size is increased, I get an HTTP 413 Request Entity Too Large error.
When I have a folder with only small files and bulk_size is set to 100, everything is also indexed perfectly.

In some cases, I have a mix of large text files (50 MB each) and small files. If I keep bulk_size at 1, the small files are also indexed in batches of 1 and the indexing time increases.

Would it be possible to have a size range parameter so that the indexing time is reduced? As of now there is an "ignore_above" parameter, so I can run the files below 20 MB (for example) with a "bulk_size" of 20, something like the sketch below. Then, if there were a range parameter, I could run the same folder again for the files above 20 MB and up to 50 MB with a bulk_size of 1.
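For the small-file pass, a minimal sketch of a separate job could look like this, assuming the other settings stay at their defaults (the job name and the 20mb / bulk_size 20 values are just placeholders):

{
  "name" : "fs-test-small-files",
  "fs" : {
    "url" : "E:/Data/crawler_data/Test",
    "ignore_above" : "20mb"
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://localhost:9200"
    } ],
    "bulk_size" : 20,
    "index" : "fs-test-2024-001"
  }
}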

Also, please let me know if there is any other way to solve this issue.

Welcome!

Could you share your full _settings.json file please?

Please find below the _settings.json file:

{
  "name" : "fs-test-2024-001",
  "fs" : {
    "url" : "E:/Data/crawler_data/Test",
    "update_rate" : "12h",
    "includes" : [
      "*/*.jpg",
      "*/*.jpeg",
      "*/*.png",
      "*/*.doc",
      "*/*.docx",
      "*/*.pdf",
      "*/*.txt",
      "*/*.sql"
    ],
    "excludes" : [
      "*/*.zip",
      "*/*.rar",
      "*/*.exe",
      "*/*.mp4",
      "*/*.mp3"
    ],
    "json_support" : false,
    "xml_support" : false,
    "add_as_inner_object" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : false,
    "index_folders" : false,
    "lang_detect" : false,
    "continue_on_error" : false,
    "indexed_chars" : "-1",
    "ignore_above" : "50mb",
    "checksum" : "MD5",
    "ocr" : {
      "language" : "eng",
      "enabled" : true,
      "pdf_strategy" : "ocr_and_text"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://localhost:9200"
    } ],
    "bulk_size" : 1,
    "flush_interval" : "5s",
    "username" : "elastic",
    "password" : "",
    "index" : "fs-test-2024-001"
  },
  "rest" : {
    "url" : "http://127.0.0.1:8080/fscrawler"
  }
}

If I change bulk_size to 4, I get the error below.

02:29:13,612 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [fs-dark-2024-001]
02:29:13,612 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
02:29:13,614 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
02:29:13,616 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
02:29:13,617 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
02:29:13,620 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
02:29:13,620 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
02:29:13,620 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
02:29:13,621 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
02:29:13,621 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Executing [4] remaining actions
02:29:13,622 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 4 actions
02:29:16,568 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [4] documents to the Elasticsearch service
02:29:16,735 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 230246338 characters
02:29:17,815 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running POST https://localhost:9200/_bulk:
02:29:17,815 WARN  [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Error executing bulk
jakarta.ws.rs.ClientErrorException: HTTP 413 Request Entity Too Large
        at org.glassfish.jersey.client.JerseyInvocation.createExceptionForFamily(JerseyInvocation.java:985) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:967) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:755) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:675) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.call(JerseyInvocation.java:697) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.lambda$runInScope$3(JerseyInvocation.java:691) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:205) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:390) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.runInScope(JerseyInvocation.java:691) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:674) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:450) ~[jersey-client-3.1.5.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.httpCall(ElasticsearchClient.java:871) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.httpPost(ElasticsearchClient.java:847) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.bulk(ElasticsearchClient.java:808) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchEngine.bulk(ElasticsearchEngine.java:82) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchEngine.bulk(ElasticsearchEngine.java:31) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.execute(FsCrawlerBulkProcessor.java:146) [fscrawler-framework-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.internalClose(FsCrawlerBulkProcessor.java:101) [fscrawler-framework-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.close(FsCrawlerBulkProcessor.java:77) [fscrawler-framework-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.close(ElasticsearchClient.java:452) [fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.service.FsCrawlerDocumentServiceElasticsearchImpl.close(FsCrawlerDocumentServiceElasticsearchImpl.java:60) [fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl.close(FsCrawlerImpl.java:170) [fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.runner(FsCrawlerCli.java:399) [fscrawler-cli-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:119) [fscrawler-cli-2.10-SNAPSHOT.jar:?]
02:29:17,832 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service stopped

The problem is here:

"indexed_chars" : "-1"

You are asking FSCrawler to extract the whole text content. I'm not sure how much that actually represents, but it may be too much anyway.
There's also a limit on the Elasticsearch side (http.max_content_length), which defaults to 100mb, and I would not recommend increasing that limit unless you know exactly what you are doing.
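For example, if you don't need every extracted character to be searchable, you could cap the extraction at a fixed number of characters instead of -1 (the number below is only an illustration; the value is a character count, not bytes):

{
  "fs" : {
    "indexed_chars" : 10000000
  }
}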

Instead of using bulk_size, you could use byte_size:

elasticsearch.byte_size: 80mb

See Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation
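In the _settings.json file, that would look something like this (80mb is just an example value; keep it safely under the Elasticsearch http.max_content_length limit):

{
  "elasticsearch" : {
    "nodes" : [ {
      "url" : "https://localhost:9200"
    } ],
    "byte_size" : "80mb",
    "flush_interval" : "5s"
  }
}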

HTH

Yes, I am aware of the 100 MB limit on the Elasticsearch side. It is a big text file and it has been split into multiple files of 50 MB each. I have also tested with a byte_size of 60 MB and removed bulk_size; in that case I get the error as well.

As I mentioned earlier, if bulk_size is 1, indexing is successful.

So I'd suggest that you split the text into smaller pieces, like 10mb each.

I have a query: if byte_size is 60 MB and bulk_size is not specified, how many documents will be indexed at a time?

Hi David,

There are two settings, bulk_size and byte_size. Say I keep bulk_size at 50 and byte_size at 80mb.
While adding files to the bulk request, is it possible to check whether the byte_size limit has been crossed? If it is crossed after the 35th file, could those 35 files be bulk indexed on their own, with the remaining files indexed in the next batch?

I think there's an issue. The byte_size limit is totally ignored...
Could you open an issue?
