FSCrawler - Indexing mix of Big and small files - HTTP Entity too large error

kamalsharma · January 12, 2024, 11:54am

I have split a large text file into multiple files of 50 MB each. When bulk_size is set to 1 in the FSCrawler _settings.json, each file is indexed into Elasticsearch without any error. If bulk_size is increased, I get the HTTP 413 Request Entity Too Large error.
When I have a folder with only small files and bulk_size is set to 100, everything is also indexed without errors.

In some cases, I have a mix of large text files (50 MB each) and small files. If I keep bulk_size at 1, the small files are also indexed in batches of 1 and the indexing time increases.

Is it possible to have a size range parameter so that the indexing time can be reduced? As of now we have the "ignore_above" parameter, so I can index files below 20 MB (for example) with a "bulk_size" of 20. Then, if we had a range parameter, I could run the same folder again for files above 20 MB and up to 50 MB with a bulk_size of 1.

Also, please let me know if there is any other way to solve this issue.

dadoonet · January 12, 2024, 12:24pm

Welcome!

Could you share your full _settings.json file please?

kamalsharma · January 12, 2024, 12:37pm

Please find below the _settings.json file:

{
  "name" : "fs-test-2024-001",
  "fs" : {
    "url" : "E:/Data/crawler_data/Test",
    "update_rate" : "12h",
	"includes" : [
		"*/*.jpg",
		"*/*.jpeg",
		"*/*.png",
		"*/*.doc",
		"*/*.docx",
		"*/*.pdf",
		"*/*.txt",
		"*/*.sql"
		],
    "excludes" : [ 
		"*/*.zip",
		"*/*.rar",
		"*/*.exe",
		"*/*.mp4",
		"*/*.mp3"
	],
	"json_support" : false,
	"xml_support" : false,
	"add_as_inner_object" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "store_source" : false,
    "index_content" : true,
    "attributes_support" : false,
    "raw_metadata" : false,
    "index_folders" : false,
    "lang_detect" : false,
    "continue_on_error" : false,
	"indexed_chars" : "-1",
	"ignore_above": "50mb",
	"checksum": "MD5",
	"ocr" : {
      "language" : "eng",
      "enabled"  : true,
	  "pdf_strategy": "ocr_and_text"
    }
  },
  "elasticsearch" : {
    "nodes" : [ {
	  "url": "https://localhost:9200"
    } ],
	"bulk_size": 1,
    "flush_interval": "5s",
	"username": "elastic",
	"password": "",
	"index": "fs-test-2024-001"
  },
  "rest" : {
	"url": "http://127.0.0.1:8080/fscrawler"
  }
}

If I change bulk_size to 4, I get the error below.

02:29:13,612 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [fs-dark-2024-001]
02:29:13,612 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
02:29:13,614 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
02:29:13,616 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
02:29:13,617 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
02:29:13,620 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
02:29:13,620 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Closing Elasticsearch client manager
02:29:13,620 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
02:29:13,621 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
02:29:13,621 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Executing [4] remaining actions
02:29:13,622 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 4 actions
02:29:16,568 DEBUG [f.p.e.c.f.c.ElasticsearchEngine] Sending a bulk request of [4] documents to the Elasticsearch service
02:29:16,735 DEBUG [f.p.e.c.f.c.ElasticsearchClient] bulk a ndjson of 230246338 characters
02:29:17,815 DEBUG [f.p.e.c.f.c.ElasticsearchClient] Error while running POST https://localhost:9200/_bulk:
02:29:17,815 WARN  [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Error executing bulk
jakarta.ws.rs.ClientErrorException: HTTP 413 Request Entity Too Large
        at org.glassfish.jersey.client.JerseyInvocation.createExceptionForFamily(JerseyInvocation.java:985) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.convertToException(JerseyInvocation.java:967) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.translate(JerseyInvocation.java:755) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$1(JerseyInvocation.java:675) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.call(JerseyInvocation.java:697) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.lambda$runInScope$3(JerseyInvocation.java:691) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:205) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:390) ~[jersey-common-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.runInScope(JerseyInvocation.java:691) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation.invoke(JerseyInvocation.java:674) ~[jersey-client-3.1.5.jar:?]
        at org.glassfish.jersey.client.JerseyInvocation$Builder.method(JerseyInvocation.java:450) ~[jersey-client-3.1.5.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.httpCall(ElasticsearchClient.java:871) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.httpPost(ElasticsearchClient.java:847) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.bulk(ElasticsearchClient.java:808) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchEngine.bulk(ElasticsearchEngine.java:82) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchEngine.bulk(ElasticsearchEngine.java:31) ~[fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.execute(FsCrawlerBulkProcessor.java:146) [fscrawler-framework-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.internalClose(FsCrawlerBulkProcessor.java:101) [fscrawler-framework-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.framework.bulk.FsCrawlerBulkProcessor.close(FsCrawlerBulkProcessor.java:77) [fscrawler-framework-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.client.ElasticsearchClient.close(ElasticsearchClient.java:452) [fscrawler-elasticsearch-client-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.service.FsCrawlerDocumentServiceElasticsearchImpl.close(FsCrawlerDocumentServiceElasticsearchImpl.java:60) [fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.FsCrawlerImpl.close(FsCrawlerImpl.java:170) [fscrawler-core-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.runner(FsCrawlerCli.java:399) [fscrawler-cli-2.10-SNAPSHOT.jar:?]
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:119) [fscrawler-cli-2.10-SNAPSHOT.jar:?]
02:29:17,832 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceElasticsearchImpl] Elasticsearch Document Service stopped

dadoonet · January 12, 2024, 12:44pm

The problem is here:

"indexed_chars" : "-1"

You are asking to extract the whole text content. I'm not sure how much that actually represents, but it may be too much anyway.
There's a limit on the Elasticsearch side, which is 100mb by default. And I would not recommend increasing this limit unless you know exactly what you are doing.
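
As a rough sanity check on the figures in the log above, here is a small sketch. It assumes the default http.max_content_length of 100mb and counts at least one byte per character; the variable names are only illustrative.

```python
# Rough arithmetic on the failing bulk request, using figures from the log above.
ES_LIMIT_BYTES = 100 * 1024 * 1024   # assumed default http.max_content_length (100mb)
bulk_chars = 230_246_338             # "bulk a ndjson of 230246338 characters"
docs_in_bulk = 4                     # "Sending a bulk request of [4] documents"

# Each character takes at least one byte in UTF-8, so this is a lower bound.
min_bulk_bytes = bulk_chars
exceeds_limit = min_bulk_bytes > ES_LIMIT_BYTES

# Average size of one document in that request, and how many would fit under the limit.
avg_doc_bytes = min_bulk_bytes / docs_in_bulk
max_docs_per_bulk = int(ES_LIMIT_BYTES // avg_doc_bytes)

print(f"bulk is at least {min_bulk_bytes / 1024**2:.1f} MiB, over limit: {exceeds_limit}")
print(f"average document is about {avg_doc_bytes / 1024**2:.1f} MiB")
print(f"documents of that size that fit in one request: {max_docs_per_bulk}")
```

With documents of roughly 55 MiB each, only one fits under a 100mb request limit, which would be consistent with bulk_size 1 succeeding and any larger value failing.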

Instead of using bulk_size, you could use byte_size:

elasticsearch.byte_size: 80mb

See Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation

HTH

kamalsharma · January 12, 2024, 12:50pm

Yes, I am aware of the 100 MB limit on the Elasticsearch side. It is a big text file that has been split into multiple files of 50 MB each. I have also tested with a byte_size of 60 MB and with bulk_size removed. In that case I also get the error.

As I mentioned earlier, if bulk_size is 1, it is successful.

dadoonet · January 12, 2024, 12:52pm

So I'd suggest that you split the text in smaller pieces, like 10mb each.

kamalsharma · January 12, 2024, 12:53pm

I have a question: if byte_size is 60 MB and bulk_size is not specified, how many documents are indexed at a time?

kamalsharma · January 22, 2024, 1:54pm

Hi David,

There are two settings, bulk_size and byte_size. Suppose I set bulk_size to 50 and byte_size to 80mb.
While files are being added to the bulk request, is it possible to check whether the byte_size limit has been crossed? If it is crossed after the 35th file, is it possible to bulk index those 35 files alone and index the remaining files in the next batch?
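
To illustrate the behaviour being asked about, here is a minimal sketch in Python. It is not FSCrawler's actual code: SizeAwareBatcher, max_actions, max_bytes and send are hypothetical names. The batch is flushed as soon as either the action count is reached or adding the next document would exceed the byte budget.

```python
from typing import Callable, List


class SizeAwareBatcher:
    """Hypothetical batcher: flushes on a document count OR a byte budget."""

    def __init__(self, max_actions: int, max_bytes: int,
                 send: Callable[[List[str]], None]) -> None:
        self.max_actions = max_actions   # plays the role of bulk_size
        self.max_bytes = max_bytes       # plays the role of byte_size
        self.send = send                 # callback that would perform the bulk request
        self._pending: List[str] = []
        self._pending_bytes = 0

    def add(self, doc: str) -> None:
        doc_bytes = len(doc.encode("utf-8"))
        # Flush first if this document would push the batch over the byte budget.
        if self._pending and self._pending_bytes + doc_bytes > self.max_bytes:
            self.flush()
        self._pending.append(doc)
        self._pending_bytes += doc_bytes
        # Flush when the document count limit is reached.
        if len(self._pending) >= self.max_actions:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self.send(list(self._pending))
            self._pending.clear()
            self._pending_bytes = 0


# Small demo: a 100-byte budget and three 40-byte documents.
batches: List[List[str]] = []
batcher = SizeAwareBatcher(max_actions=50, max_bytes=100, send=batches.append)
for _ in range(3):
    batcher.add("x" * 40)
batcher.flush()  # send whatever is left
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

In this sketch a single document larger than max_bytes is still sent on its own, so the byte budget bounds batches of several documents but cannot shrink one oversized document.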

dadoonet · January 31, 2024, 5:12pm

I think there's an issue. The byte_size limit is totally ignored...
Could you open an issue?

system · February 28, 2024, 5:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
