FSCrawler Ingest pdf error Exceeds maximum allowed document size of 102400 bytes

jeje232 · May 14, 2021, 5:38pm

Hello,

I want to use FSCrawler to push my PDF books to Workplace Search.
I tried different bulk_size and flush_interval values, but nothing helps. I always get the same maximum allowed document size error.

My trace logs are:

18:51:17,039 DEBUG [f.p.e.c.f.FsParserAbstract] [/Other/romeo_and_juliet.pdf] can be indexed: [true]
18:51:17,039 DEBUG [f.p.e.c.f.FsParserAbstract]   - file: /Other/romeo_and_juliet.pdf
18:51:17,040 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/srv/es/Other],[romeo_and_juliet.pdf]
18:51:17,040 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/srv/es, /srv/es/Other/romeo_and_juliet.pdf) = /Other/romeo_and_juliet.pdf
18:51:17,040 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/srv/es/Other/romeo_and_juliet.pdf]
18:51:17,040 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
18:51:18,270 DEBUG [f.p.e.c.f.t.w.WPSearchBulkResponse] {"results":[{"id":"a4ba2a663e56d88185216a513553dc55","errors":[]},{"id":"f83f53e7451650d84d2a8e961ac4bf5c","errors":["Exceeds maximum allowed document size of 102400 bytes"]}]}
18:51:18,270 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 2 actions
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw.entrySet(), iterableWithSize(37));
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:unmappedUnicodeCharsPerPage", "0"));
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:PDFVersion", "1.4"));
18:51:18,438 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:docinfo:title", "Romeo and Juliet"));
18:51:18,438 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("xmp:CreatorTool", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,439 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:hasXFA", "false"));
18:51:18,439 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:modify_annotations", "true"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:can_print_degraded", "true"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:creator", "William Shakespeare"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dcterms:created", "2017-02-25T04:34:55Z"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:format", "application/pdf; version=1.4"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("title", "Romeo and Juliet"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:docinfo:creator_tool", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:fill_in_form", "true"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:encrypted", "false"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("dc:title", "Romeo and Juliet"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:hasMarkedContent", "false"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Content-Type", "application/pdf"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:docinfo:creator", "William Shakespeare"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.pdf.PDFParser"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("creator", "William Shakespeare"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:author", "William Shakespeare"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("meta:creation-date", "2017-02-25T04:34:55Z"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("created", "2017-02-25T04:34:55Z"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:extract_for_accessibility", "true"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:assemble_document", "true"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("xmpTPg:NPages", "143"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Creation-Date", "2017-02-25T04:34:55Z"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("resourceName", "romeo_and_juliet.pdf"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:hasXMP", "true"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:charsPerPage", "47"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:extract_content", "true"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:can_print", "true"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("Author", "William Shakespeare"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("producer", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("access_permission:can_modify", "true"));
18:51:18,444 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:docinfo:producer", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,444 TRACE [f.p.e.c.f.t.TikaDocParser]   assertThat(raw, hasEntry("pdf:docinfo:created", "2017-02-25T04:34:55Z"));
18:51:18,445 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
18:51:18,445 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
18:51:18,445 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/srv/es, /srv/es/Other/romeo_and_juliet.pdf) = /Other/romeo_and_juliet.pdf
18:51:18,445 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceWorkplaceSearchImpl] Indexing workplace/a3c830b8c88bda934b746d14be7167c?pipeline=null
18:51:18,446 DEBUG [f.p.e.c.f.t.w.WPSearchClient] Adding document {extension=pdf, comments=null, keywords=null, author=William Shakespeare, created_at=2021-05-14T18:51:00+02:00, language=null, body=

... here the 100 first pages ...

C, title=Romeo and Juliet, url=http://127.0.0.1/Other/romeo_and_juliet.pdf, path=/srv/es/Other/romeo_and_juliet.pdf, size=383706, text_size=null, mime_type=application/pdf, name=romeo_and_juliet.pdf, id=a3c830b8c88bda934b746d14be7167c, last_modified=2021-05-14T18:51:00+02:00}
...
18:51:19,323 DEBUG [f.p.e.c.f.t.w.WPSearchBulkResponse] {"results":[{"id":"f728d2aec73a89f63bed4a66a51d6c37","errors":[]},{"id":"a3c830b8c88bda934b746d14be7167c","errors":["Exceeds maximum allowed document size of 102400 bytes"]}]}

I tried removing the workplace_search section from the _settings.yaml of my FSCrawler configuration and keeping only elasticsearch (my http.max_content_length is set to 400mb), and it works! I can see 100 pages of my book in the workplace index.

I don't know if it's possible to split the PDF so that no document grows beyond 102400 bytes, or if it's possible to raise the maximum content length for my Enterprise Search. I read elsewhere that it's not possible, but I prefer to ask just in case.

Thank you very much and best regards

Sean_Story · May 14, 2021, 7:12pm

It is configurable! Look to increase workplace_search.custom_api_source.document_size.limit in your enterprise-search.yml. As you've discovered, the default is 102400 (100KB), but you'd probably be safe up to 10240000 (10MB). Just remember that, if you're sending 100 documents at a time, each at 10MB, you're trying to post a full GB of data, and your performance will suffer accordingly.
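
To illustrate the trade-off described above: the bulk payload is roughly bulk_size times the document size, so a higher per-document limit may call for a smaller bulk. Below is a minimal sketch of the workplace_search section of an FSCrawler _settings.yaml. The bulk_size and flush_interval keys are the ones that appear in the settings dump in this thread; the values 10 and "5s" are illustrative assumptions, not recommendations.

```yaml
# _settings.yaml (FSCrawler job settings), sketch only.
# Other workplace_search keys (server url, key, access_token) are omitted here.
workplace_search:
  # Assumed value: with documents of up to ~10MB each, 10 documents per bulk
  # is at most ~100MB per request, versus ~1GB for 100 documents per bulk.
  bulk_size: 10
  flush_interval: "5s"
```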

jeje232 · May 14, 2021, 7:43pm

Hi,

I was confused. I don't know why I thought the Enterprise Search limit couldn't exceed 102400, that it could be smaller but not higher.
For 7.12.1 I see a breaking change:
workplace_search.content_source.document_size.limit

I will try it and give you feedback.

Thank you very much...

jeje232 · May 14, 2021, 8:14pm

Hello,

I confirm: adding workplace_search.content_source.document_size.limit=1000kb solves the limit problem.
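
For reference, a minimal sketch of how that line might look in enterprise-search.yml, assuming the usual YAML `key: value` form. The setting name and the 1000kb value are the ones reported in this thread for 7.12.1; older versions may use the custom_api_source name mentioned earlier in the thread.

```yaml
# enterprise-search.yml, sketch only.
# Setting name as reported in this thread for Enterprise Search 7.12.1.
# Raises the per-document size limit above the default of 102400 bytes.
workplace_search.content_source.document_size.limit: 1000kb
```

Enterprise Search will most likely need a restart before the new limit applies; that is an assumption, not something confirmed in this thread.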

Thank you very much Sean for your help!

dadoonet · May 14, 2021, 9:30pm

@jeje232 I'd love it if you could send a documentation PR to FSCrawler to document this.

jeje232 · May 15, 2021, 9:56am

Hi David,

I want to tell you personally: good work, and many thanks for your contribution to FSCrawler.

It's not fully working yet. It doesn't send my whole book, just the first 100 pages, and it gives errors, but it's a good step.

When I have everything working, I'll send you my steps.

jeje232 · May 15, 2021, 10:22am

I increased the FSCrawler indexed_chars setting and it's fine: my whole book is sent and searchable in Workplace Search, but I still get an error. I'll keep investigating...
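
A minimal sketch of the corresponding part of the FSCrawler _settings.yaml, using the name, url, and indexed_chars values that appear in the settings dump later in this thread. As far as I know, FSCrawler extracts only the first 100000 characters of each file by default, which would be consistent with only part of the book being sent; treat that default as an assumption and check the FSCrawler documentation for your version.

```yaml
# _settings.yaml (FSCrawler job settings), sketch only.
name: "workplace"
fs:
  url: "/srv/es"
  # Maximum number of characters extracted from each file.
  indexed_chars: "1000000.0"
```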

jeje232 · May 15, 2021, 11:42am

With the workplace_search section in my _settings.yaml, bin/fscrawler stops.

This is what I find in the logs:
FSCrawler:

13:19:52,618 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [488.3mb/522mb=93.56%], RAM [484.7mb/7.7gb=6.08%], Swap [1000mb/1021.9mb=97.85%].
13:19:53,046 TRACE [f.p.e.c.f.c.FsCrawlerCli] settings used for this crawler: [---
name: "workplace"
fs:
  url: "/srv/es"
  update_rate: "15m"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  indexed_chars: "1000000.0"
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng+esp+fra"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://127.0.0.1:9200"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "100mb"
  username: "elastic"
  ssl_verification: true
workplace_search:
  server:
    url: "http://172.16.0.208:3002"
  key: "609facc39e00d07aea1c09f0"
  access_token: "8d3e48767cb4a09746d4fa59fee0784b6e2974a395b94162b3d0efffd4ac17c2"
  url_prefix: "https://fscrawler01.intranet.lan/books"
  bulk_size: 1000
  flush_interval: "5s"
]

13:19:53,048 WARN  [f.p.e.c.f.c.FsCrawlerCli] Workplace Search integration does not support yet watching a directory. It will be able to run only once and exit. We manually force from --loop -1 to --loop 1. If you want to remove this message next time, please start FSCrawler with --loop 1

13:19:53,993 TRACE [f.p.e.c.f.FsParserAbstract] FileAbstractModel = FileAbstractModel{name='romeo_and_juliet.pdf', file=true, directory=false, lastModifiedDate=2021-05-15T13:18:02.480083690, creationDate=2021-05-15T13:18:02.480083690, accessDate=2021-05-15T13:18:02.480083690, path='/srv/es/Other', owner='root', group='root', permissions=644, extension='pdf', fullpath='/srv/es/Other/romeo_and_juliet.pdf', size=383706}

13:19:54,499 WARN  [o.a.t.p.PDFParser] J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

13:19:56,615 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler is stopping after 1 run
13:19:56,645 DEBUG [f.p.e.c.f.FsCrawlerImpl] Closing FS crawler [workplace]
13:19:56,646 DEBUG [f.p.e.c.f.FsCrawlerImpl] FS crawler thread is now stopped
13:19:56,646 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
13:19:56,648 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Sending a bulk request of [1] requests
13:19:56,734 TRACE [f.p.e.c.f.c.v.ElasticsearchClientV7] Executed bulk request with [1] requests
13:19:56,747 DEBUG [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Elasticsearch Management Service stopped
13:19:56,747 DEBUG [f.p.e.c.f.c.v.WorkplaceSearchClientV7] Closing Workplace Search V7 client
13:19:56,747 DEBUG [f.p.e.c.f.c.v.ElasticsearchClientV7] Closing Elasticsearch client manager
13:19:56,749 DEBUG [f.p.e.c.f.t.w.WPSearchClient] Closing the WPSearchClient
13:19:56,749 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Closing BulkProcessor
13:19:56,749 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] BulkProcessor is now closed
13:19:56,750 DEBUG [f.p.e.c.f.f.b.FsCrawlerBulkProcessor] Executing [1] remaining actions
13:19:56,750 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Going to execute new bulk composed of 1 actions
13:19:57,959 DEBUG [f.p.e.c.f.t.w.WPSearchBulkResponse] {"results":[{"id":"a3c830b8c88bda934b746d14be7167c","errors":[]}]}
13:19:57,959 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 1 actions
13:19:57,967 DEBUG [f.p.e.c.f.c.v.WorkplaceSearchClientV7] Workplace Search V7 client closed
13:19:57,967 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceWorkplaceSearchImpl] Workplace Search Document Service stopped
13:19:57,967 DEBUG [f.p.e.c.f.FsCrawlerImpl] ES Client Manager stopped
13:19:57,967 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler [workplace] stopped

EnterpriseSearch:

enterprisesearch_1  | [2021-05-15T11:19:43.976+00:00][8][2570][app-server][INFO]: [ee3ae784-3190-4eaa-8e05-594f9f5b4a01] Started GET "/ws/authenticate/my_user_info" for 172.16.0.154 at 2021-05-15 11:19:43 +0000
enterprisesearch_1  | [2021-05-15T11:19:43.984+00:00][8][2570][action_controller][INFO]: [ee3ae784-3190-4eaa-8e05-594f9f5b4a01] Processing by FritoPie::AuthenticateController#my_user_info as JSON
enterprisesearch_1  | [2021-05-15T11:19:44.033+00:00][8][2570][action_controller][INFO]: [ee3ae784-3190-4eaa-8e05-594f9f5b4a01] Completed 200 OK in 48ms (Views: 0.5ms)
enterprisesearch_1  | [2021-05-15T11:19:57.242+00:00][8][2570][app-server][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8] Started POST "/api/ws/v1/sources/609facc39e00d07aea1c09f0/documents/bulk_create" for 172.19.0.1 at 2021-05-15 11:19:57 +0000
enterprisesearch_1  | [2021-05-15T11:19:57.262+00:00][8][2570][action_controller][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8] Processing by Api::FritoPie::V1::DocumentsController#bulk_create as JSON
enterprisesearch_1  | [2021-05-15T11:19:57.263+00:00][8][2570][action_controller][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8]   Parameters: {"_json"=>"[FILTERED]", "content_source_id"=>"609facc39e00d07aea1c09f0"}
enterprisesearch_1  | [2021-05-15T11:19:57.708+00:00][8][2570][worker][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8] [ActiveJob] Enqueueing a job into the '.ent-search-esqueues-me_queue_v1_refresh_document_counts' index. {"job_type"=>"ActiveJob::QueueAdapters::EsqueuesMeAdapter::JobWrapper", "payload"=>{"args"=>[{"job_class"=>"Work::Engine::RefreshDocumentCounts", "job_id"=>"f67d905670e8b9cb49574215128d4c2999072991", "queue_name"=>"refresh_document_counts", "arguments"=>["609facc49e00d07aea1c09f1"], "locale"=>:en, "executions"=>1}]}, "status"=>"pending", "created_at"=>1621077597708, "perform_at"=>1621077657706, "attempts"=>0}
enterprisesearch_1  | [2021-05-15T11:19:57.759+00:00][8][2570][active_job][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8] [ActiveJob] Enqueued Work::Engine::RefreshDocumentCounts job (f67d905670e8b9cb49574215128d4c2999072991) on `refresh_document_counts`
enterprisesearch_1  | [2021-05-15T11:19:57.762+00:00][8][2570][active_job][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8] [ActiveJob] Enqueued Work::Engine::RefreshDocumentCounts (Job ID: f67d905670e8b9cb49574215128d4c2999072991) to EsqueuesMe(refresh_document_counts) at 2021-05-15 11:20:57 UTC with arguments: "609facc49e00d07aea1c09f1"
enterprisesearch_1  | [2021-05-15T11:19:57.931+00:00][8][2570][action_controller][INFO]: [97b33bc8-53ac-433c-8a0a-61f087b639a8] Completed 200 OK in 666ms (Views: 2.1ms)

Everything looks fine. I don't know why the process stops at the end.

dadoonet · May 15, 2021, 12:02pm

FSCrawler stops because it has sent all the documents.
The "watch" mode is not supported yet.

jeje232 · May 17, 2021, 6:03pm

Hello,

It works perfectly.

I tested with a 60 MB PDF of 1418 pages, with pictures etc., using the previous configuration...
I just increased indexed_chars to 10,000,000 lol... and it keeps working perfectly.

My Debian 10 VM has just 8 GB for everything. I gave the Elasticsearch JVM 4 GB and Enterprise Search 2 GB.

The searches are very fast.

Very nice solution for all the documentation.

Thank you all...

dadoonet · May 17, 2021, 9:40pm

You can set this setting to -1 and it will extract everything.
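
A minimal sketch of that suggestion in the fs section of the FSCrawler _settings.yaml; other fs keys are omitted. The extracted text still has to fit under the Workplace Search document size limit discussed earlier in this thread, so a very large book may need that limit raised as well.

```yaml
# _settings.yaml (FSCrawler job settings), sketch only.
fs:
  # -1 removes the character limit, so the full text of each file is extracted.
  indexed_chars: "-1"
```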

