Hello,
I want to use FSCrawler to push my PDF books to Workplace Search.
I tried different bulk_size and flush_interval values, but nothing helps: I always get the same "Exceeds maximum allowed document size of 102400 bytes" error.
My trace logs are:
18:51:17,039 DEBUG [f.p.e.c.f.FsParserAbstract] [/Other/romeo_and_juliet.pdf] can be indexed: [true]
18:51:17,039 DEBUG [f.p.e.c.f.FsParserAbstract] - file: /Other/romeo_and_juliet.pdf
18:51:17,040 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [/srv/es/Other],[romeo_and_juliet.pdf]
18:51:17,040 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/srv/es, /srv/es/Other/romeo_and_juliet.pdf) = /Other/romeo_and_juliet.pdf
18:51:17,040 TRACE [f.p.e.c.f.t.TikaDocParser] Generating document [/srv/es/Other/romeo_and_juliet.pdf]
18:51:17,040 TRACE [f.p.e.c.f.t.TikaDocParser] Beginning Tika extraction
18:51:18,270 DEBUG [f.p.e.c.f.t.w.WPSearchBulkResponse] {"results":[{"id":"a4ba2a663e56d88185216a513553dc55","errors":[]},{"id":"f83f53e7451650d84d2a8e961ac4bf5c","errors":["Exceeds maximum allowed document size of 102400 bytes"]}]}
18:51:18,270 DEBUG [f.p.e.c.f.f.b.FsCrawlerSimpleBulkProcessorListener] Executed bulk composed of 2 actions
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] End of Tika extraction
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] Listing all available metadata:
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw.entrySet(), iterableWithSize(37));
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:unmappedUnicodeCharsPerPage", "0"));
18:51:18,437 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:PDFVersion", "1.4"));
18:51:18,438 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:title", "Romeo and Juliet"));
18:51:18,438 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("xmp:CreatorTool", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,439 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:hasXFA", "false"));
18:51:18,439 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:modify_annotations", "true"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:can_print_degraded", "true"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dc:creator", "William Shakespeare"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dcterms:created", "2017-02-25T04:34:55Z"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dc:format", "application/pdf; version=1.4"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("title", "Romeo and Juliet"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:creator_tool", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:fill_in_form", "true"));
18:51:18,440 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:encrypted", "false"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("dc:title", "Romeo and Juliet"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:hasMarkedContent", "false"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Content-Type", "application/pdf"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:creator", "William Shakespeare"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("X-Parsed-By", "org.apache.tika.parser.pdf.PDFParser"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("creator", "William Shakespeare"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("meta:author", "William Shakespeare"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("meta:creation-date", "2017-02-25T04:34:55Z"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("created", "2017-02-25T04:34:55Z"));
18:51:18,441 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:extract_for_accessibility", "true"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:assemble_document", "true"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("xmpTPg:NPages", "143"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Creation-Date", "2017-02-25T04:34:55Z"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("resourceName", "romeo_and_juliet.pdf"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:hasXMP", "true"));
18:51:18,442 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:charsPerPage", "47"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:extract_content", "true"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:can_print", "true"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("Author", "William Shakespeare"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("producer", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,443 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("access_permission:can_modify", "true"));
18:51:18,444 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:producer", "calibre 2.53.0 [http://calibre-ebook.com]"));
18:51:18,444 TRACE [f.p.e.c.f.t.TikaDocParser] assertThat(raw, hasEntry("pdf:docinfo:created", "2017-02-25T04:34:55Z"));
18:51:18,445 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
18:51:18,445 TRACE [f.p.e.c.f.f.FsCrawlerUtil] No pattern always matches.
18:51:18,445 DEBUG [f.p.e.c.f.f.FsCrawlerUtil] computeVirtualPathName(/srv/es, /srv/es/Other/romeo_and_juliet.pdf) = /Other/romeo_and_juliet.pdf
18:51:18,445 DEBUG [f.p.e.c.f.s.FsCrawlerDocumentServiceWorkplaceSearchImpl] Indexing workplace/a3c830b8c88bda934b746d14be7167c?pipeline=null
18:51:18,446 DEBUG [f.p.e.c.f.t.w.WPSearchClient] Adding document {extension=pdf, comments=null, keywords=null, author=William Shakespeare, created_at=2021-05-14T18:51:00+02:00, language=null, body=
... here the 100 first pages ...
C, title=Romeo and Juliet, url=http://127.0.0.1/Other/romeo_and_juliet.pdf, path=/srv/es/Other/romeo_and_juliet.pdf, size=383706, text_size=null, mime_type=application/pdf, name=romeo_and_juliet.pdf, id=a3c830b8c88bda934b746d14be7167c, last_modified=2021-05-14T18:51:00+02:00}
...
18:51:19,323 DEBUG [f.p.e.c.f.t.w.WPSearchBulkResponse] {"results":[{"id":"f728d2aec73a89f63bed4a66a51d6c37","errors":[]},{"id":"a3c830b8c88bda934b746d14be7167c","errors":["Exceeds maximum allowed document size of 102400 bytes"]}]}
I then removed the workplace_search section from the _settings.yaml of my FSCrawler configuration and kept only the elasticsearch one (my http.max_content_length is set to 400mb), and it works! I can see 100 pages of my book in the workplace index.
I don't know whether it's possible to split the PDF so that no document grows beyond 102400 bytes, or whether the max content length can be increased for my Enterprise Search. I read elsewhere that it's not possible, but I prefer to ask just in case.
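To make clear what I mean by splitting, here is a minimal Python sketch of the idea. It is not something FSCrawler or Workplace Search provides as far as I know. The function name split_text_by_bytes and the HEADROOM_BYTES value are my own assumptions, and I am also assuming that the 102400-byte limit applies to the whole document, so some room has to be left for the other fields (title, path, url, ...).

```python
MAX_DOC_BYTES = 102400  # limit quoted in the error message
HEADROOM_BYTES = 4096   # assumption: room reserved for the other fields


def split_text_by_bytes(text, max_bytes):
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each.

    Pieces are cut on character boundaries, so a multi-byte character
    is never broken in the middle.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (largest UTF-8 character)")
    chunks = []
    current = []
    current_size = 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        # Start a new piece when this character would exceed the budget.
        if current_size + ch_size > max_bytes:
            chunks.append("".join(current))
            current = []
            current_size = 0
        current.append(ch)
        current_size += ch_size
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    # Stand-in for the extracted body of a book (hypothetical content).
    body = "But soft, what light through yonder window breaks? " * 5000
    parts = split_text_by_bytes(body, MAX_DOC_BYTES - HEADROOM_BYTES)
    print(len(parts), "pieces, largest:",
          max(len(p.encode("utf-8")) for p in parts), "bytes")
```

Each piece would then have to be sent as its own document with its own id (for example the original id plus a part number), which is exactly the part I don't know how to do with FSCrawler.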
Thank you very much and best regards