I have a question about FSCrawler.
Is it possible to crawl password-protected Word documents? I came across this issue on the GitHub repository, but it doesn't provide much insight beyond a single unit test. I can't find any mention of it in the documentation either.
When trying to crawl a password-protected Word document, I get the following exception:
08:24:49,504 DEBUG [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [document-with-password.docx]
org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:274) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1]
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183) ~[tika-parser-microsoft-module-2.9.1.jar:2.9.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-core-2.9.1.jar:2.9.1]
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:203) ~[tika-core-2.9.1.jar:2.9.1]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaInstance.extractText(TikaInstance.java:197) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.TikaDocParser.generate(TikaDocParser.java:98) ~[fscrawler-tika-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.uploadToDocumentService(DocumentApi.java:205) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.rest.DocumentApi.addDocument(DocumentApi.java:94) ~[fscrawler-rest-2.10-SNAPSHOT.jar:?]
at jdk.internal.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) ~[?:?]
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52) ~[jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248) [jersey-common-3.1.5.jar:?]
at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244) [jersey-common-3.1.5.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:292) [jersey-common-3.1.5.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:274) [jersey-common-3.1.5.jar:?]
at org.glassfish.jersey.internal.Errors.process(Errors.java:244) [jersey-common-3.1.5.jar:?]
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265) [jersey-common-3.1.5.jar:?]
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697) [jersey-server-3.1.5.jar:?]
at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:367) [jersey-container-grizzly2-http-3.1.5.jar:?]
at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:190) [grizzly-http-server-4.0.1.jar:4.0.1]
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:535) [grizzly-framework-4.0.1.jar:4.0.1]
at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:515) [grizzly-framework-4.0.1.jar:4.0.1]
at java.base/java.lang.Thread.run(Thread.java:834) [?:?]
08:24:49,505 TRACE [f.p.e.c.f.t.TikaDocParser] End document generation
I'm using FSCrawler version 2.10-20240325.073416-333.
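
In case it helps narrow this down: the trace above goes through OfficeParser, which as far as I understand is Tika's parser for OLE2 containers. My assumption is that a password-protected .docx is stored as an OLE2 compound file wrapping the encrypted package, while an unprotected .docx is a plain ZIP archive. If that holds, a quick header check can tell the two apart before a file is uploaded. The sketch below is my own and is not part of FSCrawler or Tika; the class name OfficeContainerSniffer and its methods are made up for illustration, and an OLE2 result only suggests encryption (a legacy .doc renamed to .docx would look the same).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not an FSCrawler or Tika class): classifies a file
// by its leading magic bytes only. It never opens or decrypts the document.
final class OfficeContainerSniffer {

    enum Kind { ZIP, OLE2, UNKNOWN }

    // Signature at the start of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a ZIP archive ("PK\3\4").
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    // Classify from the first bytes of a file that are already in memory.
    static Kind sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP;
        }
        return Kind.UNKNOWN;
    }

    // Read at most 8 bytes from the file and classify them.
    static Kind sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    // True when data begins with prefix; false if data is shorter than prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }
}
```

Calling `OfficeContainerSniffer.sniff(Path.of("document-with-password.docx"))` and getting `OLE2` back for a file with a .docx extension would be my cue to skip it or handle it separately. This only works around the exception; it does not answer whether FSCrawler can be given the password.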