Our company recently had some connectivity issues with Elasticsearch. Suddenly, on some days of the week the cluster went RED, and the only information in the logs was that the node had been disconnected.
On further investigation, we discovered that one of our customers was trying to index a docx file that contained an embedded PDF, and when the request reached the ingest-attachment plugin, the Elasticsearch service was killed by a StackOverflowError. See the logs collected from the node:
[2020-10-25T01:09:42,214][INFO ][o.e.c.m.MetaDataCreateIndexService] [0f0qpyW] [2c9b808575556dcd0175556e03e30000] creating index, cause [api], templates [], shards [1]/[0], mappings [_doc]
[2020-10-25T01:09:42,422][INFO ][o.e.c.r.a.AllocationService] [0f0qpyW] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[2c9b808575556dcd0175556e03e30000][0]] ...]).
[2020-10-25T01:09:42,540][INFO ][o.e.c.m.MetaDataMappingService] [0f0qpyW] [2c9b808575556dcd0175556e03e30000/_z-DGoGXRyGOZkDNJp7pKw] update_mapping [_doc]
[2020-10-25T01:09:45,473][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [0f0qpyW] fatal error in thread [elasticsearch[0f0qpyW][write][T#2]], exiting
java.lang.StackOverflowError: null
at java.util.HashMap.hash(HashMap.java:339) ~[?:?]
at java.util.LinkedHashMap.get(LinkedHashMap.java:440) ~[?:?]
at org.apache.pdfbox.cos.COSDictionary.getDictionaryObject(COSDictionary.java:188) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:135) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:38) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:166) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:169) ~[?:?]
Adobe Acrobat Reader cannot even open this embedded PDF; it reports an error saying the file may be corrupted. But does it make sense for Elasticsearch to die because of that? Shouldn't it ignore cases like this so as not to compromise the entire cluster?
Elasticsearch version: 6.8.0
Host: AWS Elasticsearch Service
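For context, the failing request is essentially the base64-encoded file being indexed through the attachment pipeline. A minimal sketch of that call with the low-level Java REST client (the index, pipeline, field, and file names here are just placeholders, not the customer's actual values):

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class IndexAttachment {
    public static void main(String[] args) throws Exception {
        // Base64-encode the docx, as the ingest-attachment plugin expects.
        String data = Base64.getEncoder()
                .encodeToString(Files.readAllBytes(Paths.get("invoice.docx")));
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Placeholder index ("docs") and pipeline ("attachment") names.
            Request index = new Request("PUT", "/docs/_doc/1?pipeline=attachment");
            index.setJsonEntity("{ \"data\": \"" + data + "\" }");
            client.performRequest(index);
        }
    }
}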
I have tested the most recent version (7.9.3), but the same problem occurs.
Looking at the log more carefully, the problem occurs in PDPageTree, which comes from Apache PDFBox (which I'm investigating now).
[2020-10-28T14:52:27,849][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [es-node] fatal error in thread [elasticsearch[es-node][write][T#7]], exiting
java.lang.StackOverflowError: null
at java.util.regex.Pattern$BmpCharPredicate.lambda$union$2(Pattern.java:5692) ~[?:?]
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4019) ~[?:?]
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4763) ~[?:?]
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4886) ~[?:?]
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4020) ~[?:?]
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
at java.util.regex.Pattern$Branch.match(Pattern.java:4800) ~[?:?]
at java.util.regex.Pattern$Branch.match(Pattern.java:4798) ~[?:?]
at java.util.regex.Pattern$Branch.match(Pattern.java:4798) ~[?:?]
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4763) ~[?:?]
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4886) ~[?:?]
at java.util.regex.Pattern$BmpCharPropertyGreedy.match(Pattern.java:4394) ~[?:?]
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
at java.util.regex.Pattern$Branch.match(Pattern.java:4800) ~[?:?]
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4763) ~[?:?]
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4886) ~[?:?]
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4020) ~[?:?]
at java.util.regex.Pattern$BmpCharPropertyGreedy.match(Pattern.java:4394) ~[?:?]
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4855) ~[?:?]
at java.util.regex.Pattern$Branch.match(Pattern.java:4800) ~[?:?]
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:4020) ~[?:?]
at java.util.regex.Pattern$Start.match(Pattern.java:3673) ~[?:?]
at java.util.regex.Matcher.search(Matcher.java:1729) ~[?:?]
at java.util.regex.Matcher.find(Matcher.java:773) ~[?:?]
at java.util.Formatter.parse(Formatter.java:2702) ~[?:?]
at java.util.Formatter.format(Formatter.java:2655) ~[?:?]
at java.util.Formatter.format(Formatter.java:2609) ~[?:?]
at java.lang.String.format(String.java:3292) ~[?:?]
at java.util.logging.SimpleFormatter.format(SimpleFormatter.java:176) ~[?:?]
at java.util.logging.StreamHandler.publish(StreamHandler.java:199) ~[?:?]
at java.util.logging.ConsoleHandler.publish(ConsoleHandler.java:95) ~[?:?]
at java.util.logging.Logger.log(Logger.java:979) ~[?:?]
at java.util.logging.Logger.doLog(Logger.java:1006) ~[?:?]
at java.util.logging.Logger.logp(Logger.java:1172) ~[?:?]
at org.apache.commons.logging.impl.Jdk14Logger.log(Jdk14Logger.java:87) ~[?:?]
at org.apache.commons.logging.impl.Jdk14Logger.warn(Jdk14Logger.java:260) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree.getKids(PDPageTree.java:159) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree.access$200(PDPageTree.java:41) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:183) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
at org.apache.pdfbox.pdmodel.PDPageTree$PageIterator.enqueueKids(PDPageTree.java:186) ~[?:?]
But I think that problems coming from plugins used by ES should not affect the integrity of the node. Maybe it would be interesting to have some handling that just returns an error for the request instead of killing the entire node. There is a flag that can be set in pipelines (ignore_failure), but it did not solve this situation. Don't you think so?
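To be concrete, this is roughly the kind of pipeline I mean, sketched with the low-level Java REST client (the pipeline name and field are placeholders). ignore_failure only swallows exceptions thrown by the processor, so it cannot stop a fatal StackOverflowError from taking the whole JVM down:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class AttachmentPipeline {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Placeholder pipeline name; ignore_failure is the per-processor flag
            // mentioned above -- it handles processor exceptions, not JVM-level errors.
            Request put = new Request("PUT", "/_ingest/pipeline/attachment");
            put.setJsonEntity(
                "{ \"processors\": [ { \"attachment\": {"
                + " \"field\": \"data\", \"ignore_failure\": true } } ] }");
            client.performRequest(put);
        }
    }
}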
We strongly encourage keeping Tika processing out of the same JVM/VM/machine/rack/data center as your indexer or even the ingest process.
This can be done with tika-batch, the ForkParser or tika-server. These three options remove the potential for catastrophic problems affecting the indexing process.
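For example, a minimal ForkParser sketch (the file path and pool size are just placeholders) that runs the actual parsing in separate child JVMs, so a crash there cannot take the calling process down:

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class IsolatedExtract {
    public static void main(String[] args) throws Exception {
        // Parsing happens in forked JVMs; if one of them dies, only that document fails.
        ForkParser parser = new ForkParser(IsolatedExtract.class.getClassLoader(),
                                           new AutoDetectParser());
        parser.setPoolSize(4); // number of child JVMs to keep around (placeholder)
        try (InputStream in = Files.newInputStream(Paths.get("embedded.docx"))) {
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            parser.parse(in, handler, new Metadata(), new ParseContext());
            System.out.println(handler.toString());
        } finally {
            parser.close();
        }
    }
}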
We do what we can when we find problems on Apache Tika, but we know and loudly proclaim that robust parsing of untrusted documents must be run in an isolated JVM.
We're happy to help you @dadoonet make FSCrawler and/or ingest-attachment more robust if you have an interest...