Continuing the discussion from FSCrawler: Error while crawling - Invalid UTF-8 start byte 0xb5:
Hi,
I have a recent build of fscrawler-es7-2.7-SNAPSHOT installed. I have a folder containing a wide range of document types (JPG, XML, EML, PDF, DOCX, etc.) which I want to index.
When I set xml_support: true, FSCrawler stops working and logs the error
FSCrawler: Error while crawling - Invalid UTF-8 start byte 0xb5
as soon as it encounters a non-XML file.
I found the linked discussion, which describes the same behaviour. In that discussion post #2 is marked as the solution, but I fail to see a solution in that post. Setting xml_support: false fixes the error, but it also means that I cannot use the functionality described in the "Indexing XML docs" section of the documentation.
From reading the documentation, my expectation is that setting xml_support: true would only affect the processing of XML documents, not all formats.
Is there a bug related to xml_support: true? Or should XML files be crawled by a separate FSCrawler job, independent from the one that handles the other documents?
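To make the second option concrete, here is a minimal sketch of the split I have in mind. It assumes that fs.includes and fs.excludes accept the same pattern syntax as my existing excludes entry. The job names data_xml and data_other are hypothetical, each document would live in its own job's _settings.yaml, and all other fs and elasticsearch options would stay as in my settings. I have not verified that this avoids the error.

```yaml
# Hypothetical job 1 (its own _settings.yaml): XML files only, XML parsing enabled
name: "data_xml"
fs:
  url: "xxxxx"
  includes:
  - "*/*.xml"
  xml_support: true
---
# Hypothetical job 2 (its own _settings.yaml): everything except XML files
name: "data_other"
fs:
  url: "xxxxx"
  excludes:
  - "*/~*"
  - "*/*.xml"
  xml_support: false
```

As far as I understand, each job writes to an index named after the job by default, so elasticsearch.index would probably have to be set on both jobs to keep all documents in one index.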
The FSCrawler settings I use are:
name: "data"
fs:
  url: "xxxxx"
  update_rate: "15s"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: true
  index_folders: true
  lang_detect: true
  continue_on_error: true
  ocr:
    language: "eng+nld"
    path: "xxxxx\\Tesseract-OCR"
    data_path: "xxxxxxx\\Tesseract-OCR\\tessdata"
    enabled: true
    pdf_strategy: "auto"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "https://01.dev:9200"
  - url: "https://02.dev:9200"
  - url: "https://03.dev:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  username: "user"
  password: "pwd"
Stacktrace:
11:06:20,618 DEBUG [f.p.e.c.f.FsParserAbstract] fetching content from [xxxxx.jpg]
11:06:20,639 TRACE [f.p.e.c.f.t.XmlDocParser] Converting XML document [{}]
11:06:20,639 TRACE [f.p.e.c.f.t.XmlDocParser] Converting XML document [{}]
11:06:20,656 WARN [f.p.e.c.f.FsParserAbstract] Error while crawling xxxxx: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xff (at char #1, byte #-1)
11:06:20,656 WARN [f.p.e.c.f.FsParserAbstract] Full stacktrace
java.lang.RuntimeException: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xff (at char #1, byte #-1)
at fr.pilato.elasticsearch.crawler.fs.tika.XmlDocParser.asMap(XmlDocParser.java:86) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.XmlDocParser.generateMap(XmlDocParser.java:76) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.tika.XmlDocParser.generate(XmlDocParser.java:60) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.indexFile(FsParserAbstract.java:498) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.addFilesRecursively(FsParserAbstract.java:267) ~[fscrawler-core-2.7-SNAPSHOT.jar:?]
at fr.pilato.elasticsearch.crawler.fs.FsParserAbstract.run(FsParserAbstract.java:149) [fscrawler-core-2.7-SNAPSHOT.jar:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0xff (at char #1, byte #-1)
at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:690) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:568) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:972) ~[jackson-core-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3242) ~[jackson-databind-2.10.1.jar:2.10.1]
at fr.pilato.elasticsearch.crawler.fs.tika.XmlDocParser.asMap(XmlDocParser.java:84) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
... 6 more
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 start byte 0xff (at char #1, byte #-1)
at com.ctc.wstx.sr.StreamScanner.constructFromIOE(StreamScanner.java:653) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1017) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:770) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2078) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1179) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:685) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:568) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:972) ~[jackson-core-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3242) ~[jackson-databind-2.10.1.jar:2.10.1]
at fr.pilato.elasticsearch.crawler.fs.tika.XmlDocParser.asMap(XmlDocParser.java:84) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
... 6 more
Caused by: java.io.CharConversionException: Invalid UTF-8 start byte 0xff (at char #1, byte #-1)
at com.ctc.wstx.io.UTF8Reader.reportInvalidInitial(UTF8Reader.java:305) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:190) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:89) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1011) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.StreamScanner.getNext(StreamScanner.java:770) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:2078) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1179) ~[woodstox-core-6.0.2.jar:6.0.2]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._initializeXmlReader(XmlFactory.java:685) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:568) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.dataformat.xml.XmlFactory._createParser(XmlFactory.java:29) ~[jackson-dataformat-xml-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.core.JsonFactory.createParser(JsonFactory.java:972) ~[jackson-core-2.10.1.jar:2.10.1]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3242) ~[jackson-databind-2.10.1.jar:2.10.1]
at fr.pilato.elasticsearch.crawler.fs.tika.XmlDocParser.asMap(XmlDocParser.java:84) ~[fscrawler-tika-2.7-SNAPSHOT.jar:?]
... 6 more