Ingest-attachment not parsing docx


(ben) #1

I have ES 6.0.1 on my localhost, and the same version on an AWS server. On my localhost I installed the ingest-attachment plugin and tried to index some docx files. They are not parsed, and ES returns a content_length of 0. On the server, using the same code, they are parsed as expected.

On localhost the document looks like:

{
  "_index": "testing",
  "_type": "documents",
  "_id": "45061422-cf3a-4d23-8b20-0ad27a272735",
  "_version": 1,
  "found": true,
  "_source": {
    "path": """D:\draft 1.docx""",
    "filename": "draft 1.docx",
    "attachment": {
      "content_type": "application/x-tika-ooxml",
      "content_length": 0
    }
  }
}

On the server it looks like:

{
  "_index": "testing",
  "_type": "documents",
  "_id": "afb12bcf-a8c8-43b4-838f-10336b28e91a",
  "_version": 1,
  "found": true,
  "_source": {
    "path": """D:\draft 1.docx""",
    "filename": "draft 1.docx",
    "attachment": {
      "date": "2018-05-09T05:51:00Z",
      "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "author": "Tom W",
      "language": "en",
      "title": "Letter",
      "content": """<snipped>""",
      "content_length": 28769
    }
  }
}

Why the difference in the content_type? What have I done wrong on my localhost?
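
For anyone trying to reproduce this, here is a minimal sketch of the two request bodies the attachment processor typically needs. The field name `data` and the helper names are assumptions for illustration, not the code used above; the snippet only builds the JSON bodies and does not contact a cluster. The stored documents above show no base64 field, so the real pipeline presumably removes it afterwards; that step is left out here.

```python
import base64


def build_pipeline(field="data"):
    """Body for PUT _ingest/pipeline/<name>: run the attachment processor on `field`.

    `field` is an assumed name; use whatever field carries the base64 payload.
    """
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def build_document(raw_bytes, path, filename, field="data"):
    """Body for indexing one file through the pipeline; the file is sent as base64."""
    return {
        "path": path,
        "filename": filename,
        field: base64.b64encode(raw_bytes).decode("ascii"),
    }
```

The document body would be sent with the `pipeline` query parameter set to the pipeline's name, so that the processor runs at index time.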


(David Pilato) #2

Interesting. What kind of servers do you have?
Maybe it is something related to a Locale setting?


(ben) #3

I've tried on yet another server, and no problem there either.

I think it must be an issue with Apache Tika. My laptop has JDK 10, while the servers have JDK 8. If I parse the docx document directly in Tika, it does work, but it spits out some warnings and errors. Perhaps they are enough to ruin the party for ES. Here's an example of the output from Tika:

On starting the jar:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.poi.openxml4j.util.ZipSecureFile$1 (file:/D:/opt/tika-1.17/tika-app-1.17.jar) to field java.io.FilterInputStream.in
WARNING: Please consider reporting this to the maintainers of org.apache.poi.openxml4j.util.ZipSecureFile$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

On parsing the docx:

X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.microsoft.ooxml.OOXMLParser
X-TIKA:EXCEPTION:embedded_stream_exception: java.lang.ClassCastException: org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream cannot be cast to java.base/java.util.zip.ZipFile$ZipFileInputStream

and

X-TIKA:EXCEPTION:embedded_stream_exception: java.lang.ClassCastException: org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream cannot be cast to java.base/java.util.zip.ZipFile$ZipFileInputStream
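
To check several files for the same problem, the metadata output can be scanned for these keys. This is only a sketch: it assumes the metadata has already been captured as text in the `key: value` form shown above, and `tika_exceptions` is a hypothetical helper name, not part of Tika.

```python
def tika_exceptions(metadata_text):
    """Return (key, message) pairs for every X-TIKA:EXCEPTION:* metadata line."""
    prefix = "X-TIKA:EXCEPTION:"
    found = []
    for line in metadata_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Split "embedded_stream_exception: java.lang..." at the first ": ".
        key, _, message = line[len(prefix):].partition(": ")
        found.append((key, message))
    return found
```

An empty list would suggest Tika reported no exception for that file; it says nothing about whether the extracted text is complete.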


(David Pilato) #4

What happens if you are using jdk8 on your laptop?


(ben) #5

I don't have it installed. I'll try it soon and report back here.


(ben) #6

According to the docs, Elasticsearch honours the JAVA_HOME environment variable. I installed JDK 8 (alongside the existing JDK 10), then changed JAVA_HOME to point to JDK 8. I still cannot get docx files to parse.

I don't know what else to try, apart from uninstalling jdk10, which I don't want to do.


(ben) #7

Idiotic mistake - I forgot to restart the terminal, so it had kept the old JAVA_HOME environment variable and ES was still starting on JDK 10.

Now it's happily indexing docx files with JDK 8.
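
In case it saves someone else the same confusion, here is a small sketch for confirming which JVM a node is actually running on, rather than trusting JAVA_HOME. It assumes you have already fetched the body of `GET _nodes/jvm` as a dict; the helper names are hypothetical.

```python
import os


def jvm_versions(nodes_info):
    """Map node name -> JVM version from a GET _nodes/jvm response body."""
    return {
        node.get("name", node_id): node["jvm"]["version"]
        for node_id, node in nodes_info.get("nodes", {}).items()
    }


def java_home():
    """JAVA_HOME as seen by the current process.

    A shell opened before the variable was changed still holds the old value.
    """
    return os.environ.get("JAVA_HOME")
```

If the reported version does not match the JDK you expect, the process was most likely started from an environment that still had the old JAVA_HOME.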


(David Pilato) #8

Great. Then I believe this is something to raise with the Tika team at https://issues.apache.org/jira/browse/TIKA

