Ingest-attachment not parsing docx

ottadini · May 29, 2018, 12:03pm

I have ES 6.0.1 on my localhost, and the same version on an AWS server. On my localhost I installed the ingest-attachment plugin and tried to index some docx files. They are not parsed, and ES returns a content_length of 0. On the server, using the same code, they are parsed as expected.

On localhost the document looks like:

{
  "_index": "testing",
  "_type": "documents",
  "_id": "45061422-cf3a-4d23-8b20-0ad27a272735",
  "_version": 1,
  "found": true,
  "_source": {
    "path": """D:\draft 1.docx""",
    "filename": "draft 1.docx",
    "attachment": {
      "content_type": "application/x-tika-ooxml",
      "content_length": 0
    }
  }
}

On the server it looks like:

{
  "_index": "testing",
  "_type": "documents",
  "_id": "afb12bcf-a8c8-43b4-838f-10336b28e91a",
  "_version": 1,
  "found": true,
  "_source": {
    "path": """D:\draft 1.docx""",
    "filename": "draft 1.docx",
    "attachment": {
      "date": "2018-05-09T05:51:00Z",
      "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "author": "Tom W",
      "language": "en",
      "title": "Letter",
      "content": """<snipped>""",
      "content_length": 28769
    }
  }
}

Why the difference in the content_type? What have I done wrong on my localhost?
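For anyone comparing machines the same way, here is a minimal sketch of checking a fetched document for the symptom above. It assumes the document has already been retrieved as a Python dict; the helper name `attachment_looks_unparsed` and the trimmed sample documents are illustrative only, not part of any Elasticsearch API.

```python
# Hypothetical helper: flags documents whose attachment was not extracted.
# The field names (content, content_length, content_type) are the ones
# shown in the two documents above.

def attachment_looks_unparsed(doc):
    """Return True when the attachment has no extracted content."""
    attachment = doc.get("_source", {}).get("attachment", {})
    has_content = bool(attachment.get("content"))
    length = attachment.get("content_length", 0)
    return not has_content and length == 0


# Trimmed copies of the two documents from the post.
LOCAL_DOC = {
    "_source": {
        "filename": "draft 1.docx",
        "attachment": {
            "content_type": "application/x-tika-ooxml",
            "content_length": 0,
        },
    }
}

SERVER_DOC = {
    "_source": {
        "filename": "draft 1.docx",
        "attachment": {
            "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "content": "<snipped>",
            "content_length": 28769,
        },
    }
}

if __name__ == "__main__":
    for name, doc in (("localhost", LOCAL_DOC), ("server", SERVER_DOC)):
        print(name, "unparsed:", attachment_looks_unparsed(doc))
```

Running a check like this over a batch of indexed documents would show whether the failure is limited to one machine or one file type.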

dadoonet · May 29, 2018, 2:07pm

Interesting. What kind of servers do you have?
Maybe it is something related to a Locale setting?

ottadini · May 30, 2018, 5:05am

I've tried on yet another server, and no problem there either.

I think it must be an issue with Apache Tika. My laptop has JDK 10, while the servers have JDK 8. If I parse the docx document directly in Tika, it does work, but it spits out some warnings and errors. Perhaps they are enough to ruin the party for ES. Here's an example of the output from Tika:

On starting the jar:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.poi.openxml4j.util.ZipSecureFile$1 (file:/D:/opt/tika-1.17/tika-app-1.17.jar) to field java.io.FilterInputStream.in
WARNING: Please consider reporting this to the maintainers of org.apache.poi.openxml4j.util.ZipSecureFile$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

On parsing the docx:

X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.microsoft.ooxml.OOXMLParser
X-TIKA:EXCEPTION:embedded_stream_exception: java.lang.ClassCastException: org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream cannot be cast to java.base/java.util.zip.ZipFile$ZipFileInputStream

and

X-TIKA:EXCEPTION:embedded_stream_exception: java.lang.ClassCastException: org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream cannot be cast to java.base/java.util.zip.ZipFile$ZipFileInputStream

dadoonet · May 30, 2018, 5:16am

What happens if you use JDK 8 on your laptop?

ottadini · May 30, 2018, 6:08am

I don't have it installed. I'll try it soon and report back here.

ottadini · May 30, 2018, 6:47am

According to the docs, Elasticsearch honours the JAVA_HOME environment variable. I installed JDK 8 (alongside the existing JDK 10), then changed JAVA_HOME to point to JDK 8. I still cannot get docx files to parse.

I don't know what else to try, apart from uninstalling jdk10, which I don't want to do.

ottadini · May 30, 2018, 6:57am

Idiotic mistake - I forgot to restart the terminal. It had kept the old JAVA_HOME env variable.

Now it's happily indexing docx files with jdk8.
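For anyone who trips over the same stale environment, a minimal sketch of checking which Java the current terminal session would actually use. It assumes Python 3.7+ is available; the function names `parse_java_major` and `report_java` are made up for this example, and the regex only covers the usual `java -version` banner formats (such as `1.8.0_171` and `10.0.1`).

```python
import os
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy "1.8.0_171" scheme (major is 8) and the
    newer "10.0.1" scheme (major is 10). Returns None if no
    version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def report_java():
    """Print the JAVA_HOME this process inherited and its Java major version."""
    java_home = os.environ.get("JAVA_HOME")
    print("JAVA_HOME seen by this process:", java_home)
    if not java_home:
        return None
    java_bin = os.path.join(java_home, "bin", "java")
    try:
        # `java -version` normally writes its banner to stderr.
        result = subprocess.run(
            [java_bin, "-version"], capture_output=True, text=True
        )
    except OSError as exc:
        print("Could not run", java_bin, "-", exc)
        return None
    major = parse_java_major(result.stderr or result.stdout)
    print("Java major version:", major)
    return major


if __name__ == "__main__":
    report_java()
```

Run it from the same terminal that starts Elasticsearch. If it still reports the old path, that session predates the JAVA_HOME change and needs to be reopened.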

dadoonet · May 30, 2018, 10:24am

Great. Then I believe this is something to raise with the Tika team at https://issues.apache.org/jira/browse/TIKA

system · June 27, 2018, 10:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
