Ingest plugin .docx issue

Ananth_Rao · March 4, 2019, 11:29am

Hi,

I'm using ES 6.5.4 with the ingest-attachment plugin. It works fine with PDF, TXT and other file types, but breaks for .docx and .doc files.
The index gets created, but the content is not parsed, so it cannot be searched.
Below is the output:
"attachment":{"content_type":"application/x-tika-ooxml","content_length":0}}

Can you please help me resolve this issue?
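
The generic content type `application/x-tika-ooxml` together with a `content_length` of 0 suggests that the parser recognised an OOXML container but extracted no text from it. One way to rule out a damaged file before looking at the plugin is to check locally that the .docx is a readable ZIP package containing the parts Word normally writes. A rough Python sketch follows (standard library only; the function name `looks_like_docx` is made up for illustration, and passing this check does not guarantee that the attachment processor can parse the file):

```python
import io
import zipfile

# Parts that a Word .docx package is expected to contain.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def looks_like_docx(file_bytes: bytes) -> bool:
    """Return True if the bytes form a ZIP archive with the usual .docx parts."""
    buffer = io.BytesIO(file_bytes)
    if not zipfile.is_zipfile(buffer):
        return False
    with zipfile.ZipFile(buffer) as package:
        names = set(package.namelist())
    return all(part in names for part in REQUIRED_PARTS)
```

Legacy .doc files are not ZIP-based, so this check applies to .docx only.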

dadoonet · March 4, 2019, 11:40am

Could you share your .docx document? I'd like to try it.

Ananth_Rao · March 4, 2019, 12:07pm

Thanks for the reply. My .docx file doesn't contain any images or special formatting; it's a simple plain-text document saved as .docx. One more piece of information: I have two different versions of ES on different servers. There are no issues with .docx in ES 6.5.0, but I'm facing the issue with ES 6.5.4.

dadoonet · March 4, 2019, 12:19pm

But could you share it then?

Ananth_Rao · March 4, 2019, 1:55pm

Hi, can you tell me how to share the file?

dadoonet · March 4, 2019, 2:10pm

Maybe there? https://filebin.ca/

Ananth_Rao · March 4, 2019, 2:19pm

https://filebin.ca/4Z2J1tObkmTt/test123.docx

Please use the file above.

dadoonet · March 4, 2019, 2:47pm

I tried your document on a 6.6.1 version with:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "test",
    "processors": [
      {
        "attachment": {
          "field": "data"
        }
      },
      {
        "remove": {
          "field": "data"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "data": "***BASE 64 CONTENT***"
      }
    }
  ]
}

This gave:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "id",
        "_source" : {
          "attachment" : {
            "date" : "2019-03-04T12:03:00Z",
            "language" : "et",
            "content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "author" : "Ananthmurthy Rao",
            "content" : """
1. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. 

Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.
""",
            "content_length" : 816
          }
        },
        "_ingest" : {
          "timestamp" : "2019-03-04T14:46:00.089544Z"
        }
      }
    }
  ]
}
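
To repeat the simulate call above with a local file, the `data` field needs the file's bytes encoded as base64. A minimal Python sketch of building that request body follows. It uses only the standard library; the function name `build_simulate_body` is illustrative, and the body simply mirrors the request shown in this thread rather than any official client API.

```python
import base64
import json


def build_simulate_body(file_bytes: bytes) -> dict:
    """Build a request body for POST _ingest/pipeline/_simulate.

    The attachment processor reads the document as a base64 string
    from the configured field ("data" here, as in the request above).
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "pipeline": {
            "description": "test",
            "processors": [
                {"attachment": {"field": "data"}},
                {"remove": {"field": "data"}},
            ],
        },
        "docs": [
            {
                "_index": "index",
                "_type": "_doc",
                "_id": "id",
                "_source": {"data": encoded},
            }
        ],
    }


def to_json(body: dict) -> str:
    """Serialise the body so it can be pasted into a console or sent over HTTP."""
    return json.dumps(body, indent=2)
```

For example, reading `test123.docx` in binary mode and passing its bytes to `build_simulate_body` gives a body that can be sent to `_ingest/pipeline/_simulate` on each cluster (6.5.0, 6.5.4, 6.6.1) to compare the results.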

system · April 1, 2019, 2:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
