No hits when doing a text search in an attachment for a .docx file

MaryOConnor · February 15, 2016, 12:14pm

Hi,

I'm new to Elasticsearch and am using version 2.2.0.

I am able to index a .txt file and a .doc file and do a search and get a hit.

However, when I try to index a .docx file, I do not get any hits.

My index is client_index.

My type is documents.

This is my mapping json:

{
    "index":"client_index",
    "type":"documents",
    "body":{
        "documents":{
            "properties":{
                "file":{
                    "type":"attachment"
                }
            }
        }
    }
}

I base64 encode the contents of a .docx file and I add the document to the index with the following json:

{
    "index":"client_index",
    "type":"documents",
    "id":"1",
    "body":{
        "file": "' . $file_encoded . '"
    }
}

This is the response that I get:

Array
(
    [_index] => client_index
    [_type] => documents
    [_id] => 1
    [_version] => 1
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )

    [created] => 1
)

The content of the .docx file is simply: "This is a small test."

I do a search with the following json:

// Search for text in an attachment
$search_text = "test";

$json = '
{
    "index":"client_index",
    "type":"documents",
    "body":{
        "query":{
            "query_string":{
                "query": "' . $search_text . '"
            }
        }
    }
}';

This is the response that I get:

(
    [_index] => client_index
    [_type] => documents
    [_id] => 1
    [_version] => 1
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )

    [created] => 1
)

I'm searching for the string "test" but I get no hits.

Can someone help me with this, please? Thanks.

Mary

dadoonet · February 15, 2016, 4:14pm

Can you check that:

- the plugin is loaded
- the mapping has been applied correctly

Then, can you try a match query on field file.content?
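
For readers following along, the three checks above can be sketched in PHP. This is a minimal sketch, assuming the Elasticsearch 2.x REST paths (`/_cat/plugins`, `/{index}/_mapping/{type}`, `/{index}/{type}/_search`) and the 2.x shape of the get-mapping response. The helper names `attachment_check_paths`, `mapping_has_attachment` and `match_content_body` are hypothetical and do not come from the thread.

```php
<?php
// Assumption: paths follow the Elasticsearch 2.x REST API.
// All three functions are hypothetical helpers written for illustration.

// Request paths for the three checks: plugin list, applied mapping, search.
function attachment_check_paths(string $index, string $type): array
{
    return [
        'plugins' => '/_cat/plugins?v',         // lists installed plugins per node
        'mapping' => "/$index/_mapping/$type",  // returns the mapping actually applied
        'search'  => "/$index/$type/_search",   // target for the match query
    ];
}

// Returns true when the decoded get-mapping response declares $field
// with type "attachment" (assumes the 2.x response layout).
function mapping_has_attachment(array $mapping, string $index, string $type, string $field): bool
{
    $declared = $mapping[$index]['mappings'][$type]['properties'][$field]['type'] ?? null;
    return $declared === 'attachment';
}

// JSON body of a match query on the extracted text field file.content.
function match_content_body(string $text): string
{
    return json_encode(['query' => ['match' => ['file.content' => $text]]]);
}
```

The paths would be appended to the node address (for example `http://localhost:9200`, an assumption here), and the search body sent with a POST to the `search` path.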

MaryOConnor · February 16, 2016, 10:21am

Hi David,

Thanks for your reply and for your help with this.

Yesterday I managed to get the search working for all file types except .docx files. That probably means that the plugin is loaded and the mapping has been applied correctly, but I must be missing something, as it is supposed to work with .docx files too.

This is my mapping json:

{
    "index":"client_index",
    "type":"documents",
    "body":{
        "documents":{
            "properties":{
                "file":{
                    "type":"attachment",
                    "fields": {
                        "content": {
                            "type": "string",
                            "term_vector":"with_positions_offsets",
                            "store": true
                        }
                    }
                }
            }
        }
    }
}

and this is my index json:

{
    "index":"client_index",
    "type":"documents",
    "id":"' . $index . '",
    "body":{
        "file":{
            "_content":"' . $file_encoded . '",
            "_name":"' . $file_to_index .'",
            "_content_type":"' . $mime_type . '",
            "_indexed_chars":"-1",
            "_detect_language":"true"
        }
    }
}

and this is my search json:

{
    "index":"client_index",
    "type":"documents",
    "body":{
        "query":{
            "match":{
                "file.content": "' . $search_text . '"
            }
        },
        "highlight": {
            "fields": {
                "file.content": {
                }
            }
        }
    }
}
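
The index and search requests above are assembled by concatenating PHP variables into a JSON string, which breaks as soon as a file name or search term contains a quote. A minimal sketch of the same two requests built as PHP arrays follows, so that `json_encode()` handles the escaping. The function names `build_index_params` and `build_search_params` are hypothetical, and the field values simply mirror the ones in the post.

```php
<?php
// Hypothetical helpers for illustration; not taken from the original code.

// Builds the index request for one file, base64-encoding its raw bytes.
function build_index_params(string $index, string $type, string $id,
                            string $path, string $mime_type): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return [
        'index' => $index,
        'type'  => $type,
        'id'    => $id,
        'body'  => [
            'file' => [
                '_content'         => base64_encode($raw),
                '_name'            => basename($path),
                '_content_type'    => $mime_type,
                '_indexed_chars'   => '-1',   // same string values as in the post
                '_detect_language' => 'true',
            ],
        ],
    ];
}

// Builds the match query on file.content with highlighting.
function build_search_params(string $index, string $type, string $text): array
{
    return [
        'index' => $index,
        'type'  => $type,
        'body'  => [
            'query'     => ['match' => ['file.content' => $text]],
            // An empty PHP array would encode as [], so use an object to get {}.
            'highlight' => ['fields' => ['file.content' => new \stdClass()]],
        ],
    ];
}
```

If, as the `index`/`type`/`body` keys suggest, these parameters go to the elasticsearch-php client, the arrays could be passed directly to `$client->index()` and `$client->search()` without producing a JSON string by hand.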

Your help is much appreciated. Thanks. Mary

dadoonet · February 17, 2016, 4:33am

Can you share your file?

MaryOConnor · February 17, 2016, 11:38am

Hi David,

Here is a link to a gist:

https://gist.github.com/silverarm/9657134efd9468832b7a

test-search.php

<?php

$main_index = "client_docs";

$main_type = "documents";

$hosts = [
    'http://localhost'
];

This file has been truncated.

Thanks for your help.

Mary

dadoonet · February 17, 2016, 5:55pm

I was just asking for the .docx file. Can you share it?

MaryOConnor · February 18, 2016, 10:14am

Hi David,

Here is a link to the Test.docx file:

http://silverarm.com/assets/Test.docx

Thanks.

Mary

dadoonet · February 18, 2016, 3:09pm

I tried to extract data from your doc and it works as expected.

## Extracted text
--------------------- BEGIN -----------------------
This is a small test.

---------------------- END ------------------------
## Metadata
- author: Mary O'Connor
- content_length: 11375
- content_type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- date: 1455550860000
- keywords: null
- language: null
- name: null
- title: null

What is the output when you run your query?

thespatt · February 22, 2016, 7:40pm

Hello, I'm having the same issue. I was using the old plugin in ES 1.7 and it worked fine, but with ES 2.2 and the new plugin I cannot search on the .docx content. If I re-save the .docx as a Word 97-2003 .doc, it works perfectly. I tried with several different documents.

dvaneynde · February 29, 2016, 4:17pm

Same issue here: it works with text or PDF, but not with the Office formats. I tried with a small Word document, and this is what I see in the logs:

[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract [100000] characters of text for [null]: [Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]
...
Caused by: java.lang.IllegalStateException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at org.apache.xmlbeans.XmlBeans.getContextTypeLoader(XmlBeans.java:336)
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)

dadoonet · February 29, 2016, 4:38pm

Thanks @dvaneynde and others. Very helpful.

I opened https://github.com/elastic/elasticsearch/issues/16864 to track this.
