No hits when doing a text search in an attachment for a .docx file

Hi,

I'm new to Elasticsearch and am using version 2.2.0.

I am able to index a .txt file and a .doc file and do a search and get a hit.

However, when I try to index a .docx file, I do not get any hits.

My index is client_index.

My type is documents.

This is my mapping json:

{
    "index":"client_index",
    "type":"documents",
    "body":{
        "documents":{
            "properties":{
                "file":{
                    "type":"attachment"
                }
            }
        }
    }
}

I base64 encode the contents of a .docx file and I add the document to the index with the following json:

{
    "index":"client_index",
    "type":"documents",
    "id":"1",
    "body":{
        "file": "' . $file_encoded . '"
    }
}
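
Concatenating variables into a JSON string is easy to get wrong, so here is a minimal sketch of the same indexing step with the parameters assembled as a data structure instead. It is written in Python rather than the PHP used above, purely for illustration. `build_index_request` is a hypothetical helper, and the keys simply mirror the JSON in this post.

```python
import base64
import json


def build_index_request(index, doc_type, doc_id, file_bytes):
    """Return index parameters shaped like the JSON above.

    Hypothetical helper: the attachment field takes the file content as
    base64 text, so the raw bytes are encoded here.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {
        "index": index,
        "type": doc_type,
        "id": doc_id,
        "body": {"file": encoded},
    }


if __name__ == "__main__":
    # Stand-in bytes; in practice they would be read from the .docx in binary mode.
    sample = b"PK\x03\x04 stand-in for real .docx bytes"
    request = build_index_request("client_index", "documents", "1", sample)
    print(json.dumps(request, indent=4))
```

Decoding the `file` value with `base64.b64decode` should give back exactly the bytes that were read, which is a quick way to rule out an encoding problem before looking at the plugin.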

This is the response that I get:

Array

(
    [_index] => client_index
    [_type] => documents
    [_id] => 1
    [_version] => 1
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )

    [created] => 1
)

The content of the .docx file is simply: "This is a small test."

I do a search with the following json:

// Search for text in an attachment
$search_text = "test";
    
$json = '
{
    "index":"client_index",
    "type":"documents",
    "body":{
        "query":{
            "query_string":{
                "query": "' . $search_text . '"
            }
        }
    }
}
';

This is the response that I get:

(
    [_index] => client_index
    [_type] => documents
    [_id] => 1
    [_version] => 1
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )

    [created] => 1
)

I'm searching for the string "test" but I get no hits.

Can someone help me with this, please? Thanks.

Mary

Can you check that:

  • plugin is loaded
  • mapping has been applied correctly

Then, can you try a match query on field file.content?
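
A rough sketch of those three checks, in Python for illustration. Several things here are assumptions rather than facts from this thread: the node address, the plugin name `mapper-attachments`, the sample `_cat/plugins` row, and all of the helper names. The demo runs on canned inputs so it works without a cluster.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node on the default port


def http_get(path):
    """Fetch a path from the node and return the response body as text."""
    with urllib.request.urlopen(ES_URL + path) as response:
        return response.read().decode("utf-8")


def plugin_is_loaded(cat_plugins_text, plugin="mapper-attachments"):
    """True if any row of the _cat/plugins output names the plugin."""
    return any(plugin in line.split() for line in cat_plugins_text.splitlines())


def attachment_fields(mapping, index, doc_type):
    """List fields mapped as "attachment" in a GET <index>/_mapping response."""
    properties = mapping[index]["mappings"][doc_type].get("properties", {})
    return [name for name, spec in properties.items()
            if spec.get("type") == "attachment"]


def match_query(field, text):
    """Build a match query body for a single field."""
    return {"query": {"match": {field: text}}}


if __name__ == "__main__":
    # Canned inputs (illustrative only) so the sketch runs without a cluster.
    sample_plugins = "node-1 mapper-attachments 2.2.0 j"
    sample_mapping = {
        "client_index": {
            "mappings": {
                "documents": {"properties": {"file": {"type": "attachment"}}}
            }
        }
    }
    print(plugin_is_loaded(sample_plugins))
    print(attachment_fields(sample_mapping, "client_index", "documents"))
    print(json.dumps(match_query("file.content", "test")))
```

Against a live node, the inputs would come from `http_get("/_cat/plugins")` and `json.loads(http_get("/client_index/_mapping"))`.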

Hi David,

Thanks for your reply and for your help with this.

I managed yesterday to get the search working for all file types except .docx files. That probably means that the plugin is loaded and the mapping has been applied correctly, but I must be missing something, since it is supposed to work with .docx files as well.

This is my mapping json:

{
    "index":"client_index",
    "type":"documents",
    "body":{
        "documents":{
            "properties":{
                "file":{
                    "type":"attachment",
                    "fields": {
                        "content": {
                            "type": "string",
                            "term_vector":"with_positions_offsets",
                            "store": true
                        }
                    }
                }
            }
        }
    }
}

and this is my index json:

{
    "index":"client_index",
    "type":"documents",
    "id":"' . $index . '",
    "body":{
        "file":{
            "_content":"' . $file_encoded . '",
            "_name":"' . $file_to_index .'",
            "_content_type":"' . $mime_type . '",
            "_indexed_chars":"-1",
            "_detect_language":"true"
        }
    }
}

and this is my search json:

{
    "index":"client_index",
    "type":"documents",
    "body":{
        "query":{
            "match":{
                "file.content": "' . $search_text . '"
            }
        },
        "highlight": {
            "fields": {
                "file.content": {
                }
            }
        }
    }
}

Your help is much appreciated. Thanks. Mary

Can you share your file?

Hi David,

Here is a link to a gist:

Thanks for your help.

Mary

I was just asking for the .docx file. Can you share it?

Hi David,

Here is a link to the Test.docx file:

http://silverarm.com/assets/Test.docx

Thanks.

Mary

I tried to extract data from your doc and it works as expected.

## Extracted text
--------------------- BEGIN -----------------------
This is a small test.

---------------------- END ------------------------
## Metadata
- author: Mary O'Connor
- content_length: 11375
- content_type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
- date: 1455550860000
- keywords: null
- language: null
- name: null
- title: null

What is the output when you run your query?

Hello, I'm having the same issue. I was using the old plugin in ES 1.7 and it worked fine, but with ES 2.2 and the new plugin I cannot search on the .docx content. If I re-save the .docx as a Word 97-2003 .doc, it works perfectly. I tried with several different documents.

Same issue here: it works with text or PDF, but not with the Office formats. I tried with a small Word document, and this is what I see in the logs:

[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract [100000] characters of text for [null]: [Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]
...
Caused by: java.lang.IllegalStateException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at org.apache.xmlbeans.XmlBeans.getContextTypeLoader(XmlBeans.java:336)
...
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
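
For anyone who wants to check whether their node is hitting the same failure, here is a small sketch that scans log text for these lines. It is Python, for illustration only. `summarize_extraction_failures` is a hypothetical helper, and the patterns are based only on the excerpt above, so other versions may word the messages differently. Note that the failure is logged at DEBUG level, so it will only appear if that level is enabled for the mapper.

```python
import re

# Patterns modelled on the log excerpt quoted above (assumption: same wording).
FAILURE = re.compile(r"\]\[DEBUG\]\[mapper\.attachment\s*\]\s*Failed to extract")
DENIED = re.compile(r'access denied \("([^"]+)" "([^"]+)"\)')


def summarize_extraction_failures(log_text):
    """Count extraction failures and collect the distinct denied permissions."""
    failures = 0
    denied = set()
    for line in log_text.splitlines():
        if FAILURE.search(line):
            failures += 1
        match = DENIED.search(line)
        if match:
            # (permission class, permission name)
            denied.add(match.groups())
    return failures, sorted(denied)


if __name__ == "__main__":
    sample_log = "\n".join([
        "[2016-02-29 16:43:39,341][DEBUG][mapper.attachment ] Failed to extract "
        "[100000] characters of text for [null]: [Unexpected RuntimeException "
        "from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@51667d8a]",
        'Caused by: java.security.AccessControlException: access denied '
        '("java.lang.RuntimePermission" "getClassLoader")',
    ])
    count, permissions = summarize_extraction_failures(sample_log)
    print(count, permissions)
```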

Thanks @dvaneynde and others. Very helpful.

I opened https://github.com/elastic/elasticsearch/issues/16864 to track this.