Problem with .docx and the mapper-attachments plugin

Hi guys.

I'm currently evaluating Elasticsearch for one of my small private software projects :)

I'm trying to index attachments with the mapper-attachments plugin and the indexing of .txt and .pdf files works fine.

But indexing .docx files seems to fail. When I search for my test string, I get the .pdf and .txt results, but not the .docx one. I have already searched for this problem but can't find a solution.

I followed the instructions for index creation on https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments-highlighting.html and added the .docx file with the following command:

PUT /test/person/4?refresh=true
{
  "file": {
    "_content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "_content": "... MY BASE64 CONTENT ..."
  }
}
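
For anyone trying to reproduce this, the request body above could be assembled roughly as follows. This is only a sketch in Python: the helper name `build_attachment_body` and its parameters are made up for illustration, and it only builds the JSON body. Sending it to Elasticsearch is left out.

```python
import base64
import json

# MIME type used in the request above for .docx files.
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_attachment_body(raw_bytes, content_type=DOCX_MIME, field="file"):
    """Build a JSON-serializable body for an attachment field.

    `field` defaults to "file", matching the mapping used in this thread.
    The helper itself is hypothetical and not part of any client library.
    """
    return {
        field: {
            "_content_type": content_type,
            # The attachment content is sent as a Base64 string.
            "_content": base64.b64encode(raw_bytes).decode("ascii"),
        }
    }


if __name__ == "__main__":
    # In practice raw_bytes would come from open(path, "rb").read().
    body = build_attachment_body(b"hello")
    print(json.dumps(body, indent=2))
```

The printed JSON can then be used as the body of the PUT request shown above.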

Can anyone reproduce this problem?

Thanks :)

Yes. https://github.com/elastic/elasticsearch/pull/17059

Thanks a lot for your hint.

Is there a way to update the plugin without compiling it myself?

After running "bin\plugin.bat install mapper-attachments", the .docx files still don't work.

Thanks! :)

I'm afraid you will have to wait for Elasticsearch 2.3.0, which fixes this.
I'm not sure whether there is a plan to backport the fix to 2.2.x.

@Clinton_Gormley1 WDYT?

OK, thanks. Then I will wait for the release of 2.3.0.

Is there an approximate release date for this version?

Thanks!

Hi @Markus_Ziegler

How is your project going?

It's working now! I used only the file extension as the content type when indexing, as below:

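$filetype = 'docx'; // or 'pdf', 'txt', 'doc'

$params = [
    // 'index', 'type' and 'id' entries omitted here
    'body' => [
        'file' => [
            // builds e.g. 'application/docx' from the file extension
            '_content_type'  => 'application/' . $filetype,
            '_name'          => $file, // full file pathname
            '_language'      => $lang, // language
            '_indexed_chars' => -1,    // -1: no limit on extracted characters
            '_content'       => base64_encode(file_get_contents($file)),
        ],
    ],
];

This is code for the PHP client library.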

Cheers!

Hi @evert, what version are you running?

Thanks for all your assistance.

It's working now in the current version 2.3.3 and I'm able to index docx files as expected.

Loving this project :)

Hi @Drammy, I am using 2.3.3.

That's great, thanks.

Is there a list somewhere of the supported document types? (or is it just those supported in Tika?)

I think it's the ones supported in Tika. It uses Tika.

Actually, we removed some dependencies and kept only the ones needed for common file types such as PDF, OpenOffice, MS Office, text...

Hi David,

Do you know if there's a list somewhere?

We need to index documents from a system that can store any doc type. If ES doesn't index a given document we'll need to handle that scenario by reading in the contents ourselves and indexing them separately...

That's what I'd do for all documents, instead of sending binary content to Elasticsearch.
What I mean is: don't use the mapper-attachments plugin if you can do the text extraction some other way. You can actually use Tika yourself.

I don't have such a list.
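
As a rough illustration of extracting text on the client side and indexing plain text only: the sketch below is not Tika. It is a Python standard-library stand-in that relies on a .docx file being a ZIP archive whose main text lives in word/document.xml. The function names and the `name`/`content` fields are hypothetical, and headers, footers and footnotes are ignored.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace of WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(data: bytes) -> str:
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def build_plain_text_doc(data: bytes, name: str) -> dict:
    """Build a plain-text document body (field names are assumptions)."""
    return {"name": name, "content": extract_docx_text(data)}
```

The resulting dictionary contains only text, so it can be indexed into ordinary string fields without the attachment type. For formats beyond .docx, a real extractor such as Tika would replace `extract_docx_text`.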

Thanks David, we'll look into this approach - makes sense to me.

Do you know where I can get such a list? Anyone I can ask?
I'd have thought there should be a list somewhere of what doc types the attachment plugin supports...

Well, we opened this issue at some point: https://github.com/elastic/elasticsearch-mapper-attachments/issues/163

But if you use Tika on your side you are not limited by our implementation.

So the Tika list is https://tika.apache.org/1.13/formats.html