Not able to fulltext index Microsoft Office documents - PDF works fine


(Dirk) #1

Hi,

I'm using elasticsearch-1.3.2 with

Plug-in

name: mapper-attachments
version: 2.3.1
description: Adds the attachment type allowing to parse difference attachment formats
jvm: true
site: false

on Windows 8 for evaluation purposes.

JVM

version: 1.7.0_67
vm_name: Java HotSpot(TM) Client VM
vm_version: 24.65-b04
vm_vendor: Oracle Corporation

I have created the following mapping:

{
  "myIndex": {
    "mappings": {
      "dokument": {
        "properties": {
          "created": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "description": {
            "type": "string"
          },
          "file": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "file": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "id": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}

Because I like to use ES from C#/.NET, I have written a small C# app that reads a file from the hard drive, base64-encodes it, and puts the document into the ES index. I'm working with this POST request:

{
  "id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
  "title": "Test",
  "description": "Test Description",
  "created": "2014-01-20T19:04:20.1019885+01:00",
  "file": {
    "_content_type": "application/pdf",
    "_name": "Test.pdf",
    "content": "---my base64 stuff here---"
  }
}

and send it as index command to ES like this:

myIndex/dokument/8dbf1d73-44d1-4e20-aa35-13b18ddf5057?refresh=true
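For readers not using C#, the same request can be sketched in Python. This is only an illustration of the steps described above, not the poster's actual app: the helper names `build_attachment_doc` and `index_path` are hypothetical, the file bytes are a placeholder instead of a real file read from disk, and no HTTP call is made.

```python
import base64
import json
import uuid


def build_attachment_doc(file_bytes, file_name, content_type,
                         title, description, created):
    """Build the JSON body for one 'dokument' (hypothetical helper)."""
    return {
        "id": str(uuid.uuid4()),
        "title": title,
        "description": description,
        "created": created,
        "file": {
            "_content_type": content_type,
            "_name": file_name,
            # The attachment field carries the raw file as base64 text.
            "content": base64.b64encode(file_bytes).decode("ascii"),
        },
    }


def index_path(index, doc_type, doc_id):
    """Path of the index request, with refresh=true as in the post."""
    return "{0}/{1}/{2}?refresh=true".format(index, doc_type, doc_id)


# Placeholder bytes; a real app would read them with open(path, "rb").
doc = build_attachment_doc(
    b"%PDF-1.4 placeholder bytes",
    "Test.pdf",
    "application/pdf",
    "Test",
    "Test Description",
    "2014-01-20T19:04:20+01:00",
)
body = json.dumps(doc)
path = index_path("myIndex", "dokument", doc["id"])
print(path)
```

The resulting `body` would be sent to `path` with whatever HTTP client is at hand.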

After that I query ES with this request:

{
  "fields": [],
  "query": {
    "match": {
      "file": "test"
    }
  },
  "highlight": {
    "fields": {
      "file": {}
    }
  }
}
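As a rough Python sketch of the query above and of reading the highlights back: the helper names are hypothetical, and `sample_response` is a hand-written stand-in shaped like an ES search response (hits under `hits.hits`, fragments under each hit's `highlight` key), not output captured from a real cluster.

```python
def build_highlight_query(field, text):
    """Same request body as in the post, for an arbitrary field and text."""
    return {
        "fields": [],
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


def extract_highlights(response, field):
    """Map document id -> highlighted fragments for the given field."""
    results = {}
    for hit in response.get("hits", {}).get("hits", []):
        fragments = hit.get("highlight", {}).get(field, [])
        if fragments:
            results[hit["_id"]] = fragments
    return results


# Hand-written sample for illustration only (assumed response shape).
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {
                "_id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
                "highlight": {"file": ["This is a <em>test</em> document"]},
            },
            {"_id": "doc-without-highlight"},
        ],
    }
}

query = build_highlight_query("file", "test")
highlights = extract_highlights(sample_response, "file")
print(highlights)
```

A document whose content was never extracted would simply not appear in the hits, which matches the silent failure described in this post.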

If my input is a *.pdf or *.txt file, everything works as expected: the mapper-attachments plug-in extracts the document content, and the hits for my search string "test" are highlighted.

I have searched for hours for a way to do the same with Microsoft Office documents, but I can't get it to work. ES does not return any error when I add the documents, but my searches never match the content of my Office documents.
Can anyone please help me and give me a sample showing how to index a *.doc, *.docx, *.xls, *.xlsx, etc.?

I have tried giving ES a hint for the content type / MIME type based on this link http://filext.com/faq/office_mime_types.php, but it makes no difference.
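For reference, the kind of content-type hint described here can be sketched in Python as a lookup from file extension to MIME type, with the result going into `_content_type`. The table and the `guess_content_type` helper are illustrative assumptions, not part of the plug-in. As reported in this post, the hint made no difference with plug-in 2.3.1, so this only shows what was tried.

```python
import os

# Common Office MIME types (a hand-picked subset, for illustration).
OFFICE_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument"
             ".spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument"
             ".presentationml.presentation",
}


def guess_content_type(file_name, default="application/octet-stream"):
    """Return a MIME type for file_name based on its extension."""
    extension = os.path.splitext(file_name)[1].lower()
    return OFFICE_MIME_TYPES.get(extension, default)


print(guess_content_type("Report.docx"))
print(guess_content_type("Budget.XLS"))
```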

Thanks in advance!
Dirk


(David Pilato) #2

This is indeed an issue in mapper attachments plugin 2.3.1.

It will be fixed early next week in 2.3.2.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 23 August 2014 at 18:28, Dirk dirk.bauer@gmail.com wrote:

--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Not-able-to-fulltext-index-Microsoft-Office-documents-PDF-works-fine-tp4062325.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1408811281465-4062325.post%40n3.nabble.com.
For more options, visit https://groups.google.com/d/optout.



(feenz) #3

Hi David,
I am currently using elasticsearch-1.3.1. Will mapper-attachments 2.3.2 be compatible with my version of ES, or will I have to update?

Thanks,

- Kyle

(David Pilato) #4

It will work with 1.3.1.
You should update to 1.3.2 though because we fixed some issues in this version.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 29 August 2014 at 16:07, feenz kfeeney5506@gmail.com wrote:



(robnikkel) #5

Hi, I can confirm that none of my MS Office docs are being indexed with ES 1.3.2 and mapper attachments 2.3.1. Any estimate on a fix for this?


(robnikkel) #6

Wow, you're fast, I just saw you put up version 2.3.2. I downloaded it, restarted ES, and reindexed and all is good now with my MS Office format docs. Thanks!
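For anyone wanting to confirm which nodes still run the affected plug-in version before reindexing, here is a small Python sketch. It assumes a nodes-info style response in which each node lists its plugins with `name` and `version`, as in the plug-in details quoted in the first post. The helper name and the sample data are hypothetical; the sample is hand-written, not real cluster output.

```python
def nodes_running_plugin(nodes_info, plugin_name, version):
    """Return the names of nodes that have plugin_name at exactly version."""
    affected = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        for plugin in node.get("plugins", []):
            if (plugin.get("name") == plugin_name
                    and plugin.get("version") == version):
                # Fall back to the node id if the node has no name.
                affected.append(node.get("name", node_id))
    return sorted(affected)


# Hand-written sample (assumed shape), one node upgraded and one not.
sample_nodes_info = {
    "nodes": {
        "id-1": {
            "name": "node-a",
            "plugins": [{"name": "mapper-attachments", "version": "2.3.1"}],
        },
        "id-2": {
            "name": "node-b",
            "plugins": [{"name": "mapper-attachments", "version": "2.3.2"}],
        },
    }
}

still_affected = nodes_running_plugin(
    sample_nodes_info, "mapper-attachments", "2.3.1")
print(still_affected)
```

Documents indexed while 2.3.1 was installed would still need to be reindexed after the upgrade, as described above.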

