Hi,
I'm using elasticsearch-1.3.2 on Windows 8 for evaluation purposes, with the following plug-in:
  name: mapper-attachments
  version: 2.3.1
  description: Adds the attachment type allowing to parse difference attachment formats
  jvm: true
  site: false
JVM:
  version: 1.7.0_67
  vm_name: Java HotSpot(TM) Client VM
  vm_version: 24.65-b04
  vm_vendor: Oracle Corporation
I have created the following mapping:
{
  "myIndex": {
    "mappings": {
      "dokument": {
        "properties": {
          "created": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "description": {
            "type": "string"
          },
          "file": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "file": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "id": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}
Because I'd like to use ES from C#/.NET, I have written a small C# app that reads a file from the hard drive, encodes its content as base64, and puts the document into the ES index. I'm working with this POST request:
{
  "id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
  "title": "Test",
  "description": "Test Description",
  "created": "2014-01-20T19:04:20.1019885+01:00",
  "file": {
    "_content_type": "application/pdf",
    "_name": "Test.pdf",
    "content": "---my base64 stuff here---"
  }
}
and send it to ES as an index command like this:
myIndex/dokument/8dbf1d73-44d1-4e20-aa35-13b18ddf5057?refresh=true
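A minimal sketch of this indexing step, written in Python rather than C# to keep it short. The host http://localhost:9200, the helper names build_document and index_request, and the use of urllib are assumptions for illustration only; the field names are the ones from the request above.

```python
import base64
import json
import urllib.request

ES_HOST = "http://localhost:9200"  # assumption: a default local node


def build_document(doc_id, title, description, created,
                   file_name, content_type, raw_bytes):
    """Build the JSON body from the request above, base64-encoding the file."""
    return {
        "id": doc_id,
        "title": title,
        "description": description,
        "created": created,
        "file": {
            "_content_type": content_type,
            "_name": file_name,
            # The attachment content is sent as a base64 string
            "content": base64.b64encode(raw_bytes).decode("ascii"),
        },
    }


def index_request(index, doc_type, doc):
    """Prepare (but do not send) the index request for one document."""
    url = "{}/{}/{}/{}?refresh=true".format(ES_HOST, index, doc_type, doc["id"])
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage (needs a running node and a local Test.pdf):
#   with open("Test.pdf", "rb") as f:
#       doc = build_document("8dbf1d73-44d1-4e20-aa35-13b18ddf5057", "Test",
#                            "Test Description", "2014-01-20T19:04:20+01:00",
#                            "Test.pdf", "application/pdf", f.read())
#   urllib.request.urlopen(index_request("myIndex", "dokument", doc))
```

Keeping the request construction separate from the actual send makes it easy to dump the body and check that the base64 content is really what arrives at ES.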
After that I query ES with this request:
{
  "fields": [],
  "query": {
    "match": {
      "file": "test"
    }
  },
  "highlight": {
    "fields": {
      "file": {}
    }
  }
}
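The same search request can be sketched in Python as below. The helper names build_highlight_query and search_path are hypothetical, and the _search path on myIndex/dokument is an assumption about where the query is sent; the body is the one shown above.

```python
import json


def build_highlight_query(field, text):
    """Build a match query on one field and ask for highlights on it."""
    return {
        "fields": [],  # return no stored fields, only hits and highlights
        "query": {"match": {field: text}},
        "highlight": {"fields": {field: {}}},
    }


def search_path(index, doc_type):
    """Relative path of the search endpoint for one index and type."""
    return "{}/{}/_search".format(index, doc_type)


body = json.dumps(build_highlight_query("file", "test"), indent=2)
```

Printing body and comparing it with what the client actually sends is a quick way to rule out a malformed query when a document type returns no hits.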
If my input is a *.pdf or *.txt file, everything works as expected. The content of the document is recognized by the mapper-attachments plug-in, and the occurrences of my search string "test" are highlighted in the results.
I have searched for hours now for a way to do the same with Microsoft Office documents, but I'm not able to get it to work. ES does not return any error message when I add the documents, but I cannot find the content of my Office documents when searching.
Can anyone please help me and give me a sample of how to index *.doc, *.docx, *.xls, *.xlsx, etc.?
I have tried to give ES a hint for the content type / MIME type based on this link http://filext.com/faq/office_mime_types.php but this made no difference.
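A minimal sketch of such a content-type hint, assuming the value is chosen from the file extension and then placed in the _content_type field of the request. The helper name content_type_hint and the fallback value are hypothetical; the MIME types are the standard ones for these Office formats.

```python
import os

# Standard MIME types for common Microsoft Office formats
OFFICE_MIME_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def content_type_hint(file_name, default="application/octet-stream"):
    """Pick a MIME type from the file extension (case-insensitive)."""
    ext = os.path.splitext(file_name)[1].lower()
    return OFFICE_MIME_TYPES.get(ext, default)
```

For example, content_type_hint("Report.DOCX") gives the wordprocessingml type, and an unknown extension falls back to the default.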
Thanks in advance!
Dirk