Hi David,
thanks for your help, but it's still not working.
Here is what I did.
The query
{
  "query": {
    "match": {
      "_all": "test"
    }
  }
}
returns all my indexed documents (including the *.doc / *.docx files), and I can
see the base64 content in the file.file field.
So this looks good to me.
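For comparison, the check David suggested (a match-all query that asks only for the file.file field) could be built roughly like this. This is only a sketch: the helper name build_field_check_query is made up for illustration, and it assumes the "fields" / "match_all" request syntax of Elasticsearch 1.x.

```python
import json


def build_field_check_query(field_name="file.file", size=10):
    """Build a match_all search body that asks for a single stored field.

    Hypothetical helper; the request shape is assumed to follow the
    Elasticsearch 1.x search API ("fields" plus a "match_all" query).
    """
    return {
        "size": size,
        "fields": [field_name],
        "query": {"match_all": {}},
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into any HTTP client.
    print(json.dumps(build_field_check_query(), indent=2))
```

If file.file comes back with readable text for the PDF but is empty or still base64 for the Word files, that would point at the extraction step rather than at the query.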
Then I went to ..\config\logging.yml and added an entry under the "logger:"
section:
1st attempt: "org.apache.plugin.mapper.attachments: TRACE"
2nd attempt: "org.apache.tika: TRACE"
After shutting down ES, restarting it, deleting the existing index and
reindexing my test documents, there was no additional entry from the mapper
plugin or Tika in the log.
Apart from that, ES is logging fine...
logger:
  # log action execution errors for easier debugging
  action: DEBUG
  # reduce the logging for aws, too much is logged under the default INFO
  com.amazonaws: WARN
  # gateway
  #gateway: DEBUG
  #index.gateway: DEBUG
  # peer shard recovery
  #indices.recovery: DEBUG
  # discovery
  #discovery: TRACE
  index.search.slowlog: TRACE, index_search_slow_log_file
  index.indexing.slowlog: TRACE, index_indexing_slow_log_file
  # DBA: Enabled logger for plugin mapper.attachments
  org.apache.plugin.mapper.attachments: TRACE
My next idea was that the mapper plugin might be missing some files needed to
parse Office documents.
In the plug-in folder I can see these *.jar files:
rome-0.9.jar
tagsoup-1.2.1.jar
tika-core-1.5.jar
tika-parsers-1.5.jar
vorbis-java-core-0.1.jar
vorbis-java-core-0.1-tests.jar
vorbis-java-tika-0.1.jar
xercesImpl-2.8.1.jar
xml-apis-1.3.03.jar
xmpcore-5.1.2.jar
xz-1.2.jar
apache-mime4j-core-0.7.2.jar
apache-mime4j-dom-0.7.2.jar
asm-debug-all-4.1.jar
aspectjrt-1.6.11.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
boilerpipe-1.1.0.jar
commons-compress-1.5.jar
commons-logging-1.1.1.jar
elasticsearch-mapper-attachments-2.3.1.jar
fontbox-1.8.4.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
isoparser-1.0-RC-1.jar
jdom-1.0.jar
jempbox-1.8.4.jar
jhighlight-1.0.jar
juniversalchardet-1.0.3.jar
metadata-extractor-2.6.2.jar
netcdf-4.2-min.jar
pdfbox-1.8.4.jar
I'm not sure, but the page below lists additional "poi*.jar" files that should
be responsible for parsing the Office files:
http://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.5
I downloaded the following files to the plugin folder, but the documents
are still not parsed:
poi-3.10-beta2.jar
poi-ooxml-3.10-beta2.jar
poi-scratchpad-3.10-beta2.jar
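To double-check which jars are actually present, a small script along these lines could help. It is a rough sketch only: the function names are made up, and the list of expected prefixes is simply taken from the file names mentioned in this post, not from an authoritative list of what Tika needs.

```python
from pathlib import Path

# Assumption: prefixes copied from the jar names in this thread.
# Extend the tuple if your Tika version needs further libraries.
EXPECTED_PREFIXES = ("tika-core", "tika-parsers", "poi", "poi-ooxml", "poi-scratchpad")


def jars_in(plugin_dir):
    """Return the sorted names of the *.jar files directly inside plugin_dir."""
    return sorted(p.name for p in Path(plugin_dir).glob("*.jar"))


def missing_jars(jar_names, expected_prefixes=EXPECTED_PREFIXES):
    """Return every expected prefix that has no '<prefix>-<version>.jar' file."""
    missing = []
    for prefix in expected_prefixes:
        found = False
        for name in jar_names:
            if not name.startswith(prefix + "-"):
                continue
            # Require a digit right after the dash so that 'poi' does not
            # accidentally match 'poi-ooxml-3.10-beta2.jar'.
            rest = name[len(prefix) + 1:]
            if rest[:1].isdigit():
                found = True
                break
        if not found:
            missing.append(prefix)
    return missing
```

Calling missing_jars(jars_in(plugin_dir)) with the path of the plugin folder returns the prefixes for which no jar was found. An empty list only means the files are there; it says nothing about whether ES actually loads them.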
The last check was to make sure the Word documents are not corrupted. A
colleague of mine checked a test file with
java -jar tika-app-1.5.jar -g
and the output was fine for the document.
So, does anyone have more ideas?
Thanks
Dirk
On Monday, 25 August 2014 10:56:54 UTC+2, David Pilato wrote:
From my experience, this should work. Indexing Word docs should work, as
Tika supports Office docs.
Not sure what you are doing wrong. Try sending a match all query and asking
for the field file.file.
Also, you could set the mapper plugin to TRACE mode in logging.yml and see if
it tells you something interesting.
HTH
--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
On 25 August 2014 at 09:05, Dirk Bauer <dirk....@gmail.com> wrote:
Hi,
using elasticsearch-1.3.2 with
Plug-in
name: mapper-attachments
version: 2.3.1
description: Adds the attachment type allowing to parse difference
attachment formats
jvm: true
site: false
on Windows 8 for evaluation purposes.
JVM
version: 1.7.0_67
vm_name: Java HotSpot(TM) Client VM
vm_version: 24.65-b04
vm_vendor: Oracle Corporation
I have created the following mapping:
{
  "myIndex": {
    "mappings": {
      "dokument": {
        "properties": {
          "created": {
            "type": "date",
            "format": "dateOptionalTime"
          },
          "description": {
            "type": "string"
          },
          "file": {
            "type": "attachment",
            "path": "full",
            "fields": {
              "file": {
                "type": "string",
                "store": true,
                "term_vector": "with_positions_offsets"
              },
              "author": {
                "type": "string"
              },
              "title": {
                "type": "string"
              },
              "name": {
                "type": "string"
              },
              "date": {
                "type": "date",
                "format": "dateOptionalTime"
              },
              "keywords": {
                "type": "string"
              },
              "content_type": {
                "type": "string"
              },
              "content_length": {
                "type": "integer"
              },
              "language": {
                "type": "string"
              }
            }
          },
          "id": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}
Because I want to use ES from C#/.NET, I have written a small C# app that
reads a file from the hard drive as a base64-encoded stream and puts the
document into the ES index. I'm working with this POST request:
{
  "id": "8dbf1d73-44d1-4e20-aa35-13b18ddf5057",
  "title": "Test",
  "description": "Test Description",
  "created": "2014-01-20T19:04:20.1019885+01:00",
  "file": {
    "_content_type": "application/pdf",
    "_name": "Test.pdf",
    "content": "---my base64 stuff here---"
  }
}
and send it as index command to ES like this:
myIndex/dokument/8dbf1d73-44d1-4e20-aa35-13b18ddf5057?refresh=true
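For readers who don't use C#, the same request body can be assembled in a few lines of Python. This is only a sketch of the idea, not my actual app: the function name build_attachment_doc is made up, and the field names are simply the ones from the request above.

```python
import base64
import json


def build_attachment_doc(doc_id, title, description, created,
                         file_name, content_type, raw_bytes):
    """Build the index request body with the file content base64-encoded.

    Hypothetical helper; field names follow the POST request shown above.
    """
    return {
        "id": doc_id,
        "title": title,
        "description": description,
        "created": created,
        "file": {
            "_content_type": content_type,
            "_name": file_name,
            # b64encode returns bytes; decode to get a JSON-serializable str.
            "content": base64.b64encode(raw_bytes).decode("ascii"),
        },
    }


if __name__ == "__main__":
    # Inline sample bytes stand in for the real file content here.
    doc = build_attachment_doc(
        "8dbf1d73-44d1-4e20-aa35-13b18ddf5057", "Test", "Test Description",
        "2014-01-20T19:04:20.1019885+01:00", "Test.pdf", "application/pdf",
        b"sample file content",
    )
    print(json.dumps(doc, indent=2))
```

In a real run, raw_bytes would come from reading the file in binary mode, and the JSON would be sent to the index URL shown above.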
After that I query ES with this request:
{
  "fields": [],
  "query": {
    "match": {
      "file": "test"
    }
  },
  "highlight": {
    "fields": {
      "file": {}
    }
  }
}
If my input is a *.pdf or *.txt file, everything works as expected: the
content of the document is recognized by the mapper-attachments plug-in,
and the hits for my search string "test" are highlighted.
I have searched for hours now for a way to do the same with
Microsoft Office documents, but I can't get it to work. ES does not
return any error message when I add the documents, but I cannot find
the content of my Office documents.
Can anyone please help me and give me a sample of how to index a *.doc,
*.docx, *.xls, *.xlsx etc.?
I have tried to give ES a hint for the content type / MIME type based on
this link http://filext.com/faq/office_mime_types.php but it made no
difference.
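To make the content-type hint concrete, this is roughly the lookup I mean, written as a Python sketch. The names OFFICE_MIME_TYPES and content_type_for are made up for illustration; the four MIME types are the standard registered ones for these extensions, and the fallback value is my own assumption.

```python
import os

# Standard MIME types for the four Office extensions asked about here.
OFFICE_MIME_TYPES = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def content_type_for(file_name, default="application/octet-stream"):
    """Pick a value for "_content_type" from the file extension.

    The default is an assumption for unknown extensions, not something
    the plugin requires.
    """
    ext = os.path.splitext(file_name)[1].lower()
    return OFFICE_MIME_TYPES.get(ext, default)
```

The returned string is what I put into "_content_type" in the request body.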
Thanks in advance!
Dirk
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/2a2d0406-4177-431f-ba33-8766a1ce4a07%40googlegroups.com
.
For more options, visit https://groups.google.com/d/optout.