How do I map an array of attachments


(Michiel) #1

I have items that I want to index, each with multiple attachments.
I don't know in advance how many there will be, so I want to use an
array of attachments.

Basic checks:

  • Using Elasticsearch 0.11.0
  • Installed the attachments mapper (on startup it reports loaded
    [mapper-attachments, analysis-icu])
  • Using the REST API

So I have the index myitems and the type docitem.

For indexing I use:

URL: http://myserver:9200/myitems/docitem/1
Content:
{
  "id" : 1,
  "name" : "test",
  "description" : "test",
  "files" : [
    {
      "_content_type" : "application/msword",
      "_name" : "resource/name/of/my.doc",
      "content" : "... base64 encoded attachment ..."
    },
    {
      "_content_type" : "application/pdf",
      "_name" : "resource/name/of/my.pdf",
      "content" : "... base64 encoded attachment ..."
    },
    {
      "_content_type" : "application/msexcel",
      "_name" : "resource/name/of/my.xls",
      "content" : "... base64 encoded attachment ..."
    }
  ]
}
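
A minimal sketch of building such a request body programmatically, assuming Python; the helper name build_docitem and the placeholder file bytes are hypothetical and only serve as illustration. Letting a JSON library serialize the item keeps the attachment array valid however many files there are.

```python
import base64
import json


def build_docitem(item_id, name, description, attachments):
    """Build the JSON body for one item with any number of attachments.

    `attachments` is a list of (content_type, resource_name, raw_bytes) tuples.
    """
    files = []
    for content_type, resource_name, raw in attachments:
        files.append({
            "_content_type": content_type,
            "_name": resource_name,
            # Standard base64 without line breaks, as text for the JSON body.
            "content": base64.b64encode(raw).decode("ascii"),
        })
    return json.dumps({
        "id": item_id,
        "name": name,
        "description": description,
        "files": files,
    })


# Placeholder bytes stand in for real .doc and .pdf files.
body = build_docitem(1, "test", "test", [
    ("application/msword", "resource/name/of/my.doc", b"fake word bytes"),
    ("application/pdf", "resource/name/of/my.pdf", b"fake pdf bytes"),
])
```

The resulting string is what would be sent as the content of the index request above.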

So I suppose the put_mapping would be:

URL: http://myserver:9200/myitems/docitem/_mapping
Content:
{
  "docitem" : {
    "properties" : {
      "files" : { "type" : "attachment" }
    }
  }
}
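
As a rough illustration, the mapping call could be prepared like this in Python. The host, index and type names come from the post; the script only builds the PUT request and does not send it, so it is a sketch rather than a tested recipe for Elasticsearch 0.11.0.

```python
import json
import urllib.request

# URL taken from the post; adjust the host for your own cluster.
MAPPING_URL = "http://myserver:9200/myitems/docitem/_mapping"

mapping = {
    "docitem": {
        "properties": {
            "files": {"type": "attachment"},
        }
    }
}

# Build (but do not send) the PUT request that carries the mapping.
request = urllib.request.Request(
    MAPPING_URL,
    data=json.dumps(mapping).encode("utf-8"),
    method="PUT",
)

# urllib.request.urlopen(request) would send it to a running node.
```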

But when I search for text contained in an attached Word document, I
get no results, although searching on any other field works.

What am I doing wrong?


(Lukáš Vlček) #2

Hi,

Could you share your query? What does it look like?

Regards,
Lukas


(Michiel) #3

Actually, plain and simple, because I read that default_field is _all
by default.

URL: http://myserver:9200/myitems/docitem/_search?q=Sync4J

And Sync4J is a word in the content of the MS Word document.
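
A small sketch of building that search URL in Python, so that query terms containing spaces or special characters are URL-encoded. The helper name search_url is hypothetical; the base URL is the one from the post.

```python
from urllib.parse import urlencode

# Search endpoint from the post; adjust the host for your own cluster.
BASE_URL = "http://myserver:9200/myitems/docitem/_search"


def search_url(query):
    """Return the URI-search URL for a query string, with `q` URL-encoded."""
    return BASE_URL + "?" + urlencode({"q": query})


url = search_url("Sync4J")
```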


(Lukáš Vlček) #4

I have not tried indexing attachments in an array myself, but a single
attachment per document works fine for me, and your mappings look good to
me. If I were you, I would first check whether other words from that
document can be found by search. Then I would check what content Tika was
able to get out of this document (there is a Swing console for Tika that
can be used for quick tests, see:
http://tika.apache.org/0.7/gettingstarted.html#Using_Tika_as_a_command_line_utility
; just run it, drag and drop the document onto it, and see the content
Tika is able to extract). If the word is there, I would then inspect the
Elasticsearch index with Luke.

HTH,
Lukas


(Michiel) #5

I found the problem and solved it, at least temporarily.

I checked whether Tika could extract anything from the Word document,
and it could: it read the document successfully.

I got this error message (as part of a MapperParsingException and a
backtrace):
Caused by: org.elasticsearch.common.jackson.JsonParseException:
Failed to decode VALUE_STRING as base64 (MIME-NO-LINEFEEDS): Illegal
character '\' (code 0x5c) in base64 content

I use PHP, and PHP's json_encode escapes / characters as \/, so the
encoded document contained a lot of \/ sequences. As a test I replaced
every \/ with a plain / and then the document was indexed. So there
seems to be a bug either in PHP or in Jackson, the JSON parser that
Elasticsearch uses. I'm going to put together a proof of concept so I
can post this as a bug report.
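
The failure and the workaround can be simulated in Python. This is only a sketch of the behaviour described above, not the PHP or Jackson code: the serializer that escapes / is imitated with a string replace, and the sample bytes are chosen so that their base64 form contains slashes.

```python
import base64
import binascii
import json

# Bytes chosen so that their base64 form contains '/' characters.
raw = b"\xff\xff\xfe"
encoded = base64.b64encode(raw).decode("ascii")  # "///+"

# Imitate a serializer that escapes every '/' as '\/' (as the post
# describes for PHP's json_encode).
escaped_body = json.dumps({"content": encoded}).replace("/", "\\/")

# A base64 decoder that sees the still-escaped text meets '\' (0x5c),
# which is not in the base64 alphabet, so strict decoding fails.
escaped_text = encoded.replace("/", "\\/")
try:
    base64.b64decode(escaped_text, validate=True)
    strict_decode_failed = False
except binascii.Error:
    strict_decode_failed = True

# The workaround from the post: turn '\/' back into '/' before sending.
# Caution: a blanket replace is only safe when the body cannot contain
# an escaped backslash followed by a slash ('\\/').
fixed_body = escaped_body.replace("\\/", "/")
decoded = base64.b64decode(json.loads(fixed_body)["content"], validate=True)
```

In PHP 5.4 and later, json_encode also accepts the JSON_UNESCAPED_SLASHES flag, which avoids producing \/ in the first place.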

Thanks for the help.

Greetings,

Michiel


(Michiel) #6

This seems to be a bug in PHP (see: http://bugs.php.net/bug.php?id=49366 ),
although they don't seem to consider it a bug. Jackson, however, won't
handle it as an escaped character.


(Michiel) #7

However, the chart on JSON.org ( http://www.json.org/string.gif ) does
show it as standard. But apart from this chart, nothing says anything
about escaping the / character.


(Michiel) #8

Never mind, it seems that / is an escapable character, though some
people think it doesn't have to be escaped. I haven't yet found anything
saying that it doesn't have to be escaped, so I guess that's not true.

Both json.org and the Jackson source code (
http://svn.jackson.codehaus.org/browse/jackson/trunk/src/java/org/codehaus/jackson/impl/ReaderBasedParser.java?r=HEAD#l881
) say that / is an escapable character.
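
For reference, the two-character escapes in the json.org string chart can be checked against any JSON parser. The sketch below uses Python's json module purely as an example parser (it is not Jackson) and shows that \/ decodes to a plain /, so both spellings denote the same string.

```python
import json

# Two-character escapes from the JSON string grammar, mapped to the
# character each one stands for (\uXXXX escapes are left out here).
ESCAPES = {
    '\\"': '"',
    '\\\\': '\\',
    '\\/': '/',
    '\\b': '\b',
    '\\f': '\f',
    '\\n': '\n',
    '\\r': '\r',
    '\\t': '\t',
}

# Wrap each escape in quotes to form a JSON string and decode it.
decoded = {esc: json.loads('"' + esc + '"') for esc in ESCAPES}

# An escaped and an unescaped slash yield the same decoded value.
same = json.loads('"a\\/b"') == json.loads('"a/b"')
```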

