Index binary files (PDF, ...)


(Frifri) #1

Hey everybody,

At the moment I'm testing ES for use as an indexing and storage
solution. I use the following general settings, with the analysis-icu and
mapper-attachments plugins enabled:

curl -XPUT 'http://localhost:9200/testindex/' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 2,
      "number_of_replicas" : 0
    }
  }
}'

and the following mapping for PDF indexing:

curl -XPUT localhost:9200/testindex/testmap/_mapping -d '{
  "testmap" : {
    "properties" : {
      "files" : { "type" : "attachment" }
    }
  }
}'

Afterwards, I curl a merged JSON file to ES, in which the content is
created by "base64 SOURCE.pdf > TARGET":

{

"files" : [
{
"_content_type" : "application/pdf",
"_name" : " LeParfait_disclaimer_fr.pdf",
"content" : " ...base64-encoded content.."
}
]
}

Unfortunately, the content field does not contain any keywords from the
PDF (just the long string), and every search returns nothing. I think
I misunderstood something; I'm quite new to JSON, Java, and similar
technologies.

Thanks in advance

Regards,
Frifri


(Lukáš Vlček) #2

Hi,

Can you double-check that the attachment plugin is installed correctly?

  1. Make sure both the Tika and mapper-attachments jar files are in the
    <ES_HOME>/lib folder.

For example, if you use 0.17.0-SNAPSHOT you should see the following two
files in the lib folder:
elasticsearch-mapper-attachments-0.17.0-SNAPSHOT.jar
tika-app-0.9.jar

  2. Is the attachment plugin recognized when ES starts up?

You can see this in the log file. Navigate to <ES_HOME>/logs and check the
sequence of ES startup log records. You should see something like:

[2011-06-22 11:44:38,456][INFO ][node ] [Amber Hunt] {elasticsearch/0.17.0-SNAPSHOT/2011-06-21T13:56:26}[6144]: initializing ...
[2011-06-22 11:44:38,481][INFO ][plugins ] [Amber Hunt] loaded [mapper-attachments], sites []
[2011-06-22 11:44:41,322][INFO ][node ] [Amber Hunt] {elasticsearch/0.17.0-SNAPSHOT/2011-06-21T13:56:26}[6144]: initialized
[2011-06-22 11:44:41,322][INFO ][node ] [Amber Hunt] {elasticsearch/0.17.0-SNAPSHOT/2011-06-21T13:56:26}[6144]: starting ...
...

Note the second row.
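
The log check above can also be scripted. This is only a minimal sketch: the function name plugin_loaded is made up for illustration, and the log file name in the usage comment is an assumption (use whatever file your node actually writes under <ES_HOME>/logs).

```shell
#!/bin/sh
# plugin_loaded LOGFILE PLUGIN
# Succeeds (exit status 0) if LOGFILE contains a "loaded [...]" startup
# line that names PLUGIN, and fails otherwise.
# The function name is hypothetical; it is not part of Elasticsearch.
plugin_loaded() {
    logfile=$1
    plugin=$2
    # Keep only the "loaded [...]" lines, then look for the plugin name
    # as a fixed string (-F) without printing anything (-q).
    grep 'loaded \[' "$logfile" | grep -qF -- "$plugin"
}

# Usage (the log file name is an assumption):
#   plugin_loaded "$ES_HOME/logs/elasticsearch.log" mapper-attachments \
#       && echo "mapper-attachments was loaded"
```

A successful check only shows that the node loaded the plugin at startup; it says nothing about whether a particular document was parsed.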

Regards,
Lukas



(Lukáš Vlček) #3

Once we have made sure that these two things are OK, we can look at mappings
and searching in detail :)



(Frifri) #4

Thank you very much for your support.

I use the precompiled ES 0.16.2 and the two files weren't there, but I
found them in <ES_HOME>/plugins/mapper-attachments/ and just copied
them to lib/.
Perhaps that helps somehow?

And both plugins seem to be properly installed:

[2011-07-07 06:06:09,682][INFO ][node ] [test] {elasticsearch/0.16.2}[2897]: initializing ...
[2011-07-07 06:06:09,711][INFO ][plugins ] [test] loaded [mapper-attachments, analysis-icu]
[2011-07-07 06:06:17,724][INFO ][node ] [test] {elasticsearch/0.16.2}[2897]: initialized



(Lukáš Vlček) #5

Hi,

You can either install plugins manually by copying these files to the lib
folder, or you can use the bin/plugin install command as explained here:
http://www.elasticsearch.org/guide/reference/index-modules/analysis/icu-plugin.html

Once you have both plugins correctly installed and ES is loading them during
startup, you should be ready to use them. Did you try it again? What was
the result this time? Any exceptions in the log files when indexing
attachments or when searching them?

Regards,
Lukas



(Frifri) #6

Hey,

1.) What I meant to say was that I had already installed both plugins
with the command you cited (bin/plugin install ....) and they seem to
be loaded, as I've shown above. Nevertheless, the associated jar files
are not in the lib/ directory, only in plugins/mapper-attachments.
Do you think I have to manually move or copy them?

2.) And you are right, in the log files I get an error when I try to
index a PDF file:

[2011-07-07 15:53:30,296][DEBUG][action.index ] [test] [testindex][0], node[YEPolMwkRhe_eJdNAyuz4Q], [P], s[STARTED]: Failed to execute [index {[testindex][testmap][1], source[]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive xcontent from []
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:181)
    at org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:172)
    at org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:401)
    at org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:380)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:278)
[......]

So this looks as if the plugin for treating the content isn't working.

Thanks again!
Frieder



(Lukáš Vlček) #7

Hi,

On Thu, Jul 7, 2011 at 4:05 PM, Frifri henning.frieder@googlemail.com wrote:

> 1.) what I meant to say was that I had both plugins already installed
> with the command you cited (bin/plugin install ....) and they seem to
> be loaded as I've shown above. Nevertheless, the associated jar-files
> are not in the lib/-directory, only in plugins/mapper-attachments.
> Do you think I have to manually move or copy them?

I normally do not use the plugin script for installation because I copy the
related jar files directly into the lib folder. However, using the plugin
script is probably the better way, and as long as you can see that those
plugins are loaded during node startup, you did a good job (assuming you
made sure you installed the plugins on ALL nodes in the cluster).

> 2.) And you are right, in the logfiles I get an error if I'm trying to
> index a pdf-file:

So we should probably try to find a simple working example, because from a
quick glance your mapping file looks fine to me. Again, I assume that
you index your JSON document into /testindex/testmap/



(Frifri) #8

Yeah, but at the moment I really have no clue. Isn't there any
way to verify whether the plugin is working or not? Or do you
have another idea of what else to try?
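
One thing that can be checked independently of the plugin: the error quoted earlier in the thread shows "source[]" and "Failed to derive xcontent from []", which suggests that the request body may have reached ES empty. A minimal sketch, assuming the JSON document has been saved to a file: the function names check_body and index_doc are hypothetical, the URL reuses the index and type from this thread, and the document id 1 is taken from the log line. It refuses to send an empty file and passes the file to curl with --data-binary, which, unlike -d @file, does not strip newlines.

```shell
#!/bin/sh
# check_body FILE
# Succeeds only if FILE exists and is not empty.
# (Hypothetical helper name, used for illustration only.)
check_body() {
    [ -s "$1" ]
}

# index_doc FILE
# Sends FILE as the body of an index request, but only after the
# emptiness check passes. Index, type, and id follow the thread
# (testindex/testmap, id 1 from the log); adjust them as needed.
index_doc() {
    check_body "$1" || {
        echo "refusing to send empty body: $1" >&2
        return 1
    }
    # --data-binary sends the file as-is; -d @file would strip newlines.
    curl -XPUT 'http://localhost:9200/testindex/testmap/1' --data-binary "@$1"
}

# Usage (doc.json is a hypothetical file name):
#   index_doc doc.json
```

If the body is non-empty and the same exception still appears, the next suspect would be the content of the file rather than the way it is sent.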



(Alberto Paro-2) #9

Pay attention to the correct base64 encoding. I had some problems finding the correct base64 encoding when implementing attachment support in pyes.
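
As one illustration of an encoding pitfall: some base64 tools wrap their output at a fixed column, and a raw line break inside a JSON string makes the document invalid. A minimal sketch, assuming a POSIX shell with base64 and tr available; the function name build_attachment_doc is hypothetical, the document layout mirrors the one posted at the top of this thread, and the file name is assumed to contain no quotes or backslashes.

```shell
#!/bin/sh
# build_attachment_doc PDF_FILE
# Prints a one-line JSON document for the "files" attachment field,
# laid out like the document shown in the first post.
# Assumption: the file name contains no characters that need JSON escaping.
build_attachment_doc() {
    pdf=$1
    # Encode the file and delete any line breaks the encoder inserted,
    # so the value is a single unbroken base64 string.
    content=$(base64 < "$pdf" | tr -d '\n')
    name=$(basename "$pdf")
    printf '{"files":[{"_content_type":"application/pdf","_name":"%s","content":"%s"}]}\n' \
        "$name" "$content"
}

# Usage (file names are placeholders):
#   build_attachment_doc SOURCE.pdf > doc.json
```

Whether line wrapping was the actual cause in this thread is not established; the sketch only removes it as a possible source of trouble.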

Sent from my iPhone

On 07/lug/2011, at 22:04, Frifri henning.frieder@googlemail.com wrote:

Yeah, but at the moment I've really no clue. Isn't there any
possibility to verify whether the plugin is working or not? Or do you
have another idea what to try else?

On 7 Jul., 16:43, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

On Thu, Jul 7, 2011 at 4:05 PM, Frifri henning.frie...@googlemail.comwrote:

Hey,

1.) what I meant to say was that I had both plugins already installed
with the command you cited (bin/plugin install ....) and they seem to
be loaded as I've shown above. Nevertheless, the associated jar-files
are not in the lib/-directory, only in plugins/mapper-attachments.
Do you think I have to manually move or copy them?

I normally do not use plugin script for installation because I directly copy
related jar files into lib folder. However, using plugin script is probably
better way and as long as you can see those plugins are loaded during node
startup then they you did good job (assuming you made sure you installed
plugins on ALL nodes in the cluster).


(Frifri) #10

Do you mean one has to choose UTF-8? By the way, I can't find any
information on this issue in the base64 man page...


(Lukáš Vlček) #11

Hi,

I created a gist with a simple shell script for the attachment plugin. HTH.
Note that it requires Perl.

https://gist.github.com/1075067

Regards,
Lukas
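
For readers who cannot open the gist, here is a rough sketch of the same kind of flow, reconstructed only from what is described in this thread. It is not the gist's actual contents. The index "testindex", the type "testmap" and the field "files" follow the first post; ES_HOST, the function names and the document id 1 are assumptions made for illustration, and the REST paths are the ones used in this thread for ES 0.16/0.17, so they may not apply to other versions.

```shell
#!/bin/sh
# Assumption: a node is reachable here; override ES_HOST if it is not.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Encode a file as one unbroken base64 line (line wraps removed).
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Build the JSON document for the attachment field.
# Naive on purpose: the file name is not JSON-escaped.
build_doc() {
  printf '{"files":{"_content_type":"application/pdf","_name":"%s","content":"%s"}}' \
    "$(basename "$1")" "$(encode_file "$1")"
}

# Full flow against a live node. Call it only when ES is running,
# e.g.: index_and_search some.pdf amplifier
index_and_search() {
  pdf="$1"
  term="$2"   # used as-is in the URL, so keep it to a plain word
  curl -s -XPUT "$ES_HOST/testindex"
  curl -s -XPUT "$ES_HOST/testindex/testmap/_mapping" -d \
    '{"testmap":{"properties":{"files":{"type":"attachment"}}}}'
  build_doc "$pdf" | curl -s -XPUT "$ES_HOST/testindex/testmap/1" -d @-
  # Refresh so the new document is visible to the search that follows.
  curl -s -XPOST "$ES_HOST/testindex/_refresh"
  curl -s "$ES_HOST/testindex/_search?q=$term"
}
```

Splitting out build_doc makes it possible to inspect the request body on its own (for example `build_doc some.pdf | head -c 200`) before anything is sent to the node.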


(Frifri) #12

Hey,

I tried your way and in general it works (meaning without errors), but
the search still returns nothing.

So I went through your script line by line and found that, even
though Perl is installed (sudo apt-get install perl), the line

coded=`cat fn6742.pdf | perl -MMIME::Base64 -e 'print encode_base64($_)'`

doesn't return any content stream. So I changed the line to:

coded=`cat fn6742.pdf | base64 $_`

and voilà, the variable "coded" contained the stream. The indexing
also went well (according to the logs), but there's still no result...

$ curl "${host}/_search?q=amplifier"
{"took":4,"timed_out":false,"_shards":{"total":3,"successful":3,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

Perhaps the base64 command isn't compatible with the encoding needed
for ES?!
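
One thing that can be checked independently of ES is the shape of the base64 output itself. GNU base64 wraps its output at 76 columns by default, and a raw line break inside a JSON string is not valid JSON, so it may be worth making sure the encoded text is a single line and decodes back to the original file. The sketch below does only that; the function names are made up for illustration, and whether line wrapping plays any role in the problem above is an open question.

```shell
#!/bin/sh
# Encode a file as one unbroken base64 line. Piping through tr avoids
# relying on the GNU-only "-w 0" option.
encode_one_line() {
  base64 < "$1" | tr -d '\n'
}

# Succeeds only if decoding the encoded text gives back the original bytes.
roundtrip_ok() {
  encode_one_line "$1" | base64 --decode | cmp -s - "$1"
}
```

For example, `roundtrip_ok fn6742.pdf && echo "encoding looks fine"` prints the message only when the round trip reproduces the file byte for byte.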

On 11 Jul., 00:59, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

I created a gist of a simple shell script for attachment plugin. HTH. Note
it requires Perl.

https://gist.github.com/1075067

Regards,
Lukas


(Lukáš Vlček) #13

Hey,

Sorry, there were some issues in the script. They are fixed now.

The first issue was in the curl command (line 13), which used "_mappings"
instead of "_mapping".
The second issue was in the perl command (line 29), where I was missing the
-n option.
Then I added a refresh after indexing.
Finally, I enhanced the mapping and the query to pull only the document title.

Regards,
Lukas


(Frifri) #14

Thank you so much for your effort. It's finally working now!!

There's only one small issue, and it's a warning rather than an error,
so I don't think it's important right now.

[2011-07-11 14:07:15,192][WARN ][org.apache.pdfbox.pdfparser.PDFParser] Parsing Error, Skipping Object
java.io.IOException: expected='endobj' firstReadAttempt='' secondReadAttempt='' org.apache.pdfbox.io.PushBackInputStream@2a807bbc
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:607)

I normally do not use plugin script for installation because I
directly copy

related jar files into lib folder. However, using plugin script is
probably

better way and as long as you can see those plugins are loaded
during

node

startup then they you did good job (assuming you made sure you
installed

plugins on ALL nodes in the cluster).

2.) And you are right, in the logfiles I get an error if I'm
trying

to

index a pdf-file:

So probably we should try to find a simple working example because
from the

quick glance at your mapping file it looks fine to me. Again I
assume

that

you index your json document into /testindex/testmap/

[2011-07-07 15:53:30,296][DEBUG][action.index ]
[test]

[testindex][0], node[YEPolMwkRhe_eJdNAyuz4Q], [P], s[STARTED]:
Failed

to execute [index {[testindex][testmap][1], source[]}]
org.elasticsearch.ElasticSearchParseException: Failed to derive
xcontent from []
at

org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:

  1. at

org.elasticsearch.common.xcontent.XContentFactory.xContent(XContentFactory.java:

  1. at

org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:

  1. at

org.elasticsearch.index.mapper.xcontent.XContentDocumentMapper.parse(XContentDocumentMapper.java:

  1. at

org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:

[......]

So this looks as if the plugin for treating the content isn't
working.

Thanks again!
Frieder

On 7 Jul., 15:36, Lukáš Vlček lukas.vl...@gmail.com wrote:

Hi,

You can either install plugins manually by copying these files
to

the lib

folder or you can use bin/plugin install command as explained
here:

http://www.elasticsearch.org/guide/reference/index-modules/analysis/i.
..

Once you have both plugins correctly installed and ES is loading
them

during

startup then you should be ready to use them. Did you try it
again?

What

was

the result now? Any exceptions in log files when indexing
attachments or

when searching them?

Regards,
Lukas

On Thu, Jul 7, 2011 at 3:09 PM, Frifri <
henning.frie...@googlemail.com

wrote:

Thanks you very much for your support.

I use the precompiled ES 0.16.2 and the two files weren't
there,

but I

found them to be in ESHOME/plugins/mapper-attachments/ and just
copied

them to lib/.
Perhaps that helps somehow?!

And both plugins seem to be properly installed:

[2011-07-07 06:06:09,682][INFO ][node ]
[test]

{elasticsearch/0.16.2}[2897]: initializing ...
[2011-07-07 06:06:09,711][INFO ][plugins ]
[test]

loaded [mapper-attachments, analysis-icu]
[2011-07-07 06:06:17,724][INFO ][node ]
[test]

{elasticsearch/0.16.2}[2897]: initialized

On 7 Jul., 14:57, Lukáš Vlček lukas.vl...@gmail.com wrote:

Once we make sure that these two things are ok then we can
look at

mappings

and searching in detail :slight_smile:

On Thu, Jul 7, 2011 at 2:54 PM, Lukáš Vlček <
lukas.vl...@gmail.com>

wrote:

Hi,

can you double check that the attachment plugin is installed
correctly?

  1. make sure both tika and mapper attachment jar files are in
    <ES_HOME>/lib

folder

For example if you user 0.17.0-SNAPSHOT you should see the
following

two

files in lib folder:
elasticsearch-mapper-attachments-0.17.0-SNAPSHOT.jar
tika-app-0.9.jar

  1. is attachment plugin recognized when ES startup?

You can see this in log file. Navigate to <ES_HOME>/logs and
check

ES

startup log records sequence. You should see something like

[2011-06-22 11:44:38,456][INFO ][node ]
[Amber

Hunt]

{elasticsearch/0.17.0-SNAPSHOT/2011-06-21T13:56:26}[6144]:
initializing

...

[2011-06-22 11:44:38,481][INFO ][plugins ]
[Amber

Hunt]

loaded [mapper-attachments], sites []
[2011-06-22 11:44:41,322][INFO ][node ]
[Amber

Hunt]

{elasticsearch/0.17.0-SNAPSHOT/2011-06-21T13:56:26}[6144]:
initialized

[2011-06-22 11:44:41,322][INFO ][node ]
[Amber

Hunt]

{elasticsearch/0.17.0-SNAPSHOT/2011-06-21T13:56:26}[6144]:
starting

...

...

Note the second row.

Regards,
Lukas

On Thu, Jul 7, 2011 at 2:37 PM, Frifri <
henning.frie...@googlemail.com

wrote:

Hey everybody,

at the moment I'm testing ES for the use as an indexing and
storaging

solution. I use the following general settings with
analysis-icu

and

mapper-attachments plugins enabled:

curl -XPUT 'http://localhost:9200/testindex/'-d'{
"settings" : {
"index" : {
"number_of_shards" : 2,
"number_of_replicas" : 0
}
}
}'

and the following mapping for pdf indexing:

curl -XPUT localhost:9200/testindex/testmap/_mapping -d '

   {
           "testmap": {
                   "properties" : {
                           "files" : { "type" :

"attachment"

}

                   }
         }
   }

'

Afterwards, i curl a merged XML-File to ES, in which the
content

is

...

Erfahren Sie mehr »


(Lukáš Vlček) #15

This exception comes from Tika, or more precisely from PDFBox, which Tika
delegates PDF parsing to. Generally speaking, parsing PDF documents is a
pain for everybody, not only for Tika, so chances are that the parser is
hitting a problem with this particular document. The question is whether it
is serious. If you are seeing this while parsing the document from my
example, note that this PDF contains a lot of complex charts and tables with
text. Such documents (highly technical product sheets) are typically
generated automatically from a product data store, so some formatting issues
are probably to be expected (but I am just guessing here).
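
One rough way to judge whether the warning matters is to count how often it occurs and to confirm that a word you know is in the PDF can still be found. The sketch below is only an illustration: the function names, the log path and the host are assumptions, and the `sed` expression assumes the search response carries `"hits":{"total":<number>` as in ES releases of that era.

```shell
#!/usr/bin/env bash
# Two rough checks for "is this PDFBox warning serious?".
# Function names, the log path and the host are illustrative assumptions.

# Count PDFBox "Skipping Object" warnings in a log file.
count_pdfbox_warnings() {
  local logfile="$1"
  # -F: fixed string, -c: count matching lines. grep exits non-zero when the
  # count is 0, so "|| true" keeps the function usable under "set -e".
  grep -cF 'PDFParser] Parsing Error, Skipping Object' "$logfile" || true
}

# Print the total hit count for a query-string search.
# Assumes the response contains "hits":{"total":<number>.
search_hit_count() {
  local host="$1" term="$2"
  curl -s "${host}/_search?q=${term}" |
    sed -n 's/.*"hits":{"total":\([0-9][0-9]*\).*/\1/p'
}

# Usage (not run here):
#   count_pdfbox_warnings /path/to/your/es.log
#   search_hit_count http://localhost:9200 someword
```

If the search count is above zero for terms taken from the document, the skipped objects probably did not prevent text extraction.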


(Ramapriya N) #16

Hi,

I am using ELK 5.4 and am unable to index a PDF file. Is it because the ingest plugin installation showed the warnings below?

[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

  • java.lang.RuntimePermission getClassLoader
  • java.lang.reflect.ReflectPermission suppressAccessChecks
  • java.security.SecurityPermission createAccessControlContext
  • java.security.SecurityPermission insertProvider
  • java.security.SecurityPermission putProviderProperty.BC

See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]

Has anyone else faced these warnings? Thanks.

Ramapriya


(David Pilato) #17

The warnings are expected, as the plugin requires some additional permissions to run.
But you had better open a new thread instead of replying to one that is outdated.


(Ramapriya N) #18

I thought about that, but the problem is still the same.


(Ramapriya N) #19

Can you help me with indexing PDF/DOCX files in ES 5.4, or point me to a link where this has already been done? Thanks.


(David Pilato) #20

Can you open a new question please?
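
For readers arriving with the same ES 5.x question, here is a minimal sketch of how indexing through the ingest-attachment plugin might look. It assumes the plugin is installed and the node is reachable. `ES_HOST`, the pipeline id `attachment`, the index and type names `docs`/`doc`, the field name `data` and the function names are all illustrative assumptions, not something taken from this thread.

```shell
#!/usr/bin/env bash
# Sketch for Elasticsearch 5.x with the ingest-attachment plugin installed.
# ES_HOST, the pipeline id, index/type names and the "data" field are assumptions.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Base64-encode a file as one line; raw newlines are not valid inside a JSON string.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# Create an ingest pipeline whose attachment processor reads the "data" field.
create_pipeline() {
  curl -s -XPUT "${ES_HOST}/_ingest/pipeline/attachment" \
    -H 'Content-Type: application/json' \
    -d '{"description":"Extract attachment information","processors":[{"attachment":{"field":"data"}}]}'
}

# Index one file through the pipeline. The body is piped via stdin so that
# large files do not hit the command-line length limit.
index_file() {
  local file="$1" id="$2"
  printf '{"data":"%s"}' "$(encode_file "$file")" |
    curl -s -XPUT "${ES_HOST}/docs/doc/${id}?pipeline=attachment" \
      -H 'Content-Type: application/json' \
      --data-binary @-
}

# Usage (not run here):
#   create_pipeline
#   index_file ./some.pdf 1
```

With this setup the extracted text should end up under the `attachment.content` field of the indexed document, which is the field to search.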