How to modify field contents during indexing?


(Rogerio Pereira) #1

Hi,

I would like to know if there's a mechanism in Elasticsearch that allows me
to modify a document's field contents based on a custom rule.

Solr has a feature called update processors, where we can use several
languages such as Python, JavaScript and Ruby to write scripts that
perform any kind of document manipulation before indexing.

Let me explain my scenario a little better. I'm using Nutch and its brand-new
elasticindex command, which pushes crawled documents into Elasticsearch.
I'm trying to create some facets in Elasticsearch based on fields that are
populated from the content of another field. That source field is analysed
by a custom script that loads a list of words and tries to find them in the
content. For example: given a content field holding the PDF text, the script
loads a list of languages, searches for them in the content field and, if
any are found, sets the language field to all the languages found in the
source field.
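
To make the scenario concrete, here is a minimal client-side sketch of that lookup in Python. It is not an Elasticsearch or Nutch feature: the function name `tag_languages`, the `LANGUAGE_TERMS` list and the default field names `content` and `language` are all assumptions for illustration, and the word list is hard-coded rather than loaded from anywhere.

```python
import re

# Hypothetical word list; in the real scenario it would be loaded by the script.
LANGUAGE_TERMS = ["english", "portuguese", "spanish"]


def tag_languages(doc, terms=LANGUAGE_TERMS, source="content", target="language"):
    """Return a copy of doc with `target` set to the terms found in `source`."""
    text = (doc.get(source) or "").lower()
    # Whole-word, case-insensitive match for each term in the list.
    found = [t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", text)]
    enriched = dict(doc)  # leave the caller's document untouched
    if found:
        enriched[target] = found
    return enriched


if __name__ == "__main__":
    crawled = {"content": "This manual is available in English and Spanish."}
    print(tag_languages(crawled))
```

The enriched document would then be sent to the index in place of the original, so the `language` field is available for faceting.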

Does this kind of mechanism exist in Elasticsearch?

Please let me know if I wasn't clear.

Thanks for any answer

Rogério



(Jörg Prante) #2

Have a look at the attachment mapper plugin; it can perform PDF text
extraction. From what I can see in
https://issues.apache.org/jira/secure/attachment/12538782/NUTCH-1445.patch
you should be able to create the ES index settings and mappings before
Nutch starts. I don't know the fields Nutch uses, but you may be able to
find out how to hook the attachment mapper plugin to the Nutch PDF field.

Do I understand correctly that you want to detect languages? This could be
another feature of the Tika integration in the attachment mapper, or a
separate feature. Or maybe another plugin for language detection, because
Tika's language detection performance should be compared to
http://code.google.com/p/language-detection/ and maybe others. See also
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html

Note that with the combo analyzer plugin
(https://github.com/yakaz/elasticsearch-analysis-combo) you can
process the PDF text with as many analyzers as you want. This helped me a
lot with multilingual indexing into a single field.

For doing more than language detection, I am afraid generic content
scripting is still an open feature. Processing PDF text or other content
with scripts, either on the Nutch elasticwriter client side or on the ES
mapper server side, would be another feature request. Or am I wrong?

Best regards,

Jörg



(Rogerio Pereira) #3

Hi Jörg, you are correct. I'm looking for content processing with scripts on
the Elasticsearch side, very similar to FAST ESP stages and Solr update
processors.

Sometimes I want to perform some extra content processing that can only be
done before or during indexing, such as detecting named entities or nouns,
or simply doing a dictionary lookup on a specific field that sets a value in
another field.
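
As a rough illustration of what such a chain of stages could look like when run on the client before indexing, here is a small Python sketch. It does not use any Elasticsearch, Solr or FAST ESP API; `lookup_stage`, `run_pipeline` and the field names are hypothetical names chosen for the example.

```python
import re
from typing import Callable, Dict, Iterable

Stage = Callable[[Dict], Dict]


def lookup_stage(source: str, target: str, dictionary: Dict[str, str]) -> Stage:
    """Build a stage that maps words found in `source` to values stored in `target`."""
    def stage(doc: Dict) -> Dict:
        words = re.findall(r"\w+", str(doc.get(source, "")).lower())
        hits = sorted({dictionary[w] for w in words if w in dictionary})
        # Only add the target field when at least one word matched.
        return {**doc, target: hits} if hits else doc
    return stage


def run_pipeline(doc: Dict, stages: Iterable[Stage]) -> Dict:
    """Apply each stage in order; every stage receives the previous stage's output."""
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    stages = [lookup_stage("content", "countries",
                           {"paris": "France", "lisbon": "Portugal"})]
    print(run_pipeline({"content": "Offices in Paris and Lisbon."}, stages))
```

Each stage is a plain function from document to document, so further steps such as entity or noun detection could be added to the list without changing `run_pipeline`.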



(Rogerio Pereira) #4

I believe the elasticsearch-partialupdate plugin can point me in the right
direction on how to implement it.


