Have a look at the attachment mapper plugin; it can perform PDF text
extraction. From what I can see in
https://issues.apache.org/jira/secure/attachment/12538782/NUTCH-1445.patch
you should be able to create the ES index settings and mappings before
Nutch starts. I don't know the fields Nutch uses, but you may be able to
find out how to hook the attachment mapper plugin to the Nutch pdf field.
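As a rough sketch of what "create the mappings before Nutch starts" could
look like: the snippet below only builds the index-creation request, it
does not send it. The index name, type name and field name (nutch, doc,
content) are placeholders I made up, since I don't know which fields the
Nutch indexer really writes. The "attachment" type comes from the
attachment mapper plugin, which must be installed on the node.

```python
import json
import urllib.request

# Hypothetical names: adjust to whatever the Nutch indexer actually writes.
ES_URL = "http://localhost:9200"
INDEX = "nutch"
DOC_TYPE = "doc"
PDF_FIELD = "content"


def build_index_body(doc_type=DOC_TYPE, pdf_field=PDF_FIELD):
    """Return index settings and mappings with one attachment-typed field."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            doc_type: {
                "properties": {
                    # "attachment" is the field type added by the
                    # attachment mapper plugin.
                    pdf_field: {"type": "attachment"},
                }
            }
        },
    }


def build_create_request(es_url=ES_URL, index=INDEX):
    """Prepare (but do not send) the PUT request that creates the index."""
    data = json.dumps(build_index_body()).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url}/{index}",
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_create_request()
    print(req.get_method(), req.full_url)
    print(json.dumps(build_index_body(), indent=2))
    # To actually create the index: urllib.request.urlopen(req)
```

Keep in mind that the attachment type expects the raw document as
base64-encoded bytes. If Nutch only sends already-parsed text, this mapping
would not apply as-is.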
Do I understand correctly that you want to detect languages? This could be
another feature of the Tika integration in the attachment mapper, or a
separate feature. Or maybe another plugin for language detection, because
Tika's language detection performance should be compared to other
detectors, such as a project hosted on Google Code (now in the Google Code
Archive), and maybe others. See also the Changing Bits post "Accuracy and
performance of Google's Compact Language Detector".
Note that with the combo analyzer plugin (yakaz/elasticsearch-analysis-combo
on GitHub) you can process the PDF text with as many analyzers as you
want. This helped me a lot with multilingual indexing into a single field.
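To illustrate the single-field idea, here is a small sketch that builds
index settings for one combo analyzer wrapping several language analyzers.
The analyzer name "multilingual" is made up. The "combo" type and the
"sub_analyzers" option are written from my memory of the plugin README,
so treat them as assumptions and check them against the version you
install.

```python
import json


def combo_settings(sub_analyzers):
    """Build index settings defining one combo analyzer over the given analyzers."""
    if not sub_analyzers:
        raise ValueError("at least one sub-analyzer is required")
    return {
        "analysis": {
            "analyzer": {
                "multilingual": {  # hypothetical analyzer name
                    "type": "combo",  # assumed type name from the plugin
                    "sub_analyzers": list(sub_analyzers),  # assumed option name
                }
            }
        }
    }


if __name__ == "__main__":
    settings = combo_settings(["standard", "english", "french", "german"])
    print(json.dumps(settings, indent=2))
```

The resulting settings would go into the index creation request, and the
PDF text field would then reference the "multilingual" analyzer in its
mapping.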
For doing more than language detection, I am afraid generic content
scripting is still an open feature. Processing PDF text or other content
with scripts, either on the Nutch elasticwriter client side or on the ES
mapper server side, would be another feature request. Or am I wrong?
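Until something like that exists, the rule from the quoted question could
be applied on the client side before the document is sent to ES. Here is a
minimal sketch, assuming the document is a plain dict and the word list is
a short hard-coded list of language names. The field names "content" and
"language" and both function names are hypothetical. Note this is simple
keyword spotting as described in the question, not statistical language
detection.

```python
import re

# Hypothetical word list; in practice it could be loaded from a file.
LANGUAGES = ["English", "Portuguese", "Spanish", "German"]


def find_languages(content, languages=LANGUAGES):
    """Return the listed names that occur as whole words in content, in list order."""
    found = []
    for name in languages:
        # \b keeps "German" from matching inside "Germany".
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, content, flags=re.IGNORECASE):
            found.append(name)
    return found


def enrich(doc, source_field="content", target_field="language"):
    """Return a copy of doc with target_field set to the names found in source_field."""
    enriched = dict(doc)
    enriched[target_field] = find_languages(doc.get(source_field, ""))
    return enriched


if __name__ == "__main__":
    doc = {
        "url": "http://example.com/cv.pdf",
        "content": "Fluent in english and Portuguese.",
    }
    print(enrich(doc))
```

The enriched dict carries the new language field as a list, so it can be
indexed as a multi-valued field and used for faceting.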
Best regards,
Jörg
On Sunday, November 4, 2012 1:22:15 AM UTC+1, Rogerio Pereira wrote:
Hi,
I would like to know if there's a mechanism in elasticsearch that allows
me to modify a document's field contents based on a custom rule.
Solr has a feature called updateprocessor, where we can use several
languages like python, javascript and ruby to create scripts that can
perform any kind of document manipulation before indexing.
Let me explain my scenario a little better. I'm using nutch and its brand
new elasticindex command, with which we can push crawled documents into
elasticsearch. I'm trying to create some facets in elasticsearch
based on fields that are populated from another field's content. That
field is analysed by a custom script that loads a list of words and tries
to find them in the content. For example: given a content field with the
pdf text and a script that loads a list of languages, the script looks
for them in the content field and, if any are found, sets the language
field to all the languages found in the source field.
Does this kind of mechanism exist in elasticsearch?
Please let me know if I wasn't clear.
Thanks for any answer
Rogério