Document pre-processor


(Otis Gospodnetić) #1

Hello,

What is the best way to inject a custom document "processor" where one
could manipulate the document and its fields before the document gets
indexed?

This is what Solr has, for example:

http://search-lucene.com/jd/solr/org/apache/solr/update/processor/UpdateRequestProcessor.html
http://wiki.apache.org/solr/UpdateRequestProcessor

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


(Shay Banon) #2

There isn't really a hook point for this (i.e. one that gets the content
(JSON / xson) and munges it). It can be added, but I think that it makes more
sense to do it on the "client" side if one wishes to do it, no? The fact that
elasticsearch supports JSON-like structures natively (with deep-level objects)
means that you can basically create that pre-processor on your client side.
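
A minimal sketch of what such a client-side pre-processor could look like,
assuming the document is held as a plain Python dict before it is serialized
to JSON. The names here (lowercase_tags, add_title_length, preprocess,
to_index_body) and the field names are hypothetical and are not part of
elasticsearch; sending the resulting body to the index API is left out.

```python
import json
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Processor = Callable[[Doc], Doc]


def lowercase_tags(doc: Doc) -> Doc:
    """Hypothetical processor: normalize the 'tags' field to lowercase."""
    out = dict(doc)
    out["tags"] = [tag.lower() for tag in doc.get("tags", [])]
    return out


def add_title_length(doc: Doc) -> Doc:
    """Hypothetical processor: add a field derived from the title."""
    out = dict(doc)
    out["title_length"] = len(doc.get("title", ""))
    return out


def preprocess(doc: Doc, processors: List[Processor]) -> Doc:
    """Run the document through each processor in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


def to_index_body(doc: Doc, processors: List[Processor]) -> str:
    """Return the JSON string that the client would send for indexing."""
    return json.dumps(preprocess(doc, processors), sort_keys=True)


raw = {"title": "Hello", "tags": ["Solr", "LUCENE"]}
body = to_index_body(raw, [lowercase_tags, add_title_length])
print(body)
```

Each processor returns a new dict instead of mutating its input, so the
original document stays untouched and the chain can be reordered freely.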

Another option is to create your own mapper that handles the part of the
data you want, but I think that is also more complicated than simply doing
it on the client side.

-shay.banon


(Otis Gospodnetić) #3

Hello,

I looked at http://www.elasticsearch.com/docs/elasticsearch/rest_api/index/
and at http://www.elasticsearch.com/docs/elasticsearch/index_modules/engine/robin/ ,
but couldn't find anything that sounds like Solr's UpdateRequestProcessor.

Am I not looking in the right places?
Is this feature missing but already on the roadmap?

Thanks,
Otis


(Shay Banon) #4

I responded to this, but it seems like you might have missed it. Here is my
answer again:

There isn't really a hook point for this (i.e. one that gets the content
(JSON / xson) and munges it). It can be added, but I think that it makes more
sense to do it on the "client" side if one wishes to do it, no? The fact that
elasticsearch supports JSON-like structures natively (with deep-level objects)
means that you can basically create that pre-processor on your client side.

Another option is to create your own mapper that handles the part of the
data you want, but I think that is also more complicated than simply doing
it on the client side.

-shay.banon


(Otis Gospodnetić) #5

Hi,

On Jul 21, 12:54 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

I responded to this, but it seems like you might have missed it, here is my
answer again:

Check http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/228cf9757b53b74e

  • no reply there.

There isn't really a hook point for this (i.e. one that gets the content
(JSON / xson) and munges it). It can be added, but I think that it makes more
sense to do it on the "client" side if one wishes to do it, no? The fact that
elasticsearch supports JSON-like structures natively (with deep-level objects)
means that you can basically create that pre-processor on your client side.

Looks like that will have to be the route.

Another option is to create your own mapper that handles the part of the
data you want, but I think that is also more complicated than simply doing
it on the client side.

The nice part about it being in ES is that different clients don't
have to implement this or even be aware of documents being munged
before indexing.

Otis


(Lukáš Vlček) #6

On Thu, Jul 22, 2010 at 1:05 AM, Otis otis.gospodnetic@gmail.com wrote:

Hi,

On Jul 21, 12:54 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

I responded to this, but it seems like you might have missed it, here is my
answer again:

Check
http://groups.google.com/a/elasticsearch.com/group/users/browse_thread/thread/228cf9757b53b74e

  • no reply there.

Interestingly, the reply is here:
http://elasticsearch-users.115913.n3.nabble.com/Document-pre-processor-tp979569p979702.html


(Shay Banon) #7

Very strange, looks like Google Groups is acting up...

It's a very valid point that you can have the same
pre-processing power for different clients if it is built into
elasticsearch. The way elasticsearch works with data is that the document
(JSON) passed in is never built in memory (as a Map of Maps/Lists/..., for
example); it is pull parsed directly into a Lucene document (to save memory
and speed up indexing). So, in order to have a hook point that changes it
within elasticsearch, it would need to be built in memory, passed to the
extension point to munge it, then converted back to an xcontent format
(JSON), and then parsed again.
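
The extra work described above can be sketched as follows. This is an
illustration of the steps only, not elasticsearch code: to_fields is a
hypothetical stand-in for the parse that produces a Lucene document (it uses
json.loads, so it is not a true pull parser), and index_direct,
index_with_hook and mark_processed are made-up names.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

Field = Tuple[str, Any]


def to_fields(source: str) -> List[Field]:
    """Stand-in for the indexing parse: flatten JSON into (path, value) pairs."""
    fields: List[Field] = []

    def walk(prefix: str, value: Any) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, list):
            for item in value:
                walk(prefix, item)
        else:
            fields.append((prefix, value))

    walk("", json.loads(source))
    return fields


def index_direct(source: str) -> List[Field]:
    """Path without a hook: the source is parsed once, straight into fields."""
    return to_fields(source)


def index_with_hook(source: str, hook: Callable[[Dict[str, Any]], Dict[str, Any]]) -> List[Field]:
    """Path with a hook: build in memory, munge, serialize, then parse again."""
    tree = json.loads(source)      # 1. build the whole document in memory
    tree = hook(tree)              # 2. pass it to the extension point
    rewritten = json.dumps(tree)   # 3. convert it back to JSON
    return to_fields(rewritten)    # 4. parse it again for indexing


def mark_processed(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical hook that adds one field."""
    return {**doc, "processed": True}


source = '{"user": {"name": "Otis"}, "tags": ["a", "b"]}'
print(index_direct(source))
print(index_with_hook(source, mark_processed))
```

Steps 1 and 3 are the added cost of an in-server hook: the document is fully
materialized and re-serialized before the normal parse can start.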

-shay.banon

