Decorating _search with additional data

Hi everyone,

I have a question related to ES internals (providing an extension to search
functionality).

We have a customer that would like to integrate clustering of search
results directly with ES so that it happens as part of the search. This
functionality is essentially identical to just plain searches, with some
additional parameters to determine the clustering algorithm to use, etc. I
know medcl already implemented a Carrot2 plugin for ES and I looked at his
code but for us we will need something more generic to also allow
proprietary clustering algorithms to be used with the plugin in a seamless
way. But back to the point.

I've been looking at the architecture and ways this could be accomplished
(by the way -- kudos to everyone involved, the code looks and works very
cool... bonsai cool) and have a few questions that popped up.

  1. It seems that the "nicest" way to accomplish the task in question would
    be to somehow plug into the search action, ideally as a SearchPhase (or
    rather a FetchSubPhase). These sub-phases currently seem to be fixed and
    not extensible.... and I can already see the problems with serialization if
    the search result is somehow augmented at this level. Do you think it's at
    all possible (and a good idea) to try to plug it in there?

  2. Since (1) seemed very intrusive I temporarily implemented a custom
    plugin (and an action/ request/ response pair). My code essentially does
    nothing but delegates most of its internal workings to Search*: currently
    all the "logic" that actually does the clustering resides in a subclass
    of TransportAction; in doExecute it delegates to TransportSearchAction,
    then inside onResponse it clusters the result and returns the augmented
    response back to the user.

This works fine, but clustering is pretty heavy on computational resources
and I wondered whether TransportAction is a good place to put this logic and
what threading (threadpool) magic should be used to make it fit with the
rest of ES.

Another problem is that the rest handler could be implemented in pretty
much the same way but the search-request parsing logic in
RestSearchAction#parseSearchRequest is currently private and there is no
way to reuse that (and I'd say it begs for reuse since it's far from
trivial and copy-paste will most likely go out of sync in future versions).

Thanks for all the tips and hints,
Dawid

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

  1. It seems that the "nicest" way to accomplish the task in question would
    be to somehow plug into the search action, ideally as a SearchPhase (or
    rather a FetchSubPhase). These sub-phases currently seem to be fixed and
    not extensible.... and I can already see the problems with serialization if
    the search result is somehow augmented at this level. Do you think it's at
    all possible (and a good idea) to try to plug it in there?

Other than forking the code base, I don't see a way to add your own
SearchPhase or FetchSubPhase. There are a few extension points in the
codebase: custom queries / filters, tokenizers and token filters,
discovery, facets, and, more recently, suggesters and highlighters.

  2. Since (1) seemed very intrusive I temporarily implemented a custom
    plugin (and an action/ request/ response pair). My code essentially does
    nothing but delegates most of its internal workings to Search*: currently
    all the "logic" that actually does the clustering resides in a subclass
    of TransportAction; in doExecute it delegates to TransportSearchAction,
    then inside onResponse it clusters the result and returns the augmented
    response back to the user.

I think in the current codebase this is the only way of creating the type
of plugin that you need.
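
The delegate-then-augment shape described above can be sketched roughly as follows. This is only an illustration in plain Java: Listener, ClusteredResponse and ClusteringAction are hypothetical stand-ins, not the real ES ActionListener / TransportAction classes, and the "clustering" is a toy grouping by first word rather than Carrot2 or any real algorithm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiConsumer;

final class ClusteringAction {
    // Stand-in for the delegate search action: takes a query and a listener.
    private final BiConsumer<String, Listener<List<String>>> search;

    ClusteringAction(BiConsumer<String, Listener<List<String>>> search) {
        this.search = search;
    }

    // Plays the role of doExecute: run the plain search, then augment its result.
    void execute(String query, Listener<ClusteredResponse> listener) {
        search.accept(query, new Listener<List<String>>() {
            @Override
            public void onResponse(List<String> hits) {
                Map<String, List<String>> clusters;
                try {
                    clusters = cluster(hits);
                } catch (RuntimeException e) {
                    listener.onFailure(e); // clustering failed: report it once
                    return;
                }
                listener.onResponse(new ClusteredResponse(hits, clusters));
            }

            @Override
            public void onFailure(Throwable t) {
                listener.onFailure(t); // search failures pass straight through
            }
        });
    }

    // Toy clustering (assumption for illustration): group hits by lowercased first word.
    static Map<String, List<String>> cluster(List<String> hits) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String hit : hits) {
            String key = hit.trim().split("\\s+")[0].toLowerCase(Locale.ROOT);
            clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(hit);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // A fake search that answers immediately with three hits.
        ClusteringAction action = new ClusteringAction(
            (query, l) -> l.onResponse(Arrays.asList("apple pie", "Apple tart", "banana bread")));
        action.execute("dessert", new Listener<ClusteredResponse>() {
            @Override public void onResponse(ClusteredResponse r) { System.out.println(r.clusters); }
            @Override public void onFailure(Throwable t) { t.printStackTrace(); }
        });
    }
}

// Hypothetical stand-in for an ES-style response listener (not the real interface).
interface Listener<T> {
    void onResponse(T response);
    void onFailure(Throwable t);
}

// Hypothetical augmented response: the original hits plus the computed clusters.
final class ClusteredResponse {
    final List<String> hits;
    final Map<String, List<String>> clusters;

    ClusteredResponse(List<String> hits, Map<String, List<String>> clusters) {
        this.hits = hits;
        this.clusters = clusters;
    }
}
```

The point of the wrapper listener is that the search itself stays untouched; only the response is decorated on the way back, and failures from either step reach the caller through the same onFailure path.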

This works fine, but clustering is pretty heavy on computational resources
and I wondered whether TransportAction is a good place to put this logic and
what threading (threadpool) magic should be used to make it fit with the
rest of ES.

Make sure that the clustering never runs on a network thread; offload it
to a thread from the threadpool. There are a few thread pools, and I think
in your case you should use the search thread pool.
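
A minimal sketch of that offloading idea, assuming a plain java.util.concurrent.ExecutorService as a stand-in for the ES search thread pool (how the real pool is obtained inside a plugin is not shown here). OffloadingListener and its callbacks are hypothetical names for illustration only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

final class OffloadingListener<T, R> {
    private final ExecutorService workers;       // stand-in for the search thread pool
    private final Function<T, R> heavyWork;      // e.g. the clustering step
    private final Consumer<R> onResult;
    private final Consumer<Throwable> onFailure;

    OffloadingListener(ExecutorService workers, Function<T, R> heavyWork,
                       Consumer<R> onResult, Consumer<Throwable> onFailure) {
        this.workers = workers;
        this.heavyWork = heavyWork;
        this.onResult = onResult;
        this.onFailure = onFailure;
    }

    // Called on the (simulated) network thread: only schedules work, then returns.
    void onResponse(T response) {
        try {
            workers.execute(() -> {
                R result;
                try {
                    result = heavyWork.apply(response); // runs on a pool thread
                } catch (RuntimeException e) {
                    onFailure.accept(e);
                    return;
                }
                onResult.accept(result);
            });
        } catch (RejectedExecutionException e) {
            onFailure.accept(e); // pool is saturated or shut down
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch done = new CountDownLatch(1);
        OffloadingListener<Integer, String> listener = new OffloadingListener<Integer, String>(
            pool,
            n -> "clustered " + n + " hits on " + Thread.currentThread().getName(),
            result -> { System.out.println(result); done.countDown(); },
            error -> { error.printStackTrace(); done.countDown(); });
        listener.onResponse(100); // returns immediately; work happens on the pool
        done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
    }
}
```

Handling RejectedExecutionException matters here: a bounded pool can refuse work under load, and the caller should get a failure rather than silence.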

Another problem is that the rest handler could be implemented in pretty
much the same way but the search-request parsing logic in
RestSearchAction#parseSearchRequest is currently private and there is no
way to reuse that (and I'd say it begs for reuse since it's far from
trivial and copy-paste will most likely go out of sync in future versions).

I think the parseSearchRequest method can be made protected.

--

Hi Dawid,

Is there any specific reason you are not interested in a Facet plugin?

In an application I worked on, we needed responses similar to Carrot2
Lingo; the only difference was that our data was not search results but
complete social media content: blogs, news, comments, etc. The number of
such documents was also huge, more than a million, and even a single search
would fetch no fewer than a few thousand docs.

We developed a custom elasticsearch facet plugin for the same, compatible
with ES 0.19.*. We derived a custom distance logic to detect document
similarities, and also derived our own full text clustering algorithm in a
map-reduce architecture. It is fully functional and pretty fast.

In fact we have plans to rewrite the same plugin with some improvements
compatible with ES 0.90.*, but that is still pending since we are occupied
with other tasks.

-- Sujoy.

On Wednesday, June 26, 2013 1:40:41 PM UTC+5:30, Dawid Weiss wrote:

[...]

--

Thanks for the hints, Martijn.

I think the parseSearchRequest method can be made protected.

For me it'd have to be public -- I don't want to subclass, I want to
keep my stuff separate and just delegate parsing of a fragment of my
request that I know is a search request. It'd be nice if it could be
made reusable I guess.
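
To illustrate the delegation (rather than subclassing) being asked for, here is a rough sketch. Everything in it is hypothetical: SearchParams, ParsedSearch and ClusteringRestHandler are made-up names, the request is modelled as a plain parameter map, and the "algorithm" parameter and the defaults are assumptions of the sketch. It is not the actual RestSearchAction code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical parsed form of the search part of a request.
final class ParsedSearch {
    final String query;
    final int from;
    final int size;

    ParsedSearch(String query, int from, int size) {
        this.query = query;
        this.from = from;
        this.size = size;
    }
}

// Hypothetical public, stateless parser that any handler can call.
final class SearchParams {
    private SearchParams() {}

    public static ParsedSearch parse(Map<String, String> params) {
        String query = params.getOrDefault("q", "*");
        int from = Integer.parseInt(params.getOrDefault("from", "0"));
        int size = Integer.parseInt(params.getOrDefault("size", "10"));
        if (from < 0 || size < 0) {
            throw new IllegalArgumentException("from and size must be non-negative");
        }
        return new ParsedSearch(query, from, size);
    }
}

// A separate handler that delegates search parsing and adds its own parameter.
final class ClusteringRestHandler {
    static String handle(Map<String, String> params) {
        ParsedSearch search = SearchParams.parse(params);                 // reused, not copied
        String algorithm = params.getOrDefault("algorithm", "default");  // handler-specific
        return "cluster[" + algorithm + "](" + search.query
            + ", from=" + search.from + ", size=" + search.size + ")";
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("q", "foo");
        params.put("size", "5");
        params.put("algorithm", "lingo");
        System.out.println(handle(params));
    }
}
```

With a public static entry point the clustering handler has no inheritance relationship with the search handler, so the parsing logic lives in one place and cannot drift out of sync through copy-paste.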

Dawid

--

Is there any specific reason you are not interested in a Facet plugin?

For faceting you'd need to extract something and keep it in the index
(named entities, whatever). Clustering is dynamic and you don't need
to do that. There are pros and cons to both (and they are
complementary most of the time).

Besides, I'm one of the authors of Carrot2 so I have an obvious reason
to stick to my stuff :)

We derived a custom distance logic to detect document similarities, and also derived our own full text clustering algorithm in a map-reduce architecture. It is fully functional and pretty fast.

This would be an interesting piece of code in general (even for
Mahout). Let me know if you publish it somewhere, I'd be interested.

Dawid

--

I would be interested too.

@Dawid: I just discovered Carrot2, great piece of site/software!

--

+1 to making this extensible.

If I understand what Dawid did, I would think he would ideally not have to
write a custom Rest Action because that means the client needs to be
changed and told to go use this new Rest Action, which is not nice.
Ideally custom work being done before and/or after search is executed
would be transparent to the client.

I asked more or less the same question a few days ago but got no
replies: https://groups.google.com/d/msg/elasticsearch/VT_nl8Dwu7o/bUDwzsfxU0gJ

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

On Friday, June 28, 2013 1:37:03 AM UTC-4, Christoph Evers wrote:

Hi,

I would be interested too.

@Dawid: I just discovered Carrot2, great piece of site/software!

--

And there's medcl's great plugin, too.

Jörg

On 28.06.13 07:38, Christoph Evers wrote:

I would be interested too.

@Dawid: I just discovered Carrot2, great piece of site/software!

--

I think the challenge is deeper. All actions in the REST API contain
parsing code to bridge between the transport-style encoded form and
Java. Imagine a non-REST API added to ES. It would have to re-implement
all parsing code in the REST action classes.

Making this reusable would mean refactoring all the protocol parsing
code and reducing the REST API classes to a minimum.

Jörg

On 27.06.13 08:27, Dawid Weiss wrote:

Thanks for the hints, Martijn.

I think the parseSearchRequest method can be made protected.
For me it'd have to be public -- I don't want to subclass, I want to
keep my stuff separate and just delegate parsing of a fragment of my
request that I know is a search request. It'd be nice if it could be
made reusable I guess.

Dawid
