Pseudo map-reduce for searchresults

Eks_Dev · March 15, 2011, 5:19pm

Is there a way to add some kind of plug-in support that is doing "user
specific magic" on a results list.

An example, we need to process search results, modify and ship them.
One can do it @client side, but it does not scale.

What I am talking about is pushing some functionality down to shards
in order to "enrich results".... think clustering, deduplication (kind
of map reduce at search request level). This way we distribute cpu
load to shards, and client needs only to merge.

Any idea how to do this?

I am totally new to ES, but so far it looks great!
Cheers,
Eks

ppearcy · March 15, 2011, 7:28pm

This may not satisfy your case, but the built in way to push logic
down into the engine is via script fields. This can only work with the
data on a single document, though. I don't think it is possible to
push logic like deduplication, which spans multiple docs into the
engine.

An approach we are considering (external to ES) for this is to have a
primary index and then multiple view indexes where the data is
preprocessed into the secondary indexes based on rules.

On Mar 15, 11:19 am, eks dev eks...@googlemail.com wrote:

Is there a way to add some kind of plug-in support that is doing "user
specific magic" on a results list.

An example, we need to process search results, modify and ship them.
One can do it @client side, but it does not scale.

What I am talking about is pushing some functionality down to shards
in order to "enrich results".... think clustering, deduplication (kind
of map reduce at search request level). This way we distribute cpu
load to shards, and client needs only to merge.

Any idea how to do this?

I am totally new to ES, but so far it looks great!
Cheers,
Eks

Eks_Dev · March 15, 2011, 8:12pm

script is a cute little feature, but unfortunately does not do the
job.

External solution is exactly what i am trying to avoid by using ES.
What I really like with ES is this dual-thinking, data is kept safely
local to shards.

Just compare a pain in setting and maintaining setup,

external NoSQl store, like hbase/cassandra...
solr cloud/homo grown something/
and keeping all these in sync against ES on steroids

Or you doble the storage (once in lucene and once in DB) or you hope
your store offers magic performance on multiget 3k hits.... you need
to roll your own servers to do heavy lifting for processing.... etc.

It is doable of course, but having just a couple of features more, one
could do it right with ES. But this is also a question for Shai if he
wants to push ES into this direction

map() is implicit by shards
reduce() should be provided by user as a library implementing
predefined API
combine() would be nice as well
some nice infrastructure to select query type (if and which reduce/
combine phase is to be invoked...), maintain plug-ins...

sounds doable

kimchy · March 16, 2011, 12:15am

Heya,

This is certainly something that is on the roadmap. Let me explain some of the things that require thinking when implementing it:

Facets are already doing map reduce over the hits matching a query, so the infra for (smaller) results map reduce is already there. If you have a look at the code, you will see that facets are quite modularized, and its not that difficult to implement a new custom one, though, still not to the level of providing it as a public API. More over, when implemented, I would like to allow for Javascript / mvel / python / groovy scripts as pluggable map reduce implementations.
Another aspect is when running a map reduce job that can result in a large amount of data. Both transferred back to the reduce phase, and sent back to the client. This requires an implementation of a streaming functionality in elasticsearch to support that.
The above, 1 and 2, talk about having map reduce implemented on the "search" aspect. One thing that I would love to also tackle is the "terms" aspect of a search engine. Being able to run (streaming) map reduce jobs on terms, especially ones with term vector information, can provide a strong infrastructure for implementing algos like clustering and the like.

So, yes, it has crossed my mind :), and it is on the roadmap.

-shay.banon
On Tuesday, March 15, 2011 at 10:12 PM, eks dev wrote:
script is a cute little feature, but unfortunately does not do the

job.

External solution is exactly what i am trying to avoid by using ES.
What I really like with ES is this dual-thinking, data is kept safely
local to shards.

Just compare a pain in setting and maintaining setup,

external NoSQl store, like hbase/cassandra...

solr cloud/homo grown something/
and keeping all these in sync against ES on steroids

Or you doble the storage (once in lucene and once in DB) or you hope
your store offers magic performance on multiget 3k hits.... you need
to roll your own servers to do heavy lifting for processing.... etc.

It is doable of course, but having just a couple of features more, one
could do it right with ES. But this is also a question for Shai if he
wants to push ES into this direction

map() is implicit by shards

reduce() should be provided by user as a library implementing
predefined API
combine() would be nice as well
some nice infrastructure to select query type (if and which reduce/
combine phase is to be invoked...), maintain plug-ins...

sounds doable

Eks_Dev · March 16, 2011, 9:39am

Thanks Shay,
I will have a look at the code and come back if I figure out something
useful.

re 3. Terms: good thinking, covering both dimensions {Terms X
Documents}!

Another nice to have in this area wouild be "collection mutator".
Example index sorting. We use extensive index sorting in order to keep
performance in check. This helps quite offten in case you have
(optionaly many) fields with low cardinality, like user rigts ZIP-
Code... after sorting, Filters get really small so you can do heavy
caching, also locality of reference improves significantly, you could
do early query termination with much better results (relevance) .
There is some work in LUCENE-2482 on that. We do today full
collection sorting and periodic index rebuild (ouch!)

Another example would be deduplication (aggregation) on content, for
example you get a lot of documents in incremental fashion with
different IDs and the same content, one could reduce number of
documents in that case just by keeping one document and reference to
IDs (I have seen collections with 30 - 40% dupes!). This one can be
done today incrementally with a lot of ping pong.

But this is, somehow similar to 1,2 you mentioned (large reduce task
on Query (e.g. MatchAll)), just instead of delivering to the client,
apply reduce results directly on index (hence "mutator") ...

eks

On 16 Mrz., 01:15, Shay Banon shay.ba...@elasticsearch.com wrote:

Heya,

This is certainly something that is on the roadmap. Let me explain some of the things that require thinking when implementing it:

Facets are already doing map reduce over the hits matching a query, so the infra for (smaller) results map reduce is already there. If you have a look at the code, you will see that facets are quite modularized, and its not that difficult to implement a new custom one, though, still not to the level of providing it as a public API. More over, when implemented, I would like to allow for Javascript / mvel / python / groovy scripts as pluggable map reduce implementations.

Another aspect is when running a map reduce job that can result in a large amount of data. Both transferred back to the reduce phase, and sent back to the client. This requires an implementation of a streaming functionality in elasticsearch to support that.

The above, 1 and 2, talk about having map reduce implemented on the "search" aspect. One thing that I would love to also tackle is the "terms" aspect of a search engine. Being able to run (streaming) map reduce jobs on terms, especially ones with term vector information, can provide a strong infrastructure for implementing algos like clustering and the like.

So, yes, it has crossed my mind :), and it is on the roadmap.

-shay.banonOn Tuesday, March 15, 2011 at 10:12 PM, eks dev wrote:

script is a cute little feature, but unfortunately does not do the

job.

External solution is exactly what i am trying to avoid by using ES.
What I really like with ES is this dual-thinking, data is kept safely
local to shards.

Just compare a pain in setting and maintaining setup,

external NoSQl store, like hbase/cassandra...

solr cloud/homo grown something/
and keeping all these in sync against ES on steroids

Or you doble the storage (once in lucene and once in DB) or you hope
your store offers magic performance on multiget 3k hits.... you need
to roll your own servers to do heavy lifting for processing.... etc.

It is doable of course, but having just a couple of features more, one
could do it right with ES. But this is also a question for Shai if he
wants to push ES into this direction

map() is implicit by shards

reduce() should be provided by user as a library implementing
predefined API
combine() would be nice as well
some nice infrastructure to select query type (if and which reduce/
combine phase is to be invoked...), maintain plug-ins...

sounds doable