Thanks Shay,
I will have a look at the code and come back if I figure out something
useful.
Re 3. Terms: good thinking, covering both dimensions {Terms x
Documents}!
Another nice-to-have in this area would be a "collection mutator".
An example is index sorting: we use index sorting extensively in order
to keep performance in check. This helps quite often when you have
(optionally many) fields with low cardinality, like user rights or ZIP
code... after sorting, filters get really small so you can do heavy
caching, locality of reference improves significantly, and you could
do early query termination with much better results (relevance).
There is some work on that in LUCENE-2482. Today we do full
collection sorting and periodic index rebuilds (ouch!)
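To illustrate why sorting shrinks the filters, here is a minimal plain-Python sketch (not a Lucene/ES API): once documents are sorted by a low-cardinality field, all documents matching a filter on that field form one contiguous doc-id range, which can be cached as two integers instead of a bitset over the whole index.

```python
def filter_as_ranges(sorted_values, wanted):
    """Return (start, end) doc-id ranges whose field value == wanted."""
    ranges = []
    start = None
    for doc_id, value in enumerate(sorted_values):
        if value == wanted and start is None:
            start = doc_id                      # range opens
        elif value != wanted and start is not None:
            ranges.append((start, doc_id))      # range closes
            start = None
    if start is not None:
        ranges.append((start, len(sorted_values)))
    return ranges

# Unsorted, matches for "DE" are scattered across doc ids 0, 2, 4.
unsorted = ["DE", "US", "DE", "FR", "DE", "US"]
# After sorting the index by the field, the same matches collapse
# into a single compact range.
print(filter_as_ranges(sorted(unsorted), "DE"))  # [(0, 3)]
```

The same contiguity is what makes early termination effective: once the scan leaves the range, no further hits for that filter can appear.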
Another example would be deduplication (aggregation) on content: for
example, you receive a lot of documents in incremental fashion with
different IDs and the same content. One could reduce the number of
documents in that case by keeping just one document and a reference
to the IDs (I have seen collections with 30-40% dupes!). This can be
done today incrementally, but with a lot of ping-pong.
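A minimal sketch of that dedup idea, using a content hash to keep one stored document per distinct content and aggregate the remaining IDs onto it (illustrative names only, not an ES API):

```python
import hashlib

def dedupe(docs):
    """docs: iterable of (doc_id, content).
    Returns {content_hash: (content, [doc_ids sharing it])}."""
    seen = {}
    for doc_id, content in docs:
        key = hashlib.sha1(content.encode("utf-8")).hexdigest()
        if key in seen:
            seen[key][1].append(doc_id)      # duplicate: record the ID only
        else:
            seen[key] = (content, [doc_id])  # first occurrence: keep the doc
    return seen

docs = [("a1", "hello"), ("a2", "world"), ("a3", "hello")]
result = dedupe(docs)
# 3 incoming documents collapse to 2 stored ones;
# "hello" carries the ID list ["a1", "a3"]
```

Run as a reduce over the whole collection, this is exactly the "apply results back to the index" pattern rather than shipping all duplicates to a client.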
But this is somehow similar to points 1 and 2 you mentioned (a large
reduce task on a query, e.g. MatchAll); just instead of delivering the
results to the client, the reduce results are applied directly to the
index (hence "mutator")...
eks
On Mar 16, 01:15, Shay Banon shay.ba...@elasticsearch.com wrote:
Heya,
This is certainly something that is on the roadmap. Let me explain some of the things that require thinking when implementing it:
1. Facets are already doing map reduce over the hits matching a query, so the infrastructure for (smaller) map reduce results is already there. If you have a look at the code, you will see that facets are quite modularized, and it's not that difficult to implement a new custom one, though still not to the level of providing it as a public API. Moreover, when implemented, I would like to allow JavaScript / mvel / Python / Groovy scripts as pluggable map reduce implementations.
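That facet-style map reduce shape can be sketched in a few lines of plain Python (this is an illustration of the pattern, not the ES facet code): each shard maps its local hits to partial counts, and a single reduce merges the partials.

```python
from collections import Counter

def map_phase(shard_hits, field):
    """Runs on each shard: partial facet counts over local hits."""
    return Counter(hit[field] for hit in shard_hits)

def reduce_phase(partials):
    """Runs once: merge the per-shard partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

shard1 = [{"tag": "es"}, {"tag": "lucene"}]
shard2 = [{"tag": "es"}]
partials = [map_phase(s, "tag") for s in (shard1, shard2)]
print(reduce_phase(partials))  # Counter({'es': 2, 'lucene': 1})
```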
2. Another aspect is running a map reduce job that can produce a large amount of data, both transferred back to the reduce phase and sent back to the client. This requires implementing streaming functionality in elasticsearch to support it.
3. Points 1 and 2 above talk about having map reduce implemented on the "search" aspect. One thing that I would also love to tackle is the "terms" aspect of a search engine. Being able to run (streaming) map reduce jobs on terms, especially ones with term vector information, can provide a strong infrastructure for implementing algorithms like clustering and the like.
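A rough sketch of what such a terms-side job could look like, again as hedged plain Python rather than any existing ES API: each shard streams (term, doc_id) pairs out of its term vectors, and the reduce assembles term-to-document postings, the kind of input a clustering algorithm would consume.

```python
from collections import defaultdict

def map_terms(shard):
    """Runs per shard: stream (term, doc_id) pairs from term vectors."""
    for doc_id, term_vector in shard:
        for term in term_vector:
            yield term, doc_id

def reduce_terms(pairs):
    """Merge the streams into term -> set-of-docs postings."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return postings

shard_a = [(1, ["search", "index"]), (2, ["search"])]
shard_b = [(3, ["index", "cluster"])]
pairs = list(map_terms(shard_a)) + list(map_terms(shard_b))
postings = reduce_terms(pairs)
# postings["search"] == {1, 2} -- e.g. raw material for
# co-occurrence-based clustering
```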
So, yes, it has crossed my mind :), and it is on the roadmap.
-shay.banon
On Tuesday, March 15, 2011 at 10:12 PM, eks dev wrote:
The script feature is cute, but unfortunately it does not do the job.
An external solution is exactly what I am trying to avoid by using ES.
What I really like about ES is this dual thinking: data is kept safely
local to shards.
Just compare the pain of setting up and maintaining
- an external NoSQL store, like HBase/Cassandra...
- Solr Cloud / something home-grown
and keeping all of these in sync, against ES on steroids.
Either you double the storage (once in Lucene and once in the DB), or
you hope your store offers magic performance on a multi-get of 3k
hits... and you need to roll your own servers to do the heavy lifting
for the processing... etc.
It is doable of course, but with just a couple more features one
could do it right with ES. This is also a question for Shay, whether
he wants to push ES in this direction:
- map() is implicit by shards
- reduce() should be provided by the user as a library implementing a
predefined API
- combine() would be nice as well
Plus some nice infrastructure to select the query type (whether and
which reduce/combine phase is to be invoked...), maintain plug-ins...
Sounds doable.
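The plug-in shape sketched above could look something like this (purely hypothetical interface, none of these names exist in ES): map() is implicit per shard, the user implements reduce() against a predefined API, and an optional combine() shrinks data shard-side before it travels.

```python
class ReduceJob:
    """Hypothetical user-facing plug-in API."""
    def combine(self, shard_results):
        # Optional shard-side pre-aggregation; default is pass-through.
        return shard_results

    def reduce(self, combined):
        raise NotImplementedError

class SumJob(ReduceJob):
    def combine(self, shard_results):
        return [sum(shard_results)]   # shrink data before it leaves the shard

    def reduce(self, combined):
        return sum(combined)

def run(job, shards):
    # map() is implicit: each shard already holds its local results.
    # combine() runs shard-side, reduce() runs once on the coordinator.
    combined = [r for shard in shards for r in job.combine(shard)]
    return job.reduce(combined)

print(run(SumJob(), [[1, 2], [3, 4]]))  # 10
```

The combine() step is what keeps the transfer back to the reduce phase small, which ties into the streaming concern Shay raised in point 2.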