Jörg,
that is indeed creative thinking. Map-Reduce in Elasticsearch is an
idea that has always intrigued me, though I am a bit skeptical about
it. Not because I doubt that the ES developers would be able to
implement something like that, but because I am not sure it would be
the right thing to do (despite the fact that users would love to hear
ES has this feature).
One of the problems I see is that it might be hard to define what a
"simple" map-reduce job is and how it differs from a "normal"
map-reduce job.
If I understand your idea with the modified bulk indexing task
correctly, then in effect you are saying that for a single MR job the
cluster would have to relocate (or re-send) all the data within the
cluster based on a user-defined ad-hoc key function (which can of
course differ from the routing field value used for indexing), and it
would magically sort it by the key value. Wouldn't just that be a
killer for the cluster? Not to mention that you might hit limits on
the "reduce" side, where a single node would not be able to hold all
the values for a key, or resource limits on the "map" side, because
that side has to hold a "point-in-time" view until all the reducers
are finished (because if a reducer fails, you would want to resend the
data again).
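Just to make concrete what such a shuffle implies, here is a minimal single-process sketch (plain Python, not actual ES code) of the map / re-partition-by-key / sort / reduce cycle that the modified bulk indexing idea would have to reproduce across cluster nodes; the function names are of course made up for illustration:

```python
from collections import defaultdict
from itertools import groupby

def map_reduce(records, map_fn, reduce_fn, num_partitions=2):
    """Toy shuffle: mapped pairs are re-partitioned by an ad-hoc key
    function, sorted by key, then grouped for the reducers. In ES this
    re-partitioning would mean re-sending data between nodes."""
    # "Map" phase: each record may emit any number of (key, value) pairs.
    partitions = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            # The key decides which "reducer node" receives the pair.
            partitions[hash(key) % num_partitions].append((key, value))
    # "Reduce" phase: sort each partition by key, then group and reduce.
    results = {}
    for part in partitions.values():
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            results[key] = reduce_fn(key, [v for _, v in group])
    return results

# Example: word count over two tiny "documents".
docs = ["es map reduce", "map reduce map"]
counts = map_reduce(docs,
                    map_fn=lambda d: [(w, 1) for w in d.split()],
                    reduce_fn=lambda k, vs: sum(vs))
```

Even in this toy form you can see the two pain points above: every partition must buffer all values for its keys, and the mapper side must keep its output around until the reducers have consumed it.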
How about the map output? Would you store it to disk or hold it in JVM
heap space only? What if the node holding the map output dies and you
need to resend the data?
How about things like monitoring, scheduling, or debugging of running
jobs?
On top of all this, isn't it recommended not to run other expensive
processes on machines running ES, because Lucene/ES is designed to
make use of all the processing power available on the machine? And
what else is an MR job in ES than another expensive process? Sure, you
can assign roles to individual cluster nodes (some used for search,
some only for MR jobs, etc.) or you can throttle MR jobs (but this
would definitely have tradeoffs).
As I said, I do believe it might be technically possible to implement
all this, but I am afraid the amount of coding and added complexity
(even for ~"simple" MR jobs) in the ES code base might not be worth
it. (I.e., why spend an effort comparable to a new Hadoop
implementation to get something that is less valuable than Hadoop?)
Anyway, I would love to hear ideas/opinions from others on this topic.
Regards,
Lukáš
On Thu, Aug 1, 2013 at 9:56 PM, joergprante@gmail.com wrote:
Lukáš,
aggregations are one piece for certain map-reduce-like algorithms; they could be accompanied by a modified bulk indexing action for presorting data, so that the data arrive in-place at the shards and can be processed effectively by the aggregation framework.
I agree that ES will never be Hadoop, but I see no reason why the ES distributed architecture should not be extensible to run some sort of simple map-reduce analysis on indexed documents. For example, creating ordered lists of all the values of all fields, for statistics: in bibliographic data, librarians tend to ask how many occurrences of values are in which fields, and which fields are used less.
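The librarian-style questions above can be sketched outside ES as well. A toy example in plain Python (iterating over documents client-side, not an actual ES aggregation; the field names are made up):

```python
from collections import Counter, defaultdict

def field_value_stats(docs):
    """For each field, count how often each value occurs, and count
    overall field usage -- i.e. which fields are used less."""
    value_counts = defaultdict(Counter)  # field -> value -> occurrences
    field_usage = Counter()              # field -> number of docs using it
    for doc in docs:
        for field, value in doc.items():
            value_counts[field][value] += 1
            field_usage[field] += 1
    return value_counts, field_usage

# Hypothetical bibliographic records.
docs = [
    {"title": "Moby Dick", "lang": "en"},
    {"title": "Faust", "lang": "de"},
    {"lang": "en"},  # a record missing the "title" field
]
values, usage = field_value_stats(docs)
# "lang" is used in all 3 records, "title" only in 2 -- "title" is the
# less-used field, and "en" is the most frequent "lang" value.
```

The point of the aggregation framework would be to push exactly this kind of counting down to the shards instead of pulling every document to the client.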
Jörg
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.