Map-Reduce in Elasticsearch (was: Use case)

Jörg,

that is indeed creative thinking. Map-Reduce in Elasticsearch is an
idea that has always intrigued me, though I am a bit skeptical about
it. Not because I doubt the ES developers could implement something
like that, but because I am not sure it would be the right thing to do
(despite the fact that users would love to hear ES has this feature).

One of the problems I see is that it might be hard to define what a
"simple" map-reduce job is and how it differs from a "normal"
map-reduce job.

If I understand your idea of a modified bulk indexing task correctly,
you are in effect saying that for a single MR job the cluster would
relocate (or re-send) all the data within the cluster based on a
user-defined ad-hoc key function (which can of course differ from the
routing field value used for indexing), and that it would magically
sort the data by key value. Wouldn't that alone be a killer for the
cluster? Not to mention that you might hit limits on the "reduce" side,
where a single node would not be able to hold all the values for a key,
or resource limits on the "map" side, because that side has to hold a
"point-in-time" view until all the reducers are finished (if a reducer
fails, you would want to resend the data).
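
To make my concern concrete, here is a rough sketch of how I read the
proposal, using the 0.90-era Python client (the index names and the key
function are made up) - every document gets scanned and re-sent over
the wire, routed by the ad-hoc key:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def adhoc_key(hit):
    # user-defined ad-hoc key, unrelated to the original routing field
    return hit["_source"].get("author", "unknown")

# scan the whole source index and re-send every document, routed by key
hits = helpers.scan(es, index="source-index",
                    query={"query": {"match_all": {}}})
helpers.bulk(es, (
    {"_index": "mr-scratch",      # hypothetical per-job scratch index
     "_type": "mapped",
     "_routing": adhoc_key(hit),  # same key value -> same shard
     "_source": hit["_source"]}
    for hit in hits
))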
What about the map output? Would you store it on disk or hold it in
JVM heap space only? What if the node holding the map output dies and
you need to resend the data? And what about things like monitoring,
scheduling, or debugging of running jobs?
On top of all this, isn't it recommended not to run other
resource-intensive processes on machines running ES, because Lucene/ES
is designed to make use of all the processing power available on the
machine? And what is an MR job in ES if not another expensive process?
Sure, you can assign roles to individual cluster nodes (some used for
search, some only for MR jobs, etc.) or you can throttle MR jobs (but
this would definitely have trade-offs).

As I said, I do believe it might be technically possible to implement
all this, but I am afraid the amount of coding and added complexity
(even for ~"simple" MR jobs) in the ES code base might not be worth it
(i.e., why spend an effort comparable to a whole new Hadoop
implementation to get something less capable than Hadoop?).

Anyway, I would love to hear ideas/opinions from others on this topic.

Regards,
Lukáš

On Thu, Aug 1, 2013 at 9:56 PM, joergprante@gmail.com
<joergprante@gmail.com> wrote:

Lukáš,

aggregations are one piece of certain map-reduce-like algorithms; they could be accompanied by a modified bulk indexing action for presorting data, so the documents arrive in place at the shards and can be processed effectively by the aggregation framework.
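
A minimal sketch of the presorting idea with the Python client (the
index and field names are only for illustration): the routing key is
derived from a document value at bulk time, so all records sharing that
value land on the same shard.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

records = [
    {"title": "Cataloging rules", "subject": "library science"},
    {"title": "MARC in practice", "subject": "library science"},
    {"title": "Query parsing", "subject": "information retrieval"},
]

helpers.bulk(es, [
    {"_index": "bib",
     "_type": "record",
     "_routing": rec["subject"],  # presort: co-locate records by subject
     "_source": rec}
    for rec in records
])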

I agree that ES will never be Hadoop, but I see no reason why the ES distributed architecture should not be extensible to run some sort of simple map-reduce analysis on indexed documents - for example, creating ordered lists of all the values of all fields, for statistics. In bibliographic data, for instance, librarians tend to ask how many occurrences of which values appear in which fields, and which fields are rarely used.
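
Such questions can already be approximated with a terms facet (0.90
syntax; "bib" and "subject" are again illustrative names), and the
coming aggregation framework should make this more flexible:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.search(index="bib", body={
    "query": {"match_all": {}},
    "size": 0,  # we only want the counts, not the hits
    "facets": {
        "subject_values": {"terms": {"field": "subject", "size": 100}}
    },
})
for entry in result["facets"]["subject_values"]["terms"]:
    print("%s: %d" % (entry["term"], entry["count"]))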

Jörg


I hope I can clarify my raw thoughts a little bit.

Imagine a data warehouse scenario. A "simple" task would be loading
all the data and counting frequencies of values: which values appear in
which fields, how they correlate, etc.

A non-MR approach would be to just index the data and fire off a
sequence of queries until the desired result is found.

For statistical counters/metrics, this could mean many queries - a
tedious task.

Instead, a modified bulk indexer could precompute statistical
doc-specific counters/metrics (or other attributes analyzed by scripts)
for each doc and put them into an extra doc (the "map phase").
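
A rough client-side sketch of what such a map phase could produce (all
names are made up, and a real implementation would live inside the
modified bulk action rather than in a client):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

docs = [
    {"title": "Cataloging rules", "subject": "library science"},
    {"title": "Query parsing"},
]

def map_doc(doc):
    # doc-specific counters/metrics; scripts could add other attributes
    return {"fields_used": sorted(doc.keys()),
            "field_count": len(doc),
            "subject": doc.get("subject")}

def actions():
    for doc in docs:
        yield {"_index": "bib", "_type": "record", "_source": doc}
        # the extra doc emitted by the "map phase"
        yield {"_index": "bib-stats", "_type": "mapped",
               "_source": map_doc(doc)}

helpers.bulk(es, actions())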

Then, ES could route these extra docs to nodes that have been assigned
by configuration for special computational tasks (the "shuffle phase").
The routing may also be derived from the values in a doc. These nodes
would execute "shuffle scripts": they could sort data quantities,
create data subsets, reorder or analyze them, whatever. Besides "data"
and "client" nodes, a new node type "comp" could appear. How these
"comp" nodes can be addressed by scripts would have to be specified. At
the end of the computation, the "comp" nodes would hold documents just
like "data" nodes already do, ready to be queried. So "comp" nodes are
like "data" nodes, but with a special thread pool for executing scripts
in batch mode. The "comp" node status could be queried by API.
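
The "comp" node type does not exist, but shard allocation filtering
could approximate it today: tag the designated nodes in
elasticsearch.yml (the attribute name "box_type" is arbitrary) and pin
the extra-doc index to them.

from elasticsearch import Elasticsearch

# assumes the designated nodes carry "node.box_type: comp" in
# elasticsearch.yml
es = Elasticsearch()
es.indices.create(index="bib-stats", body={
    "settings": {
        # place all shards of the stats index on the tagged nodes
        "index.routing.allocation.include.box_type": "comp"
    }
})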

Then, clients could formulate queries that match the schema of the
previously defined statistical counters/metrics (the "reduce phase").
These queries could simply show or aggregate results from the "comp"
nodes.
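
For example, a reduce-style query over the precomputed extra docs could
be as simple as a statistical facet on one of the counters (names again
illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.search(index="bib-stats", body={
    "query": {"match_all": {}},
    "size": 0,
    "facets": {
        "field_count_stats": {"statistical": {"field": "field_count"}}
    },
})
print(result["facets"]["field_count_stats"])  # count, min, max, mean, ...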

The MR is "simple" in the sense that the existing ES action framework
and scripting are re-used to simulate MR-like behaviour, including the
future aggregation framework. No additional MR toolset, no additional
MR exception handling, etc. - just re-using the common ES features as
much as possible.

Jörg

On Tue, Aug 13, 2013 at 3:04 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.