Design advice for ES side-by-side with hadoop cluster?

Hi Guys,

I have a 16-node Hadoop cluster running Cloudera's community edition. All
16 nodes are big, powerful boxes with lots of disk.

I would like to add ES to this cluster, but would like advice on how to
configure/design the ES cluster.

I bulk load my data using Pig, which means MapReduce. What are your
thoughts on reducers versus ES master nodes? Should I restrict my reducers
to match the ES master nodes?

Any thoughts or advice? At the moment my standard MR parameters kill the ES
nodes.

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/2d541d83-5a79-4bc1-b5da-11065b9b568a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

On 9/1/14 4:51 PM, bob.webman@gmail.com wrote:

> Hi Guys,
>
> I have a 16-node Hadoop cluster running Cloudera's community edition. All 16 nodes are big, powerful boxes with lots of
> disk.

Can you provide some actual numbers? How much RAM is there per machine: how much is allocated to Hadoop, how much to ES,
and how much is actually free (and thus usable by the OS)?
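To make the question concrete, here is a minimal sketch of the per-node arithmetic. The function name and the figures in the comment are hypothetical placeholders, not numbers from this thread.

```python
def free_ram_gb(total_gb, hadoop_gb, es_heap_gb):
    """Return the RAM left to the OS (and its file-system cache) on one node.

    All three inputs are per-node figures in GB. hadoop_gb is the sum of
    whatever the Hadoop daemons and task slots are allowed to use.
    """
    free = total_gb - hadoop_gb - es_heap_gb
    if free < 0:
        raise ValueError("Hadoop and ES allocations exceed physical RAM")
    return free


# Hypothetical node: 64 GB RAM, 32 GB for Hadoop, 16 GB ES heap.
remaining = free_ram_gb(64, 32, 16)
```

Whatever this returns is what the file-system cache has to live on, so a small or negative result is a warning sign before any tuning starts.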

> I would like to add ES to this cluster, but would like advice on how to configure/design the ES cluster.

There's a lot of useful information out there. I'll point to two great webinars, namely the pre-flight checklist [1]
and getting started with Elasticsearch [2].

Especially in I/O-intensive environments, make sure the OS has enough RAM and that the file-system cache is not thrashed,
since it has a big impact (not just on ES but on everything that accesses the disk).

> I bulk load my data using Pig, which means MapReduce. What are your thoughts on reducers versus ES master nodes? Should
> I restrict my reducers to match the ES master nodes?

Are you using Elasticsearch Hadoop (es-hadoop)? I'm asking since it's not the master nodes that matter but rather the
data nodes; es-hadoop automatically writes only to those nodes. Depending on how big your bulk size is and the number of
reducers versus your cluster size, you might be forced to limit the number of tasks to avoid overloading the ES cluster.
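As a rough illustration of the bulk size versus task count trade-off: in es-hadoop the batch size (the `es.batch.size.entries` setting) applies per task instance, so the worst case reaching the cluster is that value multiplied by the number of tasks running at once. The sketch below is only that arithmetic. The per-node budget is a hypothetical figure you would have to find by measurement, and the default of 1000 entries is my recollection of the es-hadoop default, so check it against your version.

```python
def docs_in_flight(concurrent_tasks, batch_entries=1000):
    """Worst-case number of documents sent to ES at the same moment."""
    return concurrent_tasks * batch_entries


def max_concurrent_tasks(data_nodes, docs_per_node_budget, batch_entries=1000):
    """Largest task count whose worst case stays within the cluster budget.

    docs_per_node_budget is an assumed, measured limit on how many
    in-flight documents one data node absorbs comfortably.
    """
    cluster_budget = data_nodes * docs_per_node_budget
    return max(1, cluster_budget // batch_entries)


# Hypothetical: 16 data nodes, each comfortable with 5000 in-flight documents.
cap = max_concurrent_tasks(16, 5000)
```

In Pig the number of reducers can be capped with `SET default_parallel` or a `PARALLEL` clause on the relevant operator. The alternative is to lower the batch size and keep the task count.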

> Any thoughts or advice? At the moment my standard MR parameters kill the ES nodes.

See above - how are you writing the data to ES? How many shards do you have in the target index and what's the number of
reducers writing to it at a certain point?
Marvel, by the way (or any monitoring tool), helps a lot here since it eliminates guesswork and actually offers insight
into what's going on.

From the looks of it, it sounds like you are throwing too much data at once at the ES cluster and not retrying or
adjusting the bulk size.
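For a hand-rolled writer, the retrying part could look roughly like the sketch below. `send_bulk` is a hypothetical callable standing in for whatever client code issues the bulk request; it is assumed to return the documents the cluster rejected. es-hadoop has its own retry settings (`es.batch.write.retry.count` and `es.batch.write.retry.wait`), so this only matters if you are not using it.

```python
import time


def send_with_backoff(send_bulk, docs, max_retries=3, base_wait=1.0,
                      sleep=time.sleep):
    """Send docs, resending only the rejected ones after a growing pause.

    send_bulk(docs) is a hypothetical function that returns the list of
    documents the cluster rejected (an empty list means full success).
    Returns whatever is still rejected once the retries are used up.
    """
    pending = list(docs)
    for attempt in range(max_retries + 1):
        pending = send_bulk(pending)
        if not pending:
            return []
        if attempt < max_retries:
            # Exponential backoff: base_wait, 2x, 4x, ...
            sleep(base_wait * 2 ** attempt)
    return pending
```

Backing off gives the cluster time to drain its queues instead of immediately resubmitting the same load. If documents are still rejected after the retries, that is a signal to shrink the bulk size or reduce the number of concurrent tasks.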

Otherwise, consider using es-hadoop. Run a job, take a look at the metrics [3] and tune it accordingly. See also the
troubleshooting page [4].

One last thing - make sure you use the latest ES - it has a lot of improvements.

[1] Elasticsearch pre-flight checklist (webinar)
[2] Getting started with Elasticsearch (webinar)
[3] es-hadoop metrics documentation
[4] es-hadoop troubleshooting page

> Thanks

--
Costin
