Best practices for integrating ES-Hadoop with CDH5

Hi All,

I have a small CDH5 cluster - 1 manager, 2 name nodes, and 4 data nodes, all standalone servers. What I'd like to do is add ES-Hadoop, Kibana and Fluentd to the cluster. Should ES-Hadoop be installed on all nodes, or is it sufficient to install it on a single node?


It depends on how beefy your hardware is and what your requirements are. ES is quite flexible and can support different topologies:

  • you can install it on separate hardware (aka HW); the advantage here is that it does not interfere with the Hadoop HW and (potentially) has dedicated HW itself (depending on how you configure things). The downside is that data has to go across the network; depending on your setup this may or may not be an issue (1 Gbps is quite potent and often even faster than a traditional HDD).
  • you can install it next to the CDH nodes. You gain data locality; however, the HW is shared by both CDH and ES. Both rely on memory, so you need plenty of it; both are also IO dependent, so you likely want separate partitions for each of them.
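For the colocated scenario, a minimal sketch of the relevant `elasticsearch.yml` settings (the paths are hypothetical - point them at whatever dedicated partition you set aside, away from the HDFS data directories):

```yaml
# Keep ES data on its own partition/disk so it does not contend
# with the HDFS data directories for IO (hypothetical path).
path.data: /mnt/es-data
path.logs: /var/log/elasticsearch

# Lock the JVM heap into RAM so ES is not swapped out while
# competing with the Hadoop daemons for memory (ES 1.x setting name).
bootstrap.mlockall: true
```

Pair `mlockall` with a fixed heap size (e.g. via `ES_HEAP_SIZE`) so ES and the CDH services each get a predictable slice of memory.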

You can also use a mixed scenario - some ES nodes sitting next to CDH, some on different HW. All scenarios are valid; the difference is in performance. The more dedicated HW ES or Hadoop uses, the better each performs; shared HW is cheaper, but IO-wise things might not work as well (simply because there will be some contention, and things like the page cache will be thrashed faster).
Depending on your scenario this may or may not matter - it depends on too many factors to say in general, and doing some basic benchmarks will clarify the situation.

Considering you are starting with a small cluster, I would allocate one node with enough memory and take it from there; you can easily start another one if need be (and ES will automatically take advantage of it with the default settings - namely 5 shards per index).
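To illustrate what the defaults mean, here is a request sketch that makes them explicit when creating an index (the index name `logs` is just an example; run it against your own node):

```shell
# Create an index with 5 primary shards and 1 replica - the same as
# the default settings, just spelled out. On a single node the replica
# shards stay unassigned; as soon as you start a second node, ES
# allocates them there automatically, spreading the load.
curl -XPUT 'http://localhost:9200/logs' -d '{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'
```

Note that `number_of_shards` is fixed at index creation time, while `number_of_replicas` can be changed later, which is why starting small and growing the cluster works smoothly.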