Hadoop ES Integration Options & Performance

(Madhusudhana Rao Podila) #1


I am new to ES.
I am evaluating Elasticsearch and Kibana for big data analytics. I have also gone through the es-hadoop connector, and I want to push data from Hadoop to ES and visualize it through Kibana.
I see the following options:

  1. Write directly into an ES index through Hive (with ES as the storage)
  2. Query from ES through Hive using JDBC/ODBC and write into ES

What would be the suggested approach? How much data can be ingested into ES from Hadoop (are there any limitations)? Say I want to ingest/write 20 GB of data: does the ES server need to have that amount of memory?

(Costin Leau) #2
  1. You can move data from Hadoop to ES using any of the supported libraries: Hive, Pig, Map/Reduce, Cascading, Spark or Storm.
    While Hive does have an appeal due to its SQL-like capabilities, note that all the other libraries support the notion of a schema as well. Furthermore, that is not really needed when only doing ingestion or basic validation.
    Also, if you are looking for performance, you might want to look around, as there might be better candidates for this. In particular, Spark and Spark SQL are getting a lot of traction due to their simple deployment, ease of use and speed.
    I'm not saying Hive is not a good choice, but rather that you have plenty of options. And that's a good thing.
  2. See 1. You can use any of the libraries above, and if SQL is your thing, you can also use Spark SQL. Note that if you want to visualize data, you can do so directly in ES through Kibana or other tools; in other words, you are not forced to do the querying from Hadoop. Once the data is in ES, any tool/library/client that works with ES can be used.
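
For illustration, the Spark SQL route could look roughly like the PySpark sketch below. It assumes the es-hadoop connector jar is on the Spark classpath. The host, port and index name `hadoop-logs` are placeholders, and the helper names `es_write_options` and `write_to_es` are made up for this example.

```python
def es_write_options(nodes="localhost", port=9200, resource="hadoop-logs"):
    """Build es-hadoop connector settings as a plain dict of strings.

    The default host, port and index name are placeholders, not recommendations.
    """
    return {
        "es.nodes": nodes,        # ES host(s) to connect to
        "es.port": str(port),     # ES HTTP port
        "es.resource": resource,  # target index to write into
    }


def write_to_es(df, options):
    """Append a Spark DataFrame to ES through the es-hadoop Spark SQL data source."""
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**options)
       .mode("append")
       .save())
```

With a DataFrame `df` already loaded from Hadoop, the call would be `write_to_es(df, es_write_options())`. Depending on the ES version, `es.resource` may need to be given as `index/type` rather than just the index name.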

There aren't any limitations outside hardware. There are too many posts, blog posts, sessions and presentations on the subject, all easily accessible through a simple search, for me to try to cover it all.
The more memory you allocate to ES, the better it will perform, in particular at query time. 20 GB is not a lot of data, so you should be able to work fine with the defaults.

P.S. ES doesn't have to load all the data in memory; otherwise it would not scale. So no, it does not require 20 GB of memory. However, if you do have that RAM available for it, it will happily accept it.
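
The same holds while ingesting: documents reach ES in batches through the bulk API, not as one 20 GB payload. A minimal, stdlib-only Python sketch of that idea follows. The function name `bulk_bodies`, the index name `hadoop-logs` and the batch size of 1000 are illustrative assumptions, and this is not how es-hadoop is implemented internally.

```python
import json
from itertools import islice


def bulk_bodies(docs, index="hadoop-logs", batch_size=1000):
    """Yield newline-delimited bulk request bodies, one per batch of documents.

    Only one batch is held in memory at a time, so `docs` can be a lazy
    iterator of any length.
    """
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        lines = []
        for doc in batch:
            lines.append(json.dumps({"index": {"_index": index}}))  # action line
            lines.append(json.dumps(doc))                           # document source line
        yield "\n".join(lines) + "\n"  # a bulk body must end with a newline
```

Each yielded string could then be sent as the body of a `POST /_bulk` request with the `application/x-ndjson` content type, so memory use on the sending side is bounded by the batch size and not by the total data volume.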
