Is Hadoop the right solution for distributed computation with ES?

nicom · April 2, 2019, 11:22am

I have a elasticsearch cluster (300Go) with 2 nodes (I know.. 3 is min).

We have a java program that use the java Rest API to fetch all the data from ES cluster and execute custom clustering algorithms.

To avoid network bottleneck between ES cluster and the program we currently execute the program locally into one of the Elasticsearch node, which is super fast.
furthermore we do it in parallel: by cutting the data in N distinct partitions, and fetching + clustering the data in parallel.
(NOTE: clustering is not trivial as it seems. a priori the goal is not to change our custom clustering method here)

This solution is simple and works well for real time processing. However when we need to reprocess all the data (when we improved the clustering algorithm for instance) we have the following issue:

this ES node is suffering (90% CPU)
we have some rare case of Out of Memory, but still we need to fix that.
this approach is not scalable/distributed to fully use the distribution of computation though the partitions.

I am a newbie in Hadoop and wonder whether using Hadoop would solve such pb. my main concern is

how to sync data between ES cluster and the distributed processing platform /how to avoid 400 Go download.
how to limit/avoid big change to our clustering java program.

Thanks !

system · April 30, 2019, 11:22am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Hadoop / Elasticsearch functionality Elasticsearch es-hadoop	20	3226	July 6, 2017
How to index data when Hadoop and ES are in different clusters Elasticsearch es-hadoop	3	1404	July 6, 2017
ES with Hadoop Elasticsearch	1	321	July 6, 2017
ES-Hadoop peer-to-peer architecture Elasticsearch es-hadoop	6	1465	July 6, 2017
How is Hadoop and ES typically used? Elasticsearch es-hadoop	8	1713	July 6, 2017

Is Hadoop the right solution for distributed computation with ES?

Related Topics