I have an Elasticsearch cluster (300 GB) with 2 nodes (I know, 3 is the recommended minimum).
We have a Java program that uses the Java REST API to fetch all the data from the ES cluster and run custom clustering algorithms.
- To avoid a network bottleneck between the ES cluster and the program, we currently run the program locally on one of the Elasticsearch nodes, which is very fast.
- Furthermore, we do it in parallel: we cut the data into N distinct partitions and fetch + cluster each partition concurrently.
- (NOTE: the clustering is less trivial than it seems; the goal here is not to change our custom clustering method.)
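Stripped down, the parallel partition-and-cluster step looks roughly like this (a minimal sketch: the ES fetch and the clustering are stubbed out with placeholders, since the real versions use the Java REST client and our custom algorithm):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

// Simplified sketch of the parallel partition-and-cluster step.
public class PartitionedClustering {

    static final int PARTITIONS = 4;

    // Placeholder for the real ES fetch: in the actual program this is a
    // scroll over one slice of the index via the Java REST client.
    static List<String> fetchPartition(int sliceId) {
        return IntStream.range(0, 10)
                .mapToObj(i -> "doc-" + sliceId + "-" + i)
                .collect(Collectors.toList());
    }

    // Placeholder for the custom clustering algorithm: here it just counts docs.
    static int clusterPartition(List<String> docs) {
        return docs.size();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(PARTITIONS);
        List<Future<Integer>> results = new ArrayList<>();
        for (int slice = 0; slice < PARTITIONS; slice++) {
            final int id = slice;
            // Each partition is fetched and clustered in its own thread.
            results.add(pool.submit(() -> clusterPartition(fetchPartition(id))));
        }
        int total = 0;
        for (Future<Integer> f : results) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("processed " + total + " docs across " + PARTITIONS + " partitions");
    }
}
```

The partitions are independent, which is why the approach parallelizes cleanly on one node; the problem is that all N threads land on the same machine.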
This solution is simple and works well for real-time processing. However, when we need to reprocess all the data (for instance after improving the clustering algorithm), we have the following issues:
- the ES node suffers (90% CPU);
- we have rare Out of Memory errors, which we still need to fix;
- the approach is not scalable/distributed, so it cannot fully exploit the distribution of computation across the partitions.
I am a newbie in Hadoop and wonder whether it would solve this kind of problem. My main concerns are:
- how to sync the data between the ES cluster and the distributed processing platform / how to avoid a 400 GB download;
- how to limit/avoid big changes to our clustering Java program.
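For context, from what I have read so far, the elasticsearch-hadoop connector exposes an index as Hadoop input splits (one split per shard), so mappers can be scheduled close to the data instead of downloading everything up front. If I understand correctly, the configuration would be along these lines (a sketch only; the node addresses and index/type are placeholders, and the property names are from the es-hadoop documentation):

```properties
# elasticsearch-hadoop connector settings (placeholder values)
es.nodes = es-node1:9200,es-node2:9200
# index/type to read as input splits (placeholder)
es.resource = my_index/my_type
# optionally push a filter down to ES instead of fetching everything
es.query = ?q=*
```

Is this the right direction, or is there a better way to keep the clustering code mostly unchanged?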