Hi everyone,
I'm using Elasticsearch on a 4-node cluster with SparkSQL. I'm trying to copy tables from the Hive metastore to Elasticsearch using saveToEs. Everything works fine with tables of around 13 GB, but when I try to index a bigger table, 60 GB or more, something goes wrong.
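For context, the job is essentially the standard elasticsearch-spark write path; here is a minimal sketch of what I run (host, table, and index names below are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrame

object HiveToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hive-to-es")
      .set("es.nodes", "namenode")   // placeholder host
      .set("es.port", "9200")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    // Read the table through the Hive metastore and bulk-index it.
    val df = hiveContext.table("mydb.mytable")   // placeholder table
    df.saveToEs("myindex/mytype")                // placeholder index/type
  }
}
```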
The problem seemed to be that many executors were all writing to the single data folder /var/lib/elasticsearch. Each node of the cluster has 10 hard disks, so to improve parallelism I added more folders to path.data, one per disk, ending up with
`path.data=/data1/elasticsearch,.......,/data10/elasticsearch`
I restarted Elasticsearch and everything came up fine. I checked the new configuration with
`curl http://namenode:9200/_nodes/settings?pretty`
The problem is the indexing speed:
- with the default path.data, indexing runs at about 1 GB per minute
- with the new configuration, it drops to 1 GB every 2-2.5 minutes.
So with the second configuration I can index the huge tables (the process no longer crashes), but Elasticsearch is very slow; with the default configuration Elasticsearch is fast, but I cannot index tables bigger than 15 GB. Is there something I should configure?
I installed Elasticsearch 2.3.2 and I'm using the same version of the Maven dependency for the Scala connector in my Spark job.
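For reference, the dependency in my build looks like this (I'm assuming the Scala 2.10 artifact here; adjust the suffix to your Scala version):

```scala
// build.sbt -- elasticsearch-spark connector matching the cluster version (2.3.2)
// the _2.10 suffix is an assumption; use the one matching your Scala build
libraryDependencies += "org.elasticsearch" % "elasticsearch-spark_2.10" % "2.3.2"
```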
Thanks