Elasticsearch loses indexing performance after changing path.data when running with SparkSQL

Hi everyone,
I'm using Elasticsearch on a 4-node cluster with SparkSQL. I'm trying to copy tables from the Hive metastore to Elasticsearch using saveToEs. Everything works well for tables of around 13 GB, but if I try to index a huge table of 60 GB or more, the process crashes.

The problem is that many executors have to write to the single folder /var/lib/elasticsearch, and something goes wrong. Each node of the cluster has 10 hard disks, so to improve parallelism I added more folders to path.data, one per disk, so in the end I had

`path.data=/data1/elasticsearch,.......,/data10/elasticsearch`  

I restarted Elasticsearch and everything was fine. I checked the new configuration with

`curl  http://namenode:9200/_nodes/settings?pretty` 
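For reference, a minimal Python sketch of how the output of that call could be checked per node. It assumes the response has the shape nodes → node id → settings, with path.data either nested (`path` → `data`) or flat (`path.data`), and as either a list or a comma-separated string. The sample response, node ids, and node names are made up for illustration.

```python
def data_paths_per_node(response):
    """Map each node name to the list of path.data entries it reports."""
    result = {}
    for node_id, node in response.get("nodes", {}).items():
        settings = node.get("settings", {})
        # Accept the nested form {"path": {"data": ...}} or the flat form {"path.data": ...}.
        value = settings.get("path", {}).get("data", settings.get("path.data"))
        if value is None:
            paths = []  # nothing configured: the node uses its default data path
        elif isinstance(value, str):
            paths = [p.strip() for p in value.split(",") if p.strip()]
        else:
            paths = list(value)
        result[node.get("name", node_id)] = paths
    return result


# Hypothetical sample standing in for the parsed JSON response (not real output).
sample = {
    "nodes": {
        "id1": {"name": "node-a", "settings": {"path": {"data": ["/data1/elasticsearch", "/data2/elasticsearch"]}}},
        "id2": {"name": "node-b", "settings": {"path.data": "/data1/elasticsearch,/data2/elasticsearch"}},
        "id3": {"name": "node-c", "settings": {}},
    }
}

for name, paths in sorted(data_paths_per_node(sample).items()):
    print(name, len(paths), paths)
```

A node that reports fewer paths than expected would mean the new setting was not picked up on that node after the restart.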

The problem is that:

  • with the default path.data, the indexing speed is 1 GB per minute
  • with the new configuration, the speed is 1 GB per 2 to 2.5 minutes.

So with the second configuration I can index a huge table (because the process doesn't crash), but Elasticsearch is very slow. With the default configuration Elasticsearch is very fast, but I cannot index tables bigger than 15 GB. Is there something I should configure?
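To put the two rates in perspective, here is a rough back-of-the-envelope sketch in Python of what they would mean for the 60 GB table. It assumes the measured rates stay constant for the whole load, which is only an assumption; the figure for the default configuration is hypothetical, since that run does not complete at this size.

```python
def minutes_to_index(table_gb, minutes_per_gb):
    """Estimated load time, assuming a constant indexing rate."""
    return table_gb * minutes_per_gb


TABLE_GB = 60  # size of the large table mentioned above

# Default path.data: 1 GB per minute (hypothetical at this size, since the job crashes).
default_minutes = minutes_to_index(TABLE_GB, 1.0)

# Multi-disk path.data: 1 GB per 2 to 2.5 minutes.
multi_low = minutes_to_index(TABLE_GB, 2.0)
multi_high = minutes_to_index(TABLE_GB, 2.5)

print(f"default path.data:    about {default_minutes:.0f} minutes")
print(f"multi-disk path.data: about {multi_low:.0f} to {multi_high:.0f} minutes")
```

Under that assumption, the same table would take about one hour at the default rate versus two to two and a half hours with the new configuration.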

I installed Elasticsearch 2.3.2, and I'm using the same version for the Maven dependency of the Scala driver that I use to launch Spark.

Thanks

Hello!

Since path.data is a property of Elasticsearch itself, this topic is probably better suited for the regular Elasticsearch forum. Unless you are saying that it's only impacted when indexing data through SparkSQL?

Actually, I have only tried working with Elasticsearch through Spark or Hive. Do you think I should move my post?

Considering the only thing changed between your runs is the Elasticsearch path.data property (which is not a Spark or Hive property), I think you'll get faster and better feedback from the regular Elasticsearch forum. I can see if I can move the topic for you.