Elasticsearch loses performance changing path.data running on SparkSQL

jackProm · June 29, 2016, 4:46pm

Hi everyone,
I'm using Elasticsearch on a 4-node cluster with SparkSQL. I'm trying to copy tables from the Hive metastore to Elasticsearch using saveToEs. Everything works well with tables of around 13 GB, but if I try to index a huge table of 60 GB or more, something goes wrong.

The problem is that many executors have to write to the single folder /var/lib/elasticsearch, and something goes wrong. Each node of the cluster has 10 hard disks, so to improve parallelism I added more folders to path.data, one per disk, so that in the end I had

`path.data=/data1/elasticsearch,.......,/data10/elasticsearch`

I restarted Elasticsearch and everything came up fine. I checked the new configuration with

`curl 'http://namenode:9200/_nodes/settings?pretty'`

The problem is the indexing speed:

with the default path.data, the indexing speed is 1 GB per minute;
with the new configuration, the speed is 1 GB per 2 to 2.5 minutes.

So with the second configuration I can index a huge table (because the process doesn't crash), but Elasticsearch is very slow. With the default configuration Elasticsearch is very fast, but I cannot index tables bigger than 15 GB. Is there something I should configure?
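To put those rates in perspective, here is a rough back-of-the-envelope calculation in Python for a 60 GB table. It is only a sketch: it assumes the reported rates stay constant for the whole load, and the figure for the default configuration is hypothetical, since that run crashes on tables above roughly 15 GB. The function and variable names are illustrative.

```python
# Rates reported in the post, expressed as minutes needed per GB indexed.
DEFAULT_MIN_PER_GB = 1.0        # single default path.data folder
MULTI_MIN_PER_GB = (2.0, 2.5)   # ten path.data folders (reported range)


def minutes_to_index(size_gb, minutes_per_gb):
    """Estimate wall-clock minutes, assuming a constant indexing rate."""
    return size_gb * minutes_per_gb


table_gb = 60  # size of the large table mentioned in the post

# Hypothetical: the default configuration does not actually finish at this size.
baseline = minutes_to_index(table_gb, DEFAULT_MIN_PER_GB)
low = minutes_to_index(table_gb, MULTI_MIN_PER_GB[0])
high = minutes_to_index(table_gb, MULTI_MIN_PER_GB[1])

print(f"default path.data (if it completed): {baseline:.0f} min")
print(f"ten data paths: {low:.0f} to {high:.0f} min")
```

Under these assumptions, the multi-path setup would take roughly two to two and a half hours for a load that would take about one hour at the default rate.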

I installed Elasticsearch 2.3.2, and I'm using the same version for the Maven dependency of the Scala driver used to launch Spark.

Thanks

james.baiera · June 30, 2016, 6:26pm

Hello!

Since path.data is a property of Elasticsearch itself, this topic is probably better suited for the regular Elasticsearch forum. Unless you are saying that it's only impacted when indexing data through SparkSQL?

jackProm · July 1, 2016, 7:39am

Actually, I have only tried working with Elasticsearch through Spark or Hive. Do you think it would be better to change my post?

james.baiera · July 1, 2016, 2:18pm

Considering the only thing changed between your runs is the Elasticsearch path.data property (which is not a Spark or Hive property), I think you'll get faster and better feedback from the regular Elasticsearch forum. I can see if I can move the topic for you.
