I am trying to load a Hive table (1.6 billion rows, 380 columns) into Elasticsearch using ES-Hadoop, writing to two indices with 16 shards each.
- Refresh interval is set to 60s
- Replication is set to 0
- Indices buffer size 30%
- Batch size bytes 10mb
- Batch size count 200
- Compression - best compression
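For reference, this is roughly how I am applying these settings — a sketch, not the exact DDL: the index-level settings (refresh interval, replicas, codec) go in via the REST API before the load, and the batch settings via ES-Hadoop `TBLPROPERTIES` on the Hive table. Index name, host, and the column list are placeholders.

```shell
# Index-level settings, applied before the load (host/index are placeholders)
curl -XPUT 'http://es-host:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 16,
    "number_of_replicas": 0,
    "refresh_interval": "60s",
    "index.codec": "best_compression"
  }
}'

# Hive-side ES-Hadoop batch settings (column list elided)
hive -e "
CREATE EXTERNAL TABLE es_target (col1 STRING, col2 BIGINT /* ... 380 columns ... */)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes'              = 'es-host',
  'es.resource'           = 'my_index',
  'es.batch.size.bytes'   = '10mb',
  'es.batch.size.entries' = '200'
);
"
```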
Cluster: 3 master nodes, 4 data/ingest nodes (64 GB RAM each, 32 GB heap)
Earlier, with 8 shards, I was getting an average of 5k rows/sec. After increasing to 16 shards, throughput fluctuates between 1k and 3k rows/sec.
Switching back to the default compression codec gave only a minimal improvement.
Any suggestions on how to debug this? And are there any design considerations that would keep the indices performant for search while improving bulk indexing throughput?