Bulk indexing 1.6 billion rows with ~400 columns using ES-Hadoop Hive

Hi All,

I am trying to load a Hive table (1.6 billion rows, 380 columns) into two Elasticsearch indices with 16 shards, using the ES-Hadoop Hive integration.

  • Refresh interval: 60s
  • Replicas: 0
  • Indices buffer size: 30%
  • Batch size (bytes): 10 MB
  • Batch size (count): 200 documents
  • Compression: best_compression
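For context, settings like these are typically split between the Hive table definition (the es-hadoop `es.batch.size.bytes` and `es.batch.size.entries` connector properties) and the Elasticsearch index settings. A rough sketch, assuming a placeholder index name and host (config fragment only, not the poster's actual DDL):

```sql
-- Hive side: external table backed by es-hadoop (index/host names are placeholders)
CREATE EXTERNAL TABLE my_es_table (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'            = 'my_index',   -- target index (placeholder)
  'es.nodes'               = 'es-host:9200',
  'es.batch.size.bytes'    = '10mb',       -- flush when a bulk request reaches 10 MB
  'es.batch.size.entries'  = '200'         -- ...or when it reaches 200 documents
);

-- Elasticsearch side (index settings, shown here as a comment for reference):
--   index.refresh_interval: 60s
--   index.number_of_replicas: 0
--   index.codec: best_compression
-- The 30% indexing buffer is a node-level setting (indices.memory.index_buffer_size).
```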

Cluster: 3 master nodes, 4 data/ingest nodes (64 GB RAM, 32 GB heap)

Earlier, with 8 shards, I was getting ~5k rows/sec on average. After increasing the shard count it fluctuates between 1k and 3k rows/sec.

I also tried the default compression instead; it gave only a minimal improvement.

Any suggestions on how to debug this? Are there design considerations that would make the indices better for search while also improving bulk indexing throughput?


It's possible that the number of columns here is skewing your write throughput lower. Each of these records takes up more space in a bulk request than a smaller one would. When you introduce more shards, each bulk request has to be split into more shard-level bulk requests, each containing fewer documents than before. Even though the per-shard document counts are low and should generally be processed quickly, the total time to write those documents can still be higher because of their size. Ultimately, you might just be running up against the overhead of transmitting and writing the larger documents. Do you know which of the two batch-size limits the connector is hitting more often when it flushes? If it is the entry count, perhaps increase that value a bit.
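A quick back-of-envelope calculation illustrates both points above, using the numbers from the question (the function names are mine, and the per-shard figures assume documents hash roughly uniformly across shards):

```python
def docs_per_shard_request(batch_entries: int, num_shards: int) -> float:
    """Average number of documents in each shard-level bulk request,
    assuming uniform routing across shards."""
    return batch_entries / num_shards

def max_avg_doc_size(batch_bytes: int, batch_entries: int) -> float:
    """Average document size (bytes) at which the 10 MB byte limit,
    rather than the entry count, triggers the flush."""
    return batch_bytes / batch_entries

# 200 docs per connector bulk request, split across shards:
print(docs_per_shard_request(200, 8))    # 8 shards  -> 25.0 docs per shard request
print(docs_per_shard_request(200, 16))   # 16 shards -> 12.5 docs per shard request

# With a 10 MB byte limit and 200 entries, documents averaging more than
# ~52 KB will make the byte limit fire before the count limit does:
print(max_avg_doc_size(10 * 1024 * 1024, 200))  # -> 52428.8 bytes
```

So with 380 columns, fairly modest per-row sizes are enough to make the byte limit the effective flush trigger, which is why checking which limit fires is a useful first diagnostic.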

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.