I have been experimenting with various settings to speed up bulk
loading of 30 million medium-sized documents into a two-node (for now)
cluster. The eventual goal is to periodically rebuild the entire index
into a new one, while preserving search on the current index via an
alias.
First pass was a simple single bulk indexer called via multiple worker
threads, which was adequate but far from ideal. I then switched to
each worker thread having its own bulk indexer. Eventually each worker
swamped ES with too many indexing requests.
If I understand correctly, calling execute() with an ActionListener
runs asynchronously, while execute().actionGet() is synchronous
(blocking)? It seems that I was starting too many bulk indexing
requests with the asynchronous call, and each took too long to
execute.
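
One way to keep the asynchronous style without flooding the cluster is
to cap the number of bulk requests in flight. The following is only a
rough sketch using the JDK, not the Elasticsearch API: the class
BoundedBulkSubmitter, its sendAsync parameter, and the fake cluster in
demo() are all hypothetical stand-ins for an asynchronous bulk call
such as execute() with an ActionListener.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: limits how many asynchronous bulk calls are pending.
final class BoundedBulkSubmitter {
    private final int maxInFlight;
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();
    private final AtomicInteger completed = new AtomicInteger();

    BoundedBulkSubmitter(int maxInFlight) {
        this.maxInFlight = maxInFlight;
        this.permits = new Semaphore(maxInFlight);
    }

    // sendAsync stands in for the real asynchronous bulk call (an assumption).
    // Blocks the caller once maxInFlight requests are already pending.
    void submit(Supplier<CompletableFuture<Void>> sendAsync) {
        permits.acquireUninterruptibly();
        peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        CompletableFuture<Void> pending;
        try {
            pending = sendAsync.get();
        } catch (RuntimeException e) {
            // The request never started, so give the permit back.
            inFlight.decrementAndGet();
            permits.release();
            throw e;
        }
        // Runs on success or failure, like a listener's two callbacks.
        pending.whenComplete((ok, err) -> {
            inFlight.decrementAndGet();
            completed.incrementAndGet();
            permits.release();
        });
    }

    // Waits until every submitted request has finished. Assumes no other
    // thread is calling submit() at the same time.
    void awaitAll() {
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
    }

    int completedCount() { return completed.get(); }

    int peakInFlight() { return peak.get(); }

    // Simulated run: returns {completed requests, peak requests in flight}.
    static int[] demo(int batches, int maxInFlight) {
        ExecutorService fakeCluster = Executors.newFixedThreadPool(8);
        BoundedBulkSubmitter submitter = new BoundedBulkSubmitter(maxInFlight);
        try {
            for (int i = 0; i < batches; i++) {
                submitter.submit(() -> CompletableFuture.runAsync(
                        BoundedBulkSubmitter::pretendToIndex, fakeCluster));
            }
            submitter.awaitAll();
        } finally {
            fakeCluster.shutdown();
        }
        return new int[] {submitter.completedCount(), submitter.peakInFlight()};
    }

    // Placeholder for the time a real bulk request would take.
    private static void pretendToIndex() {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling BoundedBulkSubmitter.demo(20, 3) pushes 20 simulated batches
through with at most 3 pending at any time. With a limit of 1 the
submitter behaves much like the blocking execute().actionGet() style,
so the limit is the knob to tune between the two extremes.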
Next was TransportClient versus Node, but there seemed to be little
difference. However, it seems that all index requests were sent to a
single server rather than round-robin. Are only searches round-robined,
or should index requests be as well? Would it make sense to direct all
index requests to a single server anyway?
The next step is to experiment with index settings such as refresh
and the translog. For the various translog settings, which one takes
priority, the one whose threshold is reached first or last? For
example, what happens if the size threshold is hit before the
operation-count threshold?
Is a bulk index of 10 documents 1 operation or 10? Is the
index.gateway.local.sync setting still used? I do not see any
references to it in the code.
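
To make the question concrete, here is a sketch of the settings
involved, written as plain Java maps so it does not depend on any
client version. The class BulkLoadSettings and every value in it are
placeholders chosen for illustration, not recommendations, and the
setting names are as I understand them for Elasticsearch versions of
this era, so check them against the version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for the index settings under discussion.
final class BulkLoadSettings {

    // Settings to apply to the new index while the bulk load runs.
    static Map<String, String> duringLoad() {
        Map<String, String> settings = new LinkedHashMap<>();
        // -1 turns off periodic refresh so new segments are not
        // opened for search while loading.
        settings.put("index.refresh_interval", "-1");
        // The three translog flush thresholds asked about above
        // (placeholder values).
        settings.put("index.translog.flush_threshold_ops", "50000");
        settings.put("index.translog.flush_threshold_size", "500mb");
        settings.put("index.translog.flush_threshold_period", "30m");
        return settings;
    }

    // Settings to restore once the load is done and before the
    // alias is switched to the new index.
    static Map<String, String> afterLoad() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("index.refresh_interval", "1s");
        return settings;
    }
}
```

The maps would be passed to whatever update-settings call the client
offers; that call is deliberately left out here.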