Bulk indexing tips

Been experimenting with various settings to speed up the bulk loading of
30 million medium-sized documents into a two-node (for now) cluster. The
eventual goal is to periodically recreate the entire index as a new
one, while preserving search on the current index via an alias.

The first pass was a single bulk indexer called from multiple worker
threads, which was adequate but far from ideal. I then switched to
giving each worker thread its own bulk indexer. Eventually the workers
swamped ES with too many indexing requests.

If I understand correctly, calling execute() with an ActionListener
executes asynchronously, while execute().actionGet() is synchronous
(blocking)? It seems that I was starting too many bulk indexing
threads with the asynchronous call, which took too long to execute.
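For reference, a minimal sketch of the two calling styles being compared, assuming the 0.x-era Java client (where ActionListener#onFailure takes a Throwable) and a BulkRequestBuilder built elsewhere:

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;

public class BulkCalls {

    // Synchronous: actionGet() blocks the worker thread until the bulk
    // completes, which naturally throttles how many bulks are in flight.
    static void syncBulk(BulkRequestBuilder bulk) {
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
    }

    // Asynchronous: execute(listener) returns immediately, so a loop that
    // keeps submitting bulks this way can easily swamp the cluster.
    static void asyncBulk(BulkRequestBuilder bulk) {
        bulk.execute(new ActionListener<BulkResponse>() {
            public void onResponse(BulkResponse response) {
                if (response.hasFailures()) {
                    System.err.println(response.buildFailureMessage());
                }
            }

            public void onFailure(Throwable t) {
                t.printStackTrace();
            }
        });
    }
}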

Next I compared the TransportClient against the Node client, but there
seemed to be little difference. However, it appears that all index
requests were sent to a single server rather than round-robined. Are
only searches round-robined, or should index requests be as well? Would
it make sense to direct all index requests to a single server anyway?

The next step is to experiment with index settings such as refresh
and the translog. For the various translog settings, which threshold
takes priority, the one that is reached first or last? For example,
what happens if the threshold size hits its limit before the number of
operations does? Is a bulk index of 10 documents 1 operation or 10? Is
the index.gateway.local.sync setting still used? I do not see any
references to it in the code.
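For reference, a hedged sketch of where those translog thresholds live, set per index through the Java update-settings API. The setting names (index.translog.flush_threshold_ops/_size/_period) are my assumption from the 0.x-era docs, "myindex" is a placeholder, and ImmutableSettings.settingsBuilder() is the old builder (newer clients use Settings.builder()):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class TranslogSettings {
    static void tuneTranslog(Client client) {
        client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                // flush after this many operations ...
                .put("index.translog.flush_threshold_ops", 20000)
                // ... or once the translog reaches this size ...
                .put("index.translog.flush_threshold_size", "500mb")
                // ... or after this much time; whichever threshold is hit
                // first triggers the flush (per Shay's note later in the thread)
                .put("index.translog.flush_threshold_period", "30m")
                .build())
            .execute().actionGet();
    }
}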

Cheers,

Ivan

What is your setup (no. of shards/indices)? How many items do you
include in one bulk index request?

As one tuning option you could increase Lucene's merge-segment number
to 20 or 30 (and switch back to 10 afterwards).
The refresh interval for real-time latency can also be increased to 10
seconds or more, or even disabled (-1), to get faster indexing.
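A rough sketch of applying both suggestions from the Java API; the index name is a placeholder, and ImmutableSettings.settingsBuilder() is the 0.x/1.x builder (newer clients use Settings.builder()):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class BulkTuning {
    // Loosen merging and disable the async refresh while bulk loading.
    static void beforeBulk(Client client) {
        client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.merge.policy.segments_per_tier", 30)
                .put("index.refresh_interval", "-1")
                .build())
            .execute().actionGet();
    }

    // Switch back to the defaults once the load is finished.
    static void afterBulk(Client client) {
        client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.merge.policy.segments_per_tier", 10)
                .put("index.refresh_interval", "1s")
                .build())
            .execute().actionGet();
    }
}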

If I understand correctly, calling execute() with an ActionListener
executes asynchronously, while execute().actionGet() is synchronous
(blocking)?

Yes

Is a bulk index of 10 documents 1 operation or 10?

What do you mean here? They will be sent as a chunk to the server, but
every document will be added via luceneWriter.updateDocument()

Peter.

On Tue, Dec 13, 2011 at 2:39 AM, Karussell tableyourtime@googlemail.com wrote:

What is your setup (no. of shards/indices)? How many items do you
include in one bulk index request?

Currently using the default of 5 shards and 1 replica for only 1
index. The idea is to first get all the data loaded in order to
benchmark various queries, after which we will decide on the number of
nodes and shards. The 30 million document index is our main concern,
but there will be other indices in the system as well. The bulk index
request contains 50 documents, which was set somewhat arbitrarily.

As one tuning option you could increase Lucene's merge-segment number
to 20 or 30 (and switch back to 10 afterwards).
The refresh interval for real-time latency can also be increased to 10
seconds or more, or even disabled (-1), to get faster indexing.

I knew about the refresh interval (although I did not use it), but did
not think about the number of merge segments. Is there a direct method
for setting the refresh interval in the Java API, or is it simply a
setting?

client.admin().indices().updateSettings(...)
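Expanded a little, that call might look like the following request-object sketch; the index name is a placeholder, and both the package of UpdateSettingsRequest and the settings builder class have moved between client versions:

import org.elasticsearch.action.admin.indices.settings.UpdateSettingsRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class RefreshIntervalUpdate {
    static void disableRefresh(Client client) {
        // refresh_interval is a dynamic setting, so no index restart is needed.
        client.admin().indices().updateSettings(
            new UpdateSettingsRequest("myindex").settings(
                ImmutableSettings.settingsBuilder()
                    .put("index.refresh_interval", "-1")
                    .build()))
            .actionGet();
    }
}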

What do you mean here? They will be sent as a chunk to the server, but
every document will be added via luceneWriter.updateDocument()

The translog setting is in terms of "operations" and I was not sure at
what level an operation is defined. Sounds like a single bulk index
request of N documents will be N operations.

Ivan

On Tue, Dec 13, 2011 at 2:39 AM, Karussell tableyourtime@googlemail.com wrote:

Also the refresh interval for real-time latency can be increased to 10
seconds or more, or even disabled (-1), to get faster indexing.

Forgot to add: what would the difference be between setting the
refresh interval to -1 per index versus setting the refresh flag per
request?

indexRequest.refresh(false)
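To illustrate the two knobs being compared: the per-request flag forces a refresh after that one request, while index.refresh_interval = -1 turns off the periodic background refresh for the whole index. A minimal sketch, with index/type names as placeholders and setRefresh() assumed from the 0.x/1.x builder API:

import org.elasticsearch.client.Client;

public class RefreshFlags {
    static void indexOne(Client client, String json) {
        // Per-request flag: defaults to false; true would force a refresh
        // right after this single request completes.
        client.prepareIndex("myindex", "mytype")
            .setSource(json)
            .setRefresh(false)
            .execute().actionGet();
    }
}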

--
Ivan

Heya,

First, here are some pointers to improve indexing performance:

  1. One of the simplest and most effective ones is to simply start with an
    index with no replicas. Once indexing is done, increase the number of
    replicas to the number you want to have. This will reduce the load when
    indexing (a sketch of this follows the list).

  2. The refresh interval setting can help. Setting it to -1 (using the
    update settings API) means that the async refresh of the index
    will not happen (you can still explicitly call refresh, though). The refresh
    flag on the index request defaults to false, and it's there to force a refresh
    after an index request has executed. It does not relate to what you are doing here.

  3. The translog gets flushed once one of the different thresholds is met. I
    don't think you need to play with it too much (because large Lucene commits
    can actually stall indexing, since they are blocking).

  4. The default merge policy (tiered; see the merge policy documentation) is
    pretty good for bulk indexing as well. You can play with a higher value
    for segments_per_tier (defaults to 10). These settings can be dynamically
    set using the update settings API as well.
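Here is the sketch referenced in pointer 1: raising number_of_replicas back up after the bulk load finishes, via the update-settings API. The index name and the final replica count are placeholders; number_of_replicas is a dynamic setting:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class ReplicaBump {
    // Call this after the bulk load finishes to go from 0 replicas back to n.
    static void enableReplicas(Client client, int n) {
        client.admin().indices().prepareUpdateSettings("myindex")
            .setSettings(ImmutableSettings.settingsBuilder()
                .put("index.number_of_replicas", n)
                .build())
            .execute().actionGet();
    }
}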

On Tue, Dec 13, 2011 at 7:54 PM, Ivan Brusic ivan@brusic.com wrote:

On Tue, Dec 13, 2011 at 2:39 AM, Karussell tableyourtime@googlemail.com wrote:

Also the refresh interval for real-time latency can be increased to 10
seconds or more, or even disabled (-1), to get faster indexing.

Forgot to add: what would the difference be between setting the
refresh interval to -1 per index versus setting the refresh flag per
request?

indexRequest.refresh(false)

--
Ivan

On Fri, Dec 16, 2011 at 7:20 AM, Shay Banon kimchy@gmail.com wrote:

First, here are some pointers to improve indexing performance:

Thanks for the response. I've been gone from the list for a couple of
months, but I have slowly scanned most of what I missed.
Unbelievable amount of great new features in that short amount of
time.

  1. One of the simplest and most effective ones is to simply start with an
    index with no replicas. Once indexing is done, increase the number of replicas
    to the number you want to have. This will reduce the load when indexing.

I just played around with setting the number of replicas. The workflow
should be:

  1. create a new index with number_of_replicas=0
  2. bulk index
  3. set number_of_replicas = n (TBD)
  4. move search alias to new index
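For step 4, the alias move can be done atomically in a single aliases request, along these lines (the alias and index names are placeholders, and this is only a sketch of the API, not our actual code):

import org.elasticsearch.client.Client;

public class AliasSwap {
    // Point the search alias at the freshly built index and drop it from the
    // old one in one request, so searches never see a window with no alias.
    static void moveAlias(Client client, String oldIndex, String newIndex) {
        client.admin().indices().prepareAliases()
            .addAlias(newIndex, "search")
            .removeAlias(oldIndex, "search")
            .execute().actionGet();
    }
}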

On my current 20GB test index (2 nodes, 5 shards), setting
number_of_replicas to 0 will split the shards between the two nodes.
Then setting number_of_replicas to 1 will cause the shards to be
replicated, pushing 20GB of data between the two machines. My
questions are:

For step 2, with no replicas, should bulk indexing occur on only one
node, or should it round-robin between all nodes? The former seems like
the obvious choice, but I am checking whether I am missing out on some
benefit by choosing the latter.

Is the new index searchable, at an acceptable performance level, just
after setting number_of_replicas to something other than 0? For
example, if I set number_of_replicas to 2 on a 4 node cluster, would
it be safe to use the index, or should I wait for the cluster state to
return to green? Is there a listener for cluster state notifications?
I would prefer not to have to busy-wait on the cluster health.
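One hedged way to avoid the busy wait is to let the cluster health API itself block until the status is green. A minimal sketch; the index name and timeout are placeholders, and the response accessor names differ slightly between client versions:

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

public class WaitForGreen {
    // The health request blocks on the server until the index reaches green
    // (or the timeout expires), so no client-side polling loop is needed.
    static void waitForGreen(Client client) {
        ClusterHealthResponse health = client.admin().cluster()
            .prepareHealth("myindex")
            .setWaitForGreenStatus()
            .setTimeout(TimeValue.timeValueMinutes(10))
            .execute().actionGet();
        System.out.println("cluster status: " + health.getStatus());
    }
}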

--
Ivan

Answers below:

On Fri, Dec 16, 2011 at 10:12 PM, Ivan Brusic ivan@brusic.com wrote:

On Fri, Dec 16, 2011 at 7:20 AM, Shay Banon kimchy@gmail.com wrote:

First, here are some pointers to improve indexing performance:

Thanks for the response. I've been gone from the list for a couple of
months, but I have slowly scanned most of what I missed.
Unbelievable amount of great new features in that short amount of
time.

cheers!

  1. One of the simplest and most effective ones is to simply start with an
    index with no replicas. Once indexing is done, increase the number of
    replicas to the number you want to have. This will reduce the load when indexing.

I just played around with setting the number of replicas. The workflow
should be:

  1. create a new index with number_of_replicas=0
  2. bulk index
  3. set number_of_replicas = n (TBD)
  4. move search alias to new index

Sounds good.

On my current 20GB test index (2 nodes, 5 shards), setting
number_of_replicas to 0 will split the shards between the two nodes.
Then setting number_of_replicas to 1 will cause the shards to be
replicated, pushing 20GB of data between the two machines.

If you know you are going to index to 2 nodes, then I would use 4 or 6
shards so they will be even between the nodes and each node will do a
balanced share of the indexing.
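Following that suggestion, a sketch of creating the new index with 6 shards and no replicas up front; the index name is a placeholder, and unlike number_of_replicas, number_of_shards cannot be changed after creation:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class CreateBulkIndex {
    static void createIndex(Client client) {
        client.admin().indices().prepareCreate("myindex_v2")
            .setSettings(ImmutableSettings.settingsBuilder()
                // 6 primaries split evenly across the 2 nodes (3 each).
                .put("index.number_of_shards", 6)
                // No replicas during the bulk load; raised afterwards.
                .put("index.number_of_replicas", 0)
                .build())
            .execute().actionGet();
    }
}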

My questions are:

For step 2, with no replicas, should bulk indexing occur on only one
node, or should it round-robin between all nodes? The former seems like
the obvious choice, but I am checking whether I am missing out on some
benefit by choosing the latter.

You are using the Java client, right? If you use the NodeClient, it will
send the respective "shard bulks" directly to the shards; if you use the
TransportClient, it will round-robin to a node, and then that node will
break the bulk into the respective "shard bulks" and send them.
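For completeness, a sketch of how the two client types are typically constructed with the 0.x/1.x Java API; the cluster name, host, and port are placeholders, and the constructor signatures changed in later releases:

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class Clients {
    // Node client: joins the cluster as a non-data node and routes
    // "shard bulks" straight to the shards that own them.
    static Client nodeClient() {
        Node node = NodeBuilder.nodeBuilder()
            .clusterName("mycluster")
            .client(true)      // no data stored on this node
            .node();
        return node.client();
    }

    // Transport client: stays outside the cluster and round-robins requests
    // over the configured addresses; the receiving node splits the bulk.
    static Client transportClient() {
        TransportClient client = new TransportClient(
            ImmutableSettings.settingsBuilder()
                .put("cluster.name", "mycluster")
                .build());
        client.addTransportAddress(new InetSocketTransportAddress("es-host-1", 9300));
        return client;
    }
}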

Is the new index searchable, at an acceptable performance level, just
after setting number_of_replicas to something other than 0? For
example, if I set number_of_replicas to 2 on a 4 node cluster, would
it be safe to use the index, or should I wait for the cluster state to
return to green? Is there a listener for cluster state notifications?
I would prefer not to have to busy-wait on the cluster health.

Yes, the index is searchable. There will be a cost to allocating those
replicas and moving data to them. You can reduce that using some of the
settings described in the recovery documentation. For example, setting
node_concurrent_recoveries to 1, and
indices.recovery.concurrent_streams to 1.
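A hedged sketch of applying those two settings from the Java API. I am assuming the full key for the first one is cluster.routing.allocation.node_concurrent_recoveries; transient settings are reset on a full cluster restart:

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class RecoveryThrottle {
    static void slowDownReplicaAllocation(Client client) {
        client.admin().cluster().prepareUpdateSettings()
            .setTransientSettings(ImmutableSettings.settingsBuilder()
                // Limit concurrent shard recoveries per node.
                .put("cluster.routing.allocation.node_concurrent_recoveries", 1)
                // Limit concurrent recovery streams per node.
                .put("indices.recovery.concurrent_streams", 1)
                .build())
            .execute().actionGet();
    }
}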

--
Ivan