Speeding up indexing in ES 2.2.0

Hi All,
I am trying to test the performance of our service with ES 2.2.0 and I am seeing a slowdown in the indexing rate. We use SSDs for all of our data nodes. With the same setup on ES 1.7.3 we can index documents at an average of 70K/sec, but with ES 2.2.0 the rate drops to 40K/sec. What I am looking for is advice on what I can start looking at to get better indexing performance. Here is a description of our test setup, which is identical for both 1.7.3 and 2.2.0:

Service nodes: 3 x m2.4xlarge
ES master nodes: 3 x m3.2xlarge
ES data nodes: 9 x c3.8xlarge

We are using the default merge policy that ships with ES 2.2.0. For 1.7.3, we have set indices.store.throttle.type = none to speed up indexing. This setting (along with various others) has been removed in ES 2.x.

Any clues on what I can start looking into?

Thanks,
Madhav.

The default translog durability in ES 2.x changed to synchronous (the translog is fsynced on every request), so there is a performance hit.

If you want the same behavior as ES 1.x, set this in your ES config:

# riskier, but faster (same as ES pre-2.x; default in 2.x is 'request')
index.translog.durability: async
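
If I remember right, this is also a dynamic index setting in 2.x, so you could flip it per index via the update settings API instead of editing the config. A rough sketch with the Java client (the index name and the 'client' variable are placeholders, not anything from your setup):

// rough sketch: same setting applied per index via the update settings API
// (assumes an existing org.elasticsearch.client.Client named 'client')
client.admin().indices().prepareUpdateSettings("my_index")
    .setSettings(Settings.settingsBuilder()
        .put("index.translog.durability", "async")
        .build())
    .get();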

Thanks Tinle for the pointer!

I just tried it, and it has increased indexing speed a little, but not by much... it's still around the 42K/sec mark. Is there something else I can check?

Thanks,
Madhav.

You should be able to use a larger bulk size to get performance to the point where the default is only a little slower than async. So, yeah, my suggestion is to try larger bulk sizes in 2.x and to make sure your refresh interval is high-ish, like 30 seconds.
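
Something along these lines, as a rough sketch with the Java client (index name, type, batch sizes and the 'client'/'jsonDoc' variables are placeholders, not your actual values):

// raise the refresh interval while bulk loading
client.admin().indices().prepareUpdateSettings("my_index")
    .setSettings(Settings.settingsBuilder()
        .put("index.refresh_interval", "30s")
        .build())
    .get();

// feed docs through a BulkProcessor so you can experiment with larger batches
BulkProcessor bulk = BulkProcessor.builder(client, new BulkProcessor.Listener() {
    public void beforeBulk(long id, BulkRequest request) { }
    public void afterBulk(long id, BulkRequest request, BulkResponse response) {
        if (response.hasFailures()) {
            // inspect/retry failed items here
        }
    }
    public void afterBulk(long id, BulkRequest request, Throwable failure) { }
})
    .setBulkActions(5000)                                // docs per bulk request - tune this
    .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // or cap by payload size
    .setConcurrentRequests(2)
    .build();

bulk.add(new IndexRequest("my_index", "my_type").source(jsonDoc));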

This setting was moved to the index template, so you can set it there:

"settings": {
  "index": {
    "store": {
      "throttle": {
          "type": "none"
        }
      }
   }
}

You can also try disabling doc_values for not_analyzed fields (if you don't really need them). This can also improve the indexing rate and reduce disk usage.
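
For example, in the mapping, roughly like this (a sketch with the Java client; the index, type and field names are placeholders, and 'client' is an existing Client):

// rough sketch: disable doc_values on a not_analyzed field you never sort or aggregate on
XContentBuilder mapping = XContentFactory.jsonBuilder()
    .startObject()
      .startObject("my_type")
        .startObject("properties")
          .startObject("status")
            .field("type", "string")
            .field("index", "not_analyzed")
            .field("doc_values", false)
          .endObject()
        .endObject()
      .endObject()
    .endObject();

client.admin().indices().prepareCreate("my_index")
    .addMapping("my_type", mapping)
    .get();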

Thanks Nik & Rusty!

I will try these out and see what happens.

@Rusty - is this new setting dynamic? I do not see it documented anywhere...

Thanks,
Madhav.

My bad, this setting was removed in ES 2.2.

70k/sec - do you mean 70k docs per second, or 70 kilobytes per second?

Like @rusty I would recommend trying to disable doc_values for fields where applicable, to see if you experience a performance increase. More on doc_values: http://stackoverflow.com/questions/32332487/what-are-the-disadvantages-of-elasticsearch-doc-values

@jprante it's 70K docs per second.

@JoarSvensson we have already disabled doc_values wherever we could.

I am testing out increased batch sizes as we speak, I will post the results soon enough.

Thanks,
Madhav.

@mkelkar Any luck with increased batch sizes?

@JoarSvensson - I just finished doing a couple of large scale tests minutes ago...

  1. Increased the bulk index request batch size to 5k from the earlier 1k - this was a total failure. ES started throwing a bunch of exceptions, all of which contained this: 'NotSerializableExceptionWrapper[Failed to acknowledge mapping update within [30s]'
  2. I decreased the batch size to 4k - similar story. ES did not throw exceptions this time, but the bulk write rate was about 20K/sec, which is way lower than the ~40K/sec we get with a batch size of 1k (a rough sketch of how we build the batches is below)...
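
For context, our bulk writes are built roughly like this (a sketch, not the real code; the index, type, 'docs' collection and batch size are placeholders):

// rough sketch of how we batch the bulk writes ('client' is our ES client)
int batchSize = 1000;  // the 1k / 4k / 5k knob from the tests above
BulkRequestBuilder bulk = client.prepareBulk();
for (String jsonDoc : docs) {
    bulk.add(client.prepareIndex("my_index", "my_type").setSource(jsonDoc));
    if (bulk.numberOfActions() >= batchSize) {
        BulkResponse response = bulk.get();
        if (response.hasFailures()) {
            // the mapping-update ack failures above show up here
        }
        bulk = client.prepareBulk();
    }
}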

Also, I tried disabling auto-throttling on merges (index.merge.scheduler.auto_throttle = false) and setting index.merge.scheduler.max_thread_count to 1, which did not help either...

It's worth mentioning that our process runs in two phases: the first phase is indexing-heavy, and the second phase is query-heavy. The way it works is:

  1. It first indexes all documents in ES for a client (Phase 1).
  2. As soon as all docs are written for that client, it moves to Phase 2, where the docs are updated according to our business rules and indexed again.

What I have seen with 2.2.0 is that while Phase 1 is still running for some clients, Phase 2 begins processing way faster than in 1.7.3. My hunch is that in 1.7.3, because we set indices.store.throttle.type = none, indexing gets top priority, so we finish Phase 1 a lot quicker and Phase 2 processing is automatically slowed down while Phase 1 is running. But in 2.2.0, Phase 2 is way faster for some reason (most probably because indexing is no longer the top priority, and our queries are faster thanks to ES 2.2.0 optimizations)...

I would like to do something similar to 1.7.3, where we used to assign topmost priority to indexing, but I am not aware of any settings that would do that in ES 2.2.0. I have already tried (index.merge.scheduler.auto_throttle = false) and (index.merge.scheduler.max_thread_count = 1) with no luck... any clues on how to proceed further?
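
For reference, this is roughly how those merge settings get applied when the index is created (a sketch; the index name and 'client' variable are placeholders):

// rough sketch: merge scheduler settings set at index creation time
client.admin().indices().prepareCreate("my_index")
    .setSettings(Settings.settingsBuilder()
        .put("index.merge.scheduler.auto_throttle", false)
        .put("index.merge.scheduler.max_thread_count", 1)
        .build())
    .get();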

Thanks,
Madhav.

Interesting case indeed. I haven't come across settings to alter priority, as of yet anyway. It feels like that logic is part of the internals of ES.

The two things on top of my mind: first, whether you could do just one indexing pass, performing the Phase 2 logic in one combined step prior to indexing; second, whether you could add more nodes to help offload the cluster, or even run two separate clusters if possible, to be sure you're not doing heavy indexing and querying at the same time.

Hopefully someone else has better ideas.

We cannot perform the Phase 2 logic within Phase 1 because it depends on all documents being written to ES first. We denormalize documents in Phase 2, so if not all documents are available, our results would be incorrect.

We already have a pair of clusters because we did see this conflict between indexing and querying at the same time. We use one cluster just to transform our docs, and the second one just serves query traffic. However, because we have to do our doc processing in phases, we have the same conflict on our doc processing cluster with 2.2.0... with 1.x we do not have any problems whatsoever.

I was planning to throttle our Phase 2 processing until Phase 1 completes, but I wanted to see if there are better ideas for increasing the priority of indexing...

Thanks,
Madhav.

ES 2.x limits resources for a) bulk and b) segment merging (threads). There are good reasons for that, and there are no knobs to turn to enable something like "priority to indexing": in ES 1.x you could allocate more threads for bulk or segment merging than the JVM could handle, which had serious side effects.

I have a workload similar to yours. You can avoid extra load if you create a new index in Phase 2: indexing documents into an existing index (replacing existing docs) is more expensive than writing into an empty index. After that, switch an index alias from the old index to the new one.
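
The alias switch at the end is one atomic call, roughly like this (index and alias names are placeholders):

// rough sketch: atomically move the alias from the old index to the new one
client.admin().indices().prepareAliases()
    .removeAlias("docs_v1", "docs")
    .addAlias("docs_v2", "docs")
    .get();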

@jprante during Phase 2 I have no problems, it's Phase 1 that causes the problems...

Also, if I switch to a bulk size of 500 docs, I get a flood of these exceptions in the logs -

java.util.concurrent.TimeoutException: Failed to acknowledge mapping update within [30s]
at org.elasticsearch.cluster.action.index.MappingUpdatedAction.updateMappingOnMasterSynchronously(MappingUpdatedAction.java:122)
at org.elasticsearch.cluster.action.index.MappingUpdatedAction.updateMappingOnMasterSynchronously(MappingUpdatedAction.java:112)
at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:228)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:326)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:119)
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:68)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:595)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:263)
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:260)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

looks like it takes more than 30 seconds to ack the mapping update request...what should I start looking at changing?

Thanks,
Madhav.

FWIW, here is what I found -

We used to have an ES node client embedded in our service JVM. During bulk indexing, our service was doing a lot of work, which caused the embedded node client to stop responding to cluster state updates... switching to the transport client fixed the problem.
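
In case it helps anyone else, the 2.x transport client setup looks roughly like this (cluster name and host are placeholders for ours):

// rough sketch of the 2.x transport client setup
// uses org.elasticsearch.client.transport.TransportClient,
// org.elasticsearch.common.settings.Settings,
// org.elasticsearch.common.transport.InetSocketTransportAddress and java.net.InetAddress
Settings settings = Settings.settingsBuilder()
    .put("cluster.name", "my-cluster")
    .build();

TransportClient client = TransportClient.builder()
    .settings(settings)
    .build()
    .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("es-data-1"), 9300));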

The docs say that the index.store.throttle.type setting was removed, but I still see it in the IndexStore class and in the documentation at https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html.