[Resolved] Elasticsearch 2.1.0 indexing 40% slower than 1.7.0

We have upgraded from version 1.7.0 to 2.1.0. Running our regression test suite with the same settings as before (e.g. 1 shard/index, 0 replicas), we observe a 40% slowdown. I even tried simply sending some documents via the bulk REST endpoint and I observe a similar reduction in speed. Note that most of the tests simply index documents; there is little searching involved.

Is that expected?
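For reference, the standalone bulk check was along these lines (index name, type, and document contents here are placeholders rather than our real data):

# newline-delimited action/source pairs; the body must end with a newline
curl -s -XPOST 'localhost:9200/_bulk' --data-binary @- <<'EOF'
{ "index" : { "_index" : "perf-test", "_type" : "doc" } }
{ "field1" : "value1", "field2" : 42 }
{ "index" : { "_index" : "perf-test", "_type" : "doc" } }
{ "field1" : "value2", "field2" : 43 }
EOF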

I looked into new settings in the documentation and I noticed two things that could be causing the slowdown:

  • the DEFLATE (best_compression) codec: it's turned off by default, so it shouldn't be the reason
  • doc_values: I did some runs with a build where doc_values were disabled for all fields in the index mapping (see the sketch after this list) and observed no change in performance
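For illustration, the doc_values-disabled runs used a mapping roughly of this shape (index, type, and field names are placeholders for our real mapping):

curl -s -XPUT 'localhost:9200/perf-test' -d '{
  "settings" : { "number_of_shards" : 1, "number_of_replicas" : 0 },
  "mappings" : {
    "doc" : {
      "properties" : {
        "field1" : { "type" : "string", "index" : "not_analyzed", "doc_values" : false },
        "field2" : { "type" : "long", "doc_values" : false }
      }
    }
  }
}'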

Any ideas would be appreciated.

What are the exact steps that your test suite executes? Can you share a small reproduction?

Can you share any of the custom settings that might be relevant to performance?

Is that all it does, or does it by chance go through a cycle of creating an index, indexing some documents, deleting that index and repeating again with the same index name?

What are the exact steps that your test suite executes? Can you share a small reproduction?

I am afraid I cannot share it. However, the system mainly indexes a lot of documents into Elasticsearch. The tests exercise the system, which generates documents that end up being indexed. No deletions are involved.

Can you share any of the custom settings that might be relevant to performance?

Sure:
index.number_of_shards: 1
index.number_of_replicas: 0

The only other settings are cluster.name and enabling http.cors. A node client is used both before and after the upgrade to 2.1.0.

Is that all it does, or does it by chance go through a cycle of creating an index, indexing some documents, deleting that index and repeating again with the same index name?

I am aware of the problem with dropping and recreating an index, and I am looking forward to 2.1.1 having that fixed. However, this is not what is going on.

As I mentioned, even a simple bulk upload of documents directly through the REST API (not going through our system at all), with the above configuration, is slower than on 1.7.0. I have seen the publicly available Elasticsearch performance charts and everything there appears to be moving in the right direction, which is why I am puzzled by what I observe.

You don't have to share your proprietary code; a small reproduction that doesn't contain anything sensitive would be most helpful.

Are you using your own IDs or are you allowing Elasticsearch to create auto-IDs? Are you using a fixed set of fields, or are you creating random field names during the test run?
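In other words, is the indexing shaped more like the first request below or the second (index, type, and document are placeholders)?

# client-supplied ID
curl -s -XPUT 'localhost:9200/perf-test/doc/1' -d '{ "field1" : "value1" }'
# auto-generated ID
curl -s -XPOST 'localhost:9200/perf-test/doc/' -d '{ "field1" : "value1" }'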

This is the first report I've seen of a performance degradation of this nature, and there isn't a lot of information to go on so far.

OK, I will work on creating something reproducible that I can share and come back to you.

Do you use dynamic mapping extensively?

The default for index.translog.durability switched from async to request. Try flipping it back to async and see if that accounts for some of your performance change. The default changed from "if Elasticsearch crashes I may lose the last few seconds of changes" to "if Elasticsearch crashes I want my data there when it comes back up".
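If you want to try it, the setting is dynamic, so you can flip it on an existing index with something like this (index name is a placeholder):

curl -s -XPUT 'localhost:9200/perf-test/_settings' -d '{ "index.translog.durability" : "async" }'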

I think it's a better default, but it has a cost. I believe when we ran the numbers it was about a 7% hit, but that assumes you are using reasonably well-sized bulk requests. If you update document by document I can see a 40% hit, and I can really see it if you have spinning disks.

Also make sure you've set refresh_interval to 30s or something similarly large, or to -1. I don't imagine the default refresh interval has gotten less efficient, but I believe the load tests we did for request durability were done with refresh_interval set to something longer than the default.
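Same mechanism as above, e.g.:

curl -s -XPUT 'localhost:9200/perf-test/_settings' -d '{ "index.refresh_interval" : "30s" }'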

But I'm with @jasontedor: without a curl recreation that we can replay against both versions we're just shooting in the dark. If you can't boil the proprietary stuff out of your curl recreation, then try using the hot_threads API or jstack to figure out what is going on. It's what I'd do once I got your curl recreation if I couldn't figure it out just by looking at it.
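For reference, hot_threads is just a GET against the nodes API:

curl -s 'localhost:9200/_nodes/hot_threads'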

Thank you for explaining the index.translog.durability setting; it's good to know. However, it made no difference. While I am thinking of a way to give you something reproducible, I thought of checking and comparing the NodeStats between the two versions. On clean installs, I ran the same data through versions 1.6.0 and 2.1.0. I am attaching the NodeStats for both versions.
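For anyone wanting to compare on their own setup, these stats come from the node stats API:

curl -s 'localhost:9200/_nodes/stats?pretty'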

There are some interesting differences.

  1. Indexing - 1.6.0

    "indexing" : {
    "index_total" : 86797,
    "index_time_in_millis" : 7894,
    "index_current" : 0,
    "delete_total" : 0,
    "delete_time_in_millis" : 0,
    "delete_current" : 0,
    "noop_update_total" : 0,
    "is_throttled" : false,
    "throttle_time_in_millis" : 0
    },

Indexing - 2.1.0

"indexing" : {
  "index_total" : 86797,
  "index_time_in_millis" : 10204,
  "index_current" : 0,
  "index_failed" : 13,
  "delete_total" : 0,
  "delete_time_in_millis" : 0,
  "delete_current" : 0,
  "noop_update_total" : 0,
  "is_throttled" : false,
  "throttle_time_in_millis" : 0
},

I notice the new field "index_failed" which has a non-zero value. How should I interpret that?

  2. Merges - 1.6.0

    "merges" : {
    "current" : 0,
    "current_docs" : 0,
    "current_size_in_bytes" : 0,
    "total" : 12,
    "total_time_in_millis" : 1740,
    "total_docs" : 216424,
    "total_size_in_bytes" : 15248619
    },

Merges - 2.1.0

"merges" : {
  "current" : 0,
  "current_docs" : 0,
  "current_size_in_bytes" : 0,
  "total" : 13,
  "total_time_in_millis" : 4878,
  "total_docs" : 278771,
  "total_size_in_bytes" : 29956261,
  "total_stopped_time_in_millis" : 0,
  "total_throttled_time_in_millis" : 0,
  "total_auto_throttle_in_bytes" : 62914560
},

This is the most surprising. Although it's the same data, the number of merged documents has gone up and the merged size is roughly double. (I repeat that in both cases the config is 1 shard/index, 0 replicas.)

Are these giving a hint about what could be happening differently and explaining the slowdown?

The full files can be downloaded at:

https://s3.amazonaws.com/prelert-dimitris/node_stats/node_stats_210.json
https://s3.amazonaws.com/prelert-dimitris/node_stats/node_stats_160.json

Would it perhaps be this bug?

Ivan

On the same system?

Not exactly. It does indicate that considerably more time is spent merging in 2.1.0 than in 1.6.0, which more than accounts for the increase in indexing time, but it doesn't explain why. What are the specs of your disk and CPU?

That said, I notice that these tests only run for 7.894 and 10.204 seconds. If we were to get a reproduction, we would want one that runs for a lot longer than that. Can you reproduce this with the indexing running for a substantially longer period of time?

On the same system?

Yes.

What are the specs of your disk and CPU?

The machine I run that test on has a 4-core i7 @ 2.2 GHz and an SSD.

Update

I was impatient and wrong about the impact of setting index.translog.durability to async. I had only tested it with the same run I took the node stats from, which indeed has too short a duration. The regression tests were run against an install with the setting changed to async, and the times went back down to a level similar to version 1.7.0. The machine where the regression tests run has 15K rpm spinning disks, and the tests take ~20 min (they had gone up to ~28 min before I changed the setting back to async).

I will set up another test where I run a single job of long duration. I speculate that new features like merge auto-throttling might impact performance negatively at the beginning of an index's life, which is why I see this behaviour on short (and unrealistic) runs. Would that be a valid theory?

Not to forget: Thank you @jasontedor & @nik9000 for the great help you've been!

Yes, that's exactly what I had in mind when I asked the question about whether or not this reproduced on a longer run.

I'm glad that we were able to get to the bottom of this issue.

Sincerely, thank you for your question and your interest in Elasticsearch. :smile:

@jprante No, mapping is fixed upon index creation.