Huge speed difference between ES 2.4 and ES 6.6

Hello,
I am almost done with the migration of our old ES 2.4 cluster to the new ES 6.6.
So the time has come to perform some real-world tasks.
I have a java job which fetches data from ES and stores it in a file.
So the same job is now migrated to use the high level REST client. Other than that, there is no difference in the code (the queries are 100% the same).

The job works...
but...
it is more than 10 times slower compared to the same job against the old cluster.

There are some differences between the clusters as follows:
OLD: 34 nodes (4 cores per node), NEW: 10 nodes (8 cores per node)
OLD: uses less powerful hardware compared to the new one.
Both clusters use SSD drives
The new cluster runs inside docker containers, one container per node. Nothing else runs on the servers.
The indices in the new cluster have fewer shards (11 in the old, 6 in the new), so I have larger shards. There is no shard larger than 40 GB at the moment.
The data in both clusters is 100% the same (25 billion docs).

My test was against a single index (~800 million docs, 224 GB with a replication factor of 1).
During the test there is no I/O wait above 0.3-0.4, so I think the problem is not I/O related.
The CPU, however, is heavily utilized (above 80%).

The elasticsearch.yml is the default.
TL;DR

Am I missing some important configuration that needs to be done in elasticsearch.yml?
What could be the reason for such a huge difference in speed?
Can the REST client introduce such a huge difference?

Can you show the queries you are running and tell us a bit more about your data?

I have to explain more about the job itself.
The job takes as input an arbitrary number of numeric entityIds (for example 10,000) and a time interval.
Then I partition the list of IDs into chunks of 100.
For each partition I do:

  1. Fetch data from my index alias. This is one bool query with a range for the time interval and terms for the IDs:
    {
      "size": 3000,
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "must": [
                  {
                    "terms": {
                      "entity_id": [
                        1,
                        ....
                        100
                      ],
                      "boost": 1
                    }
                  },
                  {
                    "range": {
                      "ts": {
                        "from": 1550534400000,
                        "to": 1550620740000
                      }
                    }
                  },
                  {
                    "range": {
                      "mmsi": {
                        "from": 0,
                        "to": null
                      }
                    }
                  }
                ],
                "must_not": [
                  {
                    "term": {
                      "aclient_id": {
                        "value": -1
                      }
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "sort": [
        {
          "entity_id": {
            "order": "asc"
          }
        },
        {
          "ts": {
            "order": "asc"
          }
        }
      ]
    }
    This uses the scroll API.
  2. Pretty much the same query against another index (the main big index).
    Again I scroll through all the documents (a rough sketch of the scroll loop follows below).
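
Roughly, the scroll part of the job looks like this. It is only a simplified sketch of what I do per chunk of IDs, with a placeholder host and alias name and hard-coded example values; the real job writes each hit to a file and then repeats the same thing against the second index:

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.search.ClearScrollRequest;
    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchScrollRequest;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import org.elasticsearch.search.sort.SortOrder;

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    public class ScrollJobSketch {

        public static void main(String[] args) throws IOException {
            // Placeholder host; the real job connects to the cluster nodes.
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                // One chunk of up to 100 entity IDs (placeholder values).
                List<Long> idChunk = Arrays.asList(1L, 2L, 100L);

                SearchSourceBuilder source = new SearchSourceBuilder()
                        .size(3000)
                        .query(QueryBuilders.boolQuery()
                                .must(QueryBuilders.termsQuery("entity_id", idChunk))
                                .must(QueryBuilders.rangeQuery("ts")
                                        .gte(1550534400000L).lte(1550620740000L))
                                .must(QueryBuilders.rangeQuery("mmsi").gte(0))
                                .mustNot(QueryBuilders.termQuery("aclient_id", -1)))
                        .sort("entity_id", SortOrder.ASC)
                        .sort("ts", SortOrder.ASC);

                SearchRequest request = new SearchRequest("my_index_alias") // placeholder alias
                        .source(source)
                        .scroll(TimeValue.timeValueMinutes(1));

                SearchResponse response = client.search(request, RequestOptions.DEFAULT);
                String scrollId = response.getScrollId();
                SearchHit[] hits = response.getHits().getHits();

                while (hits != null && hits.length > 0) {
                    for (SearchHit hit : hits) {
                        // The real job writes hit.getSourceAsString() to a file here.
                    }
                    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId)
                            .scroll(TimeValue.timeValueMinutes(1));
                    response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
                    scrollId = response.getScrollId();
                    hits = response.getHits().getHits();
                }

                // Release the scroll context once the batch is done.
                ClearScrollRequest clear = new ClearScrollRequest();
                clear.addScrollId(scrollId);
                client.clearScroll(clear, RequestOptions.DEFAULT);
            }
        }
    }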

The job is not expected to be fast in general, as it fetches a large number of documents, but I did not expect such a huge difference between the two ES versions. What worries me as well is the huge CPU burn.

Alright,
I migrated the code to the native Java (transport) client instead of the REST client.

Things are way better. I get similar or better performance now.
It seems that the REST client is the problem.
I guess serializing a large number of documents is burning the CPU.

That's too bad, as the Java client will be deprecated in v8. I hope they find a solution, as it seems REST is not an option for fetching a lot of docs from ES.

@javanna would you mind taking a look at this? This is not the first time I'm hearing that.

@dadoonet, I am also curious how you squeeze better performance out of the REST client.
Maybe my use case is not the most common one, but it is also a very simple one.

I just need to scroll through a lot of documents.

I do see better performance, but that is when it comes to the bulk API.

I'm not using the scroll API so I can't say if it's better or worse, but I'm pretty sure I read here someone else saying it's slower for his use case.
Maybe you should reduce the size a bit?

On one hand the REST client is expected to be a bit slower than the transport client; on the other hand, it should not be 10 times slower. To do a better comparison, instead of using the transport client, it would be interesting to send some REST requests to Elasticsearch, for instance using curl or any other HTTP client, and compare those response times with the ones obtained when using the Java REST client. I suspect most of the time will be spent on handling JSON, but we should prove that with data.
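
For instance, something along these lines (just a sketch, with a placeholder host, index, and query body) would time the raw round-trip through the low-level RestClient, which does not parse the body into a SearchResponse, so the gap to the high level client is roughly the JSON handling cost:

    import org.apache.http.HttpHost;
    import org.apache.http.util.EntityUtils;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    import java.io.IOException;

    public class RawTimingSketch {
        public static void main(String[] args) throws IOException {
            // Placeholder host and index; use the same JSON body as the real query.
            try (RestClient lowLevel = RestClient.builder(
                    new HttpHost("localhost", 9200, "http")).build()) {

                Request raw = new Request("GET", "/my_index/_search");
                raw.setJsonEntity("{ \"size\": 3000, \"query\": { \"match_all\": {} } }");

                long start = System.nanoTime();
                Response response = lowLevel.performRequest(raw);
                // Drain the body so the timing includes reading the bytes off the
                // wire, but not parsing them into a SearchResponse.
                EntityUtils.consume(response.getEntity());
                long tookMs = (System.nanoTime() - start) / 1_000_000L;
                System.out.println("low-level request took " + tookMs + " ms");
            }
        }
    }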

Playing with different size parameters is indeed a good idea.

Also, @baz may want to take a look at this.

Cheers
Luca

The response times (using Sense) are similar to the old ES. I don't think this is related to the ES version.
I mean... it is a bit hard to test like that as the caches are cold in the new cluster, so my response times vary. The first 10 requests are slow (around 2000 ms, which is a bit too slow IMO), but after that a request takes < 100 ms, which is also the case in the old cluster.

Reducing the size will only make the job run longer, but it will not stop the heavy CPU burn.
I mean, the documents that I need to fetch are constant; playing with the size will just change how many requests I make to ES.

On the old cluster I use a size of 2000-3000 (which, if I recall correctly, is per shard!).
So I think a size of around 5000 (not per shard) would be reasonable.

Anyway, I will try reducing the size to... 1000? 2000?
I don't want to go too low, as this will make the job very slow.

Edit: could this be related to the serialization of the geo_shape type? This is the only field that is not a keyword or numeric.

Hi,

Just to verify, I cobbled together some high level REST client code vs. some Apache client HttpPost code using the Shakespeare index data set, which is trivial in complexity but has a decent amount of size to it (over 100k documents). Doing this both ways yielded almost the same time in millis, the Apache client of course being faster by approx 100 ms (over 2800 ms to traverse the whole thing with a size of 3000).

Given that you say the Sense queries are fine, we should be able to rule out the query itself, which means it's likely the deserialization of your documents from the wire. I'm definitely still investigating, as you should definitely not see a 10x slowdown via the REST client. I will be creating a new test with geo shapes next. If you can provide an example of how large the geo shapes are in terms of points / type, that would be great. I plan on starting with some large ones anyway, just to see if they cause a big slowdown.

Also, if you suspect that the geo stuff is causing the slowdown, remove it from the output to see whether the deserialization is the culprit. This could help point out whether the issue is the geo shapes or not, since they won't be returned in the query response.
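
If it helps, excluding the field is just a matter of source filtering; this is a minimal sketch, reusing the SearchSourceBuilder style from the earlier snippet and assuming the geo field is called "shape":

    // Exclude the geo field from _source so it is never serialized into the
    // search response. "shape" is a placeholder for the actual field name.
    SearchSourceBuilder source = new SearchSourceBuilder()
            .size(3000)
            .query(QueryBuilders.matchAllQuery())
            .fetchSource(null, new String[] { "shape" }); // includes, excludes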

Thank you for pointing this out!

I have now indexed 100k entries that contain geo_shapes and ran the same benchmark, and I am not seeing much of a difference. I'd love to see if removing the geo shapes from the output (as per my message above) makes things any faster. I'm sure you have found a bug; it's just hard to replicate without a dataset and query.

My geo_shapes are actually just points.
And my indices are much larger than 100k documents. I have close to 800 million in one index.

Yes, but I'm trying to exercise only the REST client. So my assumption is that you are pulling them back in 3k batches, right? I've done a scroll through 100k entries at a size of 3k and tried to see if there was a difference in the parsing on the REST client, with the geo_shapes, and I did not see any slowdown in the REST client.

What are you doing with the responses? Are you just leaving them as a SearchResponse? If you remove all processing and only iterate over the search response results for each batch, is the time still significantly slower via the REST client? I'd like to isolate the slowdown as much as possible, so if we can remove anything besides the actual REST client calls (the initial search request and all subsequent scroll requests), then maybe it can start to show us where the slowdown is.
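
A minimal isolation loop along those lines could look like this (just a sketch; "client" and "request" stand for your RestHighLevelClient and your scroll-enabled SearchRequest from the earlier snippet):

    // Run the initial search plus all scroll requests and do nothing with the
    // hits except count them, so any remaining slowdown has to come from the
    // client itself rather than downstream processing.
    long totalHits = 0;
    long start = System.nanoTime();
    SearchResponse page = client.search(request, RequestOptions.DEFAULT);
    while (page.getHits().getHits().length > 0) {
        totalHits += page.getHits().getHits().length;
        SearchScrollRequest next = new SearchScrollRequest(page.getScrollId())
                .scroll(TimeValue.timeValueMinutes(1));
        page = client.scroll(next, RequestOptions.DEFAULT);
    }
    System.out.println(totalHits + " hits in "
            + (System.nanoTime() - start) / 1_000_000L + " ms");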

Hi there,
I have to admit that there was a bug in my code overloading Elasticsearch with a lot of queries, which caused the high CPU burn on the cluster (plus some concurrency issues).

Today I fixed that, and I am now in the process of testing the REST vs. transport clients.

The first results look much better.
So far I see the REST client being a bit slower, but my tests are not over.

At least I don't see the high CPU burn on the cluster as before.

I also made a very simple test just scrolling through a matchAll() query, and I did not notice any problems with either the REST or the transport client.


OK, that is great news to hear. You fixed a bug and (hopefully) the REST client will suit your needs! Two wins in one go!

Thanks also for reporting this in the event it was a real bug, and for your help in trying to solve it. Keep us updated on how the tests finish up.
