Hello,
I am almost done with the migration of our old ES 2.4 cluster to the new ES 6.6.
So the time has come to perform some real-world tasks.
I have a Java job which fetches data from ES and stores it in a file.
The same job has now been migrated to use the high level REST client. Other than that, there is no difference in the code (the queries are 100% the same).
The job works...
but...
it is more than 10 times slower compared to the same job against the old cluster.
There are some differences between the clusters as follows:
OLD: 34 nodes (4 cores per node), NEW: 10 nodes (8 cores per node)
OLD: uses less powerful hardware compared to the new one.
Both clusters use SSD drives
The new cluster runs inside docker containers, one container per node. Nothing else runs on the servers.
The indices in the new cluster have fewer shards (11 in the old, 6 in the new), so I have larger shards. There is no shard larger than 40 GB at the moment.
The data in both clusters is 100% the same (25 billion docs).
My test was against a single index (~800 million docs, 224 GB with a replication factor of 1).
During the test the I/O wait never goes above 0.3-0.4, so I think the problem is not I/O related.
The CPU, however, is heavily utilized (above 80%).
The elasticsearch.yml is the default.
TL;DR
Am I missing some important configuration that needs to be done in the elasticsearch.yml?
What could be the reason for such a huge difference in speed?
Can the REST client introduce such a huge difference?
I have to explain more about the job itself.
The job takes as input an arbitrary number of numeric entityIds (for example 10000) and a time interval.
Then I partition the list of IDs into chunks of 100.
For each partition I do:
Fetch data from my index alias. This is one bool query with a range for the time interval and terms for the IDs:
{
  "size": 3000,
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "must": [
              {
                "terms": {
                  "entity_id": [
                    1,
                    ....
                    100
                  ],
                  "boost": 1
                }
              },
              {
                "range": {
                  "ts": {
                    "from": 1550534400000,
                    "to": 1550620740000
                  }
                }
              },
              {
                "range": {
                  "mmsi": {
                    "from": 0,
                    "to": null
                  }
                }
              }
            ],
            "must_not": [
              {
                "term": {
                  "aclient_id": {
                    "value": -1
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "entity_id": {
        "order": "asc"
      }
    },
    {
      "ts": {
        "order": "asc"
      }
    }
  ]
}
This uses scroll
Then pretty much the same query against another index (the main big index).
Again scrolling through all the documents.
The job is not expected to be fast in general, as it fetches a large amount of documents, but I did not expect such a huge difference between the two ES versions. What worries me as well is the huge CPU burn.
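Roughly, the per-chunk fetch looks like this (a simplified sketch; the index alias, field names and the 1-minute scroll keep-alive are placeholders rather than the exact production values):

import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;
import java.io.IOException;
import java.util.List;

// Fetch one chunk of (up to 100) entity ids for the given time interval and
// scroll through all matching documents.
void fetchChunk(RestHighLevelClient client, List<Long> idChunk, long from, long to) throws IOException {
    SearchRequest request = new SearchRequest("my_index_alias")
            .scroll(TimeValue.timeValueMinutes(1))
            .source(new SearchSourceBuilder()
                    .size(3000)
                    .query(QueryBuilders.boolQuery()
                            .must(QueryBuilders.termsQuery("entity_id", idChunk))
                            .must(QueryBuilders.rangeQuery("ts").gte(from).lte(to)))
                    .sort("entity_id", SortOrder.ASC)
                    .sort("ts", SortOrder.ASC));

    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    String scrollId = response.getScrollId();
    while (response.getHits().getHits().length > 0) {
        for (SearchHit hit : response.getHits().getHits()) {
            // write hit.getSourceAsString() to the output file
        }
        SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId)
                .scroll(TimeValue.timeValueMinutes(1));
        response = client.searchScroll(scrollRequest, RequestOptions.DEFAULT);
        scrollId = response.getScrollId();
    }
    ClearScrollRequest clear = new ClearScrollRequest();
    clear.addScrollId(scrollId);
    client.clearScroll(clear, RequestOptions.DEFAULT);
}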
Alright,
I migrated the code to the Java transport client (and not the REST client).
Things are way better. I get similar or better performance now.
It seems that the REST client is the problem.
I guess serializing a large number of documents is burning the CPU.
That's too bad, as the transport client will be deprecated in v8. I hope they find a solution, as it seems REST is not an option for fetching a lot of docs from ES.
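For reference, the transport client version of the same fetch is roughly this (a sketch; cluster name, host and field names are placeholders, and the snippet assumes it lives in a method that can throw Exception):

import java.net.InetAddress;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.sort.SortOrder;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

// Transport client variant of the same per-chunk fetch.
Settings settings = Settings.builder().put("cluster.name", "my_cluster").build();
PreBuiltTransportClient transportClient = new PreBuiltTransportClient(settings);
transportClient.addTransportAddress(new TransportAddress(InetAddress.getByName("es-node-1"), 9300));

SearchResponse response = transportClient.prepareSearch("my_index_alias")
        .setQuery(QueryBuilders.boolQuery()
                .must(QueryBuilders.termsQuery("entity_id", idChunk))
                .must(QueryBuilders.rangeQuery("ts").gte(from).lte(to)))
        .setSize(3000)
        .addSort("entity_id", SortOrder.ASC)
        .addSort("ts", SortOrder.ASC)
        .setScroll(TimeValue.timeValueMinutes(1))
        .get();

while (response.getHits().getHits().length > 0) {
    // ... write the hits to the output file ...
    response = transportClient.prepareSearchScroll(response.getScrollId())
            .setScroll(TimeValue.timeValueMinutes(1))
            .get();
}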
@dadoonet, I am also curious how you squeeze better performance out of the REST client.
Maybe my use case is not the most common one, but it is also a very simple one.
I do get better performance, but that's when it comes to the bulk API.
I'm not using the scroll API so I can't say if it's better or worse, but I'm pretty sure I read here someone else saying it's slower for their use case.
Maybe you should reduce the size a bit?
On one hand the REST client is expected to be a bit slower than the transport client; on the other hand, not 10 times slower. To make a better comparison, instead of using the transport client, it would be interesting to send some REST requests to Elasticsearch, for instance using curl or any other HTTP client, and compare those response times with the ones obtained when using the Java REST client. I suspect most of the time will be spent on handling JSON, but we should prove that with data.
Playing with different size parameters is indeed a good idea.
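For example, something like this (just a sketch; the host, index name and the queryJson string are placeholders) goes through the low-level RestClient only, so none of the high level client's response parsing is involved:

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Send the same JSON body and time only the raw HTTP round trip plus reading the body,
// without building any SearchResponse objects on the client side.
try (RestClient lowLevel = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
    Request request = new Request("GET", "/my_index_alias/_search");
    request.setJsonEntity(queryJson); // the same bool query as above, as a JSON string

    long start = System.nanoTime();
    Response response = lowLevel.performRequest(request);
    String body = EntityUtils.toString(response.getEntity());
    long tookMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println("raw REST round trip: " + tookMs + " ms, " + body.length() + " bytes");
}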
The response times (using Sense) are similar to the old ES. I don't think this is related to the ES version.
I mean... it is a bit hard to test like that as the caches are cold in the new cluster, so my response times vary. The first 10 requests are slow (around 2000 ms, which is a bit too slow IMO), but after that a request takes < 100 ms, which is the case in the old cluster.
Reducing the size will only make the job run longer, but it will not stop the heavy CPU burn.
I mean, the documents that I need to fetch are constant. Playing with the size will just change how many requests I make to ES.
On the old cluster I use a size of 2000-3000 (which, if I recall correctly, is per shard!).
So I think a size around 5000 (not per shard) would be reasonable.
Anyway, I will try reducing the size to... 1000? 2000?
I don't want to go too low as this will make the job very slow.
Edit: could this be related to the serialization of the geo_shape type? This is the only field that is not a keyword or numeric.
Just to verify, I cobbled together some high level REST client code vs some Apache client HttpPost code using the shakespeare index data set, which is trivial in complexity but has a decent amount of data in it (over 100k documents). Doing this both ways yielded almost the same time in millis, with the Apache client being faster by approx 100 ms (over 2800 ms to traverse the whole thing with a size of 3000).
Given that you say the Sense queries are sane, we should be able to rule out the query itself, which means it's likely the deserialization of your documents from the wire. I'm definitely still investigating, as you should definitely not have a 10x slowdown via the REST client. I will be creating a new test with geo shapes next. If you can provide any example of how large the geo shapes are in terms of points / type, that would be great. I plan on starting with some large ones anyway just to see if they cause a big slowdown.
Also, if you suspect that the geo stuff is causing the slowdown, then remove it from the output to see if it's causing the slowdown at deserialization. This could help point out whether the issue is the geo shapes or not, since they won't be returned in the query response.
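For instance, source filtering on the search request drops the geo field from the hits without touching the mapping (a sketch, assuming the geo_shape field is called "geometry" and queryBuilder is your existing bool query):

import org.elasticsearch.search.builder.SearchSourceBuilder;

// Exclude the geo_shape field from _source in the hits so the client never has to
// deserialize it ("geometry" is a placeholder for the actual field name).
SearchSourceBuilder sourceBuilder = new SearchSourceBuilder()
        .size(3000)
        .query(queryBuilder)
        .fetchSource(null, new String[] { "geometry" }); // includes = all, excludes = geometry

// Equivalent in the request body: "_source": { "excludes": ["geometry"] }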
I have now indexed 100k entries that contain geo_shapes and ran the same benchmark, and I am not seeing much of a difference. I'd love to see if removing the geo shapes from the output (as per my message above) makes things any faster. I'm sure you have found a bug, it's just hard to replicate without a dataset and query.
Yes, but I'm trying to only exercise the REST client. So my assumption is that you are pulling them back in 3k batches, right? I've done a scroll through 100k entries at a size of 3000 and tried to see if there was a difference in the parsing on the REST client, with the geo_shapes, and I did not see any slowdown in the REST client.
What are you doing with the responses? Are you just leaving them as a SearchResponse? If you remove all processing and only iterate over the search response results for each batch, is the time still significantly slower via the REST client? I'd like to isolate the slowdown as much as possible, so if we can remove anything besides the actual REST client calls (the initial search request and all subsequent scroll requests), then maybe it can start to show us where the slowdown is.
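Something along these lines (a sketch that reuses your existing client and search request and does nothing with the hits except count them) should show whether the time really goes into the REST client itself:

// Only exercise the REST client: no per-document processing, just count hits,
// so any remaining slowness points at request/response handling on the client.
long start = System.currentTimeMillis();
long totalHits = 0;
SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = response.getScrollId();
while (response.getHits().getHits().length > 0) {
    totalHits += response.getHits().getHits().length;
    response = client.searchScroll(
            new SearchScrollRequest(scrollId).scroll(TimeValue.timeValueMinutes(1)),
            RequestOptions.DEFAULT);
    scrollId = response.getScrollId();
}
System.out.println(totalHits + " hits in " + (System.currentTimeMillis() - start) + " ms");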
Hi there,
I have to admit that there was a bug in my code overloading Elasticsearch with a lot of queries, which caused the high CPU burn on the cluster (plus some concurrency issues).
Today I have fixed that and I am now in the process of testing the REST vs transport clients.
The first results look much better.
So far I see the REST client being a bit slower, but my tests are not over.
At least I don't see the high CPU burn on the cluster as before.
I also made a very simple test just scrolling through a matchAll() query and I did not notice any problems with either the REST or the transport client.