For requests with a very high number of terms, Elasticsearch takes multiple seconds to return an HTTP 400 error. We hit an issue with a production system that generated a terms query with 250K items, and it took ~16 seconds for Elasticsearch to return the HTTP 400. I would expect this validation error to be handled quickly.
Trying to replicate this locally, I found that a single instance handled this type of request very quickly, but when running a three-node docker cluster, the time it took Elasticsearch to handle the request increased by orders of magnitude as the shard count increased.
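For anyone wanting to reproduce this, a script along the following lines works. This is a simplified sketch rather than the exact make_query.sh; the index name, field name, and term generation are placeholders:

```bash
#!/usr/bin/env bash
# Simplified sketch of a reproduction script (not the exact make_query.sh);
# the index name, field name, and term generation are placeholders.
set -euo pipefail

ES_URL="${ES_URL:-http://localhost:9200}"
SHARD_COUNT="${SHARD_COUNT:-1}"
TERM_COUNT="${TERM_COUNT:-200000}"
INDEX="terms-repro"

# Recreate an empty index with the requested number of primary shards.
curl -s -X DELETE "$ES_URL/$INDEX" > /dev/null || true
curl -s -X PUT "$ES_URL/$INDEX" -H 'Content-Type: application/json' \
  -d "{\"settings\":{\"number_of_shards\":$SHARD_COUNT,\"number_of_replicas\":0}}" > /dev/null

# Build a terms query with TERM_COUNT values. This exceeds the default
# index.max_terms_count (65536), so the search is rejected with HTTP 400.
query_file=$(mktemp)
{
  printf '{"query":{"terms":{"some_field":['
  seq 1 "$TERM_COUNT" | sed 's/.*/"&"/' | paste -sd, -
  printf ']}}}'
} > "$query_file"

# Time the search request end to end and print the HTTP status code.
time curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST "$ES_URL/$INDEX/_search" \
  -H 'Content-Type: application/json' --data-binary "@$query_file"

rm -f "$query_file"
```

The tables below show mean/min/max timings over repeated runs of this command for each SHARD_COUNT value.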
With a single docker instance handling a query with 200k terms, performance is even across the board regardless of shard count:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---:|---:|---:|---:|
| `SHARD_COUNT=1 bash ./make_query.sh` | 352.3 ± 36.6 | 316.0 | 399.4 | 1.23 ± 0.14 |
| `SHARD_COUNT=5 bash ./make_query.sh` | 352.9 ± 53.0 | 292.5 | 412.6 | 1.24 ± 0.19 |
| `SHARD_COUNT=10 bash ./make_query.sh` | 330.6 ± 83.8 | 257.8 | 474.5 | 1.16 ± 0.30 |
| `SHARD_COUNT=25 bash ./make_query.sh` | 291.3 ± 22.6 | 263.8 | 323.2 | 1.02 ± 0.09 |
| `SHARD_COUNT=50 bash ./make_query.sh` | 302.9 ± 35.8 | 271.8 | 364.6 | 1.06 ± 0.13 |
| `SHARD_COUNT=100 bash ./make_query.sh` | 346.9 ± 102.2 | 277.9 | 527.9 | 1.22 ± 0.36 |
| `SHARD_COUNT=150 bash ./make_query.sh` | 285.3 ± 11.3 | 272.3 | 301.8 | 1.00 |
| `SHARD_COUNT=200 bash ./make_query.sh` | 299.4 ± 15.9 | 290.2 | 327.7 | 1.05 ± 0.07 |
But with a three-node cluster, times increase dramatically as the shard count increases. This is all against an empty index, just to return a validation error:
| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|---|---:|---:|---:|---:|
| `SHARD_COUNT=1 bash ./make_query.sh` | 393.2 ± 96.9 | 318.8 | 560.8 | 1.00 |
| `SHARD_COUNT=5 bash ./make_query.sh` | 500.3 ± 63.6 | 421.9 | 580.3 | 1.27 ± 0.35 |
| `SHARD_COUNT=10 bash ./make_query.sh` | 729.3 ± 107.9 | 603.1 | 871.1 | 1.86 ± 0.53 |
| `SHARD_COUNT=25 bash ./make_query.sh` | 1110.2 ± 83.0 | 1059.1 | 1255.5 | 2.82 ± 0.73 |
| `SHARD_COUNT=50 bash ./make_query.sh` | 1763.3 ± 110.9 | 1574.8 | 1868.9 | 4.49 ± 1.14 |
| `SHARD_COUNT=100 bash ./make_query.sh` | 3552.0 ± 148.6 | 3374.9 | 3785.6 | 9.03 ± 2.26 |
| `SHARD_COUNT=150 bash ./make_query.sh` | 10446.9 ± 485.8 | 9726.2 | 10856.2 | 26.57 ± 6.67 |
| `SHARD_COUNT=200 bash ./make_query.sh` | 13380.0 ± 326.5 | 13130.7 | 13933.6 | 34.03 ± 8.43 |
Even with an extreme example like this, I would not expect Elasticsearch to take 13 seconds against an empty index just to return a validation error. I'm happy to dig into this more; are there any areas of the code base where you would start looking?
This is a really interesting topic, thanks for raising it and including all the extra info you did.
There are quite a few people on leave at the moment, so chances are it may not get attention until next week.
I ran the provided benchmarks and captured stack dumps, and saw it spending quite a bit of its time sending can-match requests; stuff like this:
at org.elasticsearch.common.io.stream.BytesStreamOutput.writeByte(BytesStreamOutput.java:78)
at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:520)
at org.elasticsearch.transport.CompressibleBytesOutputStream.writeByte(CompressibleBytesOutputStream.java:81)
at org.elasticsearch.common.io.stream.StreamOutput.lambda$static$25(StreamOutput.java:777)
at org.elasticsearch.common.io.stream.StreamOutput$$Lambda$1308/0x000000080108bc40.write(Unknown Source)
at org.elasticsearch.common.io.stream.StreamOutput.writeGenericValue(StreamOutput.java:846)
at org.elasticsearch.common.io.stream.StreamOutput.lambda$static$12(StreamOutput.java:709)
at org.elasticsearch.common.io.stream.StreamOutput$$Lambda$1295/0x0000000801088840.write(Unknown Source)
at org.elasticsearch.common.io.stream.StreamOutput.writeGenericValue(StreamOutput.java:846)
at org.elasticsearch.index.query.TermsQueryBuilder.doWriteTo(TermsQueryBuilder.java:198)
at org.elasticsearch.index.query.AbstractQueryBuilder.writeTo(AbstractQueryBuilder.java:79)
at org.elasticsearch.common.io.stream.StreamOutput.writeNamedWriteable(StreamOutput.java:1115)
at org.elasticsearch.index.query.AbstractQueryBuilder.writeQueries(AbstractQueryBuilder.java:256)
at org.elasticsearch.index.query.BoolQueryBuilder.doWriteTo(BoolQueryBuilder.java:103)
at org.elasticsearch.index.query.AbstractQueryBuilder.writeTo(AbstractQueryBuilder.java:79)
at org.elasticsearch.common.io.stream.StreamOutput.writeNamedWriteable(StreamOutput.java:1115)
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalNamedWriteable(StreamOutput.java:1126)
at org.elasticsearch.search.builder.SearchSourceBuilder.writeTo(SearchSourceBuilder.java:301)
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:976)
at org.elasticsearch.search.internal.ShardSearchRequest.innerWriteTo(ShardSearchRequest.java:266)
at org.elasticsearch.search.internal.ShardSearchRequest.writeTo(ShardSearchRequest.java:255)
at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:87)
at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:64)
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:163)
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:149)
at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:196)
at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:130)
at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:125)
at org.elasticsearch.transport.OutboundHandler.sendRequest(OutboundHandler.java:89)
at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:270)
at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:747)
at org.elasticsearch.transport.TransportService$$Lambda$3621/0x000000080164f040.sendRequest(Unknown Source)
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.sendWithUser(SecurityServerTransportInterceptor.java:162)
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.access$300(SecurityServerTransportInterceptor.java:53)
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.sendRequest(SecurityServerTransportInterceptor.java:129)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:661)
at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:712)
at org.elasticsearch.action.search.SearchTransportService.sendCanMatch(SearchTransportService.java:116)
at org.elasticsearch.action.search.CanMatchPreFilterSearchPhase.executePhaseOnShard(CanMatchPreFilterSearchPhase.java:88)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:244)
at org.elasticsearch.action.search.AbstractSearchAsyncAction$$Lambda$5672/0x0000000801b04440.run(Unknown Source)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:279)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:215)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:365)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:182)
at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:685)
at org.elasticsearch.action.search.TransportSearchAction.executeLocalSearch(TransportSearchAction.java:511)
at org.elasticsearch.action.search.TransportSearchAction.lambda$executeRequest$3(TransportSearchAction.java:303)
at org.elasticsearch.action.search.TransportSearchAction$$Lambda$5629/0x0000000801af8c40.accept(Unknown Source)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:114)
at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:87)
at org.elasticsearch.action.search.TransportSearchAction.executeRequest(TransportSearchAction.java:346)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:229)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:105)
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:179)
There was also a warning logged about spending too much time on the transport thread:
[2021-01-05T18:34:03,672][WARN ][o.e.t.InboundHandler ] [node-0] handling inbound transport message [InboundMessage{Header{56}{7.10.1}{3369}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [5024ms] which is above the warn threshold of [5000ms]
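In case anyone wants to reproduce this analysis, stack samples like the one above can be captured while the benchmark is running, either through the nodes hot threads API or with a full thread dump from inside one of the containers. The container name and jstack path below are assumptions about a typical docker setup:

```bash
# Sample the busiest threads on every node via the hot threads API.
curl -s "http://localhost:9200/_nodes/hot_threads?threads=5"

# Or take a full thread dump from inside one of the containers
# (container name and the bundled-JDK path are assumptions).
docker exec es-node-0 /usr/share/elasticsearch/jdk/bin/jstack 1 > node-0-threads.txt
```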