Terms query with a large number of items increases latency with the number of shards

For requests with a very high terms count, Elasticsearch takes multiple seconds to return an HTTP 400 error. We had an issue with a production system that produced a query with 250K items in a terms query, and it took ~16 seconds to return the HTTP 400. I would expect this validation error to be handled quickly.

Trying to replicate this locally, I found that a single instance handled this type of request quickly. But when running a three-node Docker cluster, the time Elasticsearch took to handle the request increased by orders of magnitude as the shard count increased.

I documented the results in this repo: https://github.com/nickcanz/es-terms-benchmarking, but I'm replicating those tables here.
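For reference, a request of this shape can be reproduced with something like the sketch below. This is only an approximation of the repo's make_query.sh (the index name, field name, and term values here are made up for illustration); the HTTP 400 presumably comes from the terms count exceeding index.max_terms_count, which defaults to 65536.

```bash
# Sketch only -- the actual make_query.sh in the repo above may differ.
SHARD_COUNT="${SHARD_COUNT:-1}"
ES="${ES:-http://localhost:9200}"

# Recreate an empty index with the requested number of primary shards
# and a keyword field to query against.
curl -s -X DELETE "$ES/terms-bench" > /dev/null
curl -s -X PUT "$ES/terms-bench" -H 'Content-Type: application/json' -d "{
  \"settings\": { \"number_of_shards\": $SHARD_COUNT, \"number_of_replicas\": 0 },
  \"mappings\": { \"properties\": { \"some_field\": { \"type\": \"keyword\" } } }
}" > /dev/null

# Build a terms query with 200k values. The body is several megabytes,
# so write it to a file rather than passing it as a shell argument.
{
  printf '{ "query": { "terms": { "some_field": [ '
  seq 1 200000 | sed 's/.*/"term-&"/' | paste -sd, -
  printf ' ] } } }'
} > /tmp/terms_query.json

# Time the request and report the status code (expected: 400).
curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  -X POST "$ES/terms-bench/_search" -H 'Content-Type: application/json' \
  --data-binary @/tmp/terms_query.json
```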

A single Docker instance handling a query with 200k terms shows roughly even performance across all shard counts.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| SHARD_COUNT=1 bash ./make_query.sh | 352.3 ± 36.6 | 316.0 | 399.4 | 1.23 ± 0.14 |
| SHARD_COUNT=5 bash ./make_query.sh | 352.9 ± 53.0 | 292.5 | 412.6 | 1.24 ± 0.19 |
| SHARD_COUNT=10 bash ./make_query.sh | 330.6 ± 83.8 | 257.8 | 474.5 | 1.16 ± 0.30 |
| SHARD_COUNT=25 bash ./make_query.sh | 291.3 ± 22.6 | 263.8 | 323.2 | 1.02 ± 0.09 |
| SHARD_COUNT=50 bash ./make_query.sh | 302.9 ± 35.8 | 271.8 | 364.6 | 1.06 ± 0.13 |
| SHARD_COUNT=100 bash ./make_query.sh | 346.9 ± 102.2 | 277.9 | 527.9 | 1.22 ± 0.36 |
| SHARD_COUNT=150 bash ./make_query.sh | 285.3 ± 11.3 | 272.3 | 301.8 | 1.00 |
| SHARD_COUNT=200 bash ./make_query.sh | 299.4 ± 15.9 | 290.2 | 327.7 | 1.05 ± 0.07 |

But on a three-node cluster, times increase sharply as the shard count increases. This is all against an empty index, just to return a validation error:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| SHARD_COUNT=1 bash ./make_query.sh | 393.2 ± 96.9 | 318.8 | 560.8 | 1.00 |
| SHARD_COUNT=5 bash ./make_query.sh | 500.3 ± 63.6 | 421.9 | 580.3 | 1.27 ± 0.35 |
| SHARD_COUNT=10 bash ./make_query.sh | 729.3 ± 107.9 | 603.1 | 871.1 | 1.86 ± 0.53 |
| SHARD_COUNT=25 bash ./make_query.sh | 1110.2 ± 83.0 | 1059.1 | 1255.5 | 2.82 ± 0.73 |
| SHARD_COUNT=50 bash ./make_query.sh | 1763.3 ± 110.9 | 1574.8 | 1868.9 | 4.49 ± 1.14 |
| SHARD_COUNT=100 bash ./make_query.sh | 3552.0 ± 148.6 | 3374.9 | 3785.6 | 9.03 ± 2.26 |
| SHARD_COUNT=150 bash ./make_query.sh | 10446.9 ± 485.8 | 9726.2 | 10856.2 | 26.57 ± 6.67 |
| SHARD_COUNT=200 bash ./make_query.sh | 13380.0 ± 326.5 | 13130.7 | 13933.6 | 34.03 ± 8.43 |
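(The table format above looks like hyperfine's markdown export; if that's how the numbers were gathered, the sweep can be reproduced with something along the lines of the sketch below, assuming hyperfine is installed and the cluster is already running. The repo's own tooling may differ.)

```bash
# Hypothetical reproduction of the shard-count sweep with hyperfine.
hyperfine \
  --parameter-list SHARD_COUNT 1,5,10,25,50,100,150,200 \
  --export-markdown results.md \
  'SHARD_COUNT={SHARD_COUNT} bash ./make_query.sh'
```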

Even with an extreme example like this, I would not expect Elasticsearch to take 13 seconds against an empty index to return a validation error. I'm happy to dig into this more; are there any areas of the code base where you would start looking?


This is a really interesting topic; thanks for raising it and including all the extra info you did.
There are quite a few people on leave at the moment, so chances are it may not get attention until next week. :slight_smile:

@warkolm would you recommend opening a GitHub issue in the elasticsearch repo, or leaving this as a discussion?

I ran the provided benchmarks and captured stack dumps, and saw it spending quite a bit of its time sending can-match requests; stacks like this one:

        at org.elasticsearch.common.io.stream.BytesStreamOutput.writeByte(BytesStreamOutput.java:78)
        at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:520)
        at org.elasticsearch.transport.CompressibleBytesOutputStream.writeByte(CompressibleBytesOutputStream.java:81)
        at org.elasticsearch.common.io.stream.StreamOutput.lambda$static$25(StreamOutput.java:777)
        at org.elasticsearch.common.io.stream.StreamOutput$$Lambda$1308/0x000000080108bc40.write(Unknown Source)
        at org.elasticsearch.common.io.stream.StreamOutput.writeGenericValue(StreamOutput.java:846)
        at org.elasticsearch.common.io.stream.StreamOutput.lambda$static$12(StreamOutput.java:709)
        at org.elasticsearch.common.io.stream.StreamOutput$$Lambda$1295/0x0000000801088840.write(Unknown Source)
        at org.elasticsearch.common.io.stream.StreamOutput.writeGenericValue(StreamOutput.java:846)
        at org.elasticsearch.index.query.TermsQueryBuilder.doWriteTo(TermsQueryBuilder.java:198)
        at org.elasticsearch.index.query.AbstractQueryBuilder.writeTo(AbstractQueryBuilder.java:79)
        at org.elasticsearch.common.io.stream.StreamOutput.writeNamedWriteable(StreamOutput.java:1115)
        at org.elasticsearch.index.query.AbstractQueryBuilder.writeQueries(AbstractQueryBuilder.java:256)
        at org.elasticsearch.index.query.BoolQueryBuilder.doWriteTo(BoolQueryBuilder.java:103)
        at org.elasticsearch.index.query.AbstractQueryBuilder.writeTo(AbstractQueryBuilder.java:79)
        at org.elasticsearch.common.io.stream.StreamOutput.writeNamedWriteable(StreamOutput.java:1115)
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalNamedWriteable(StreamOutput.java:1126)
        at org.elasticsearch.search.builder.SearchSourceBuilder.writeTo(SearchSourceBuilder.java:301)
        at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:976)
        at org.elasticsearch.search.internal.ShardSearchRequest.innerWriteTo(ShardSearchRequest.java:266)
        at org.elasticsearch.search.internal.ShardSearchRequest.writeTo(ShardSearchRequest.java:255)
        at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:87)
        at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:64)
        at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:163)
        at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:149)
        at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:196)
        at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:130)
        at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:125)
        at org.elasticsearch.transport.OutboundHandler.sendRequest(OutboundHandler.java:89)
        at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:270)
        at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:747)
        at org.elasticsearch.transport.TransportService$$Lambda$3621/0x000000080164f040.sendRequest(Unknown Source)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.sendWithUser(SecurityServerTransportInterceptor.java:162)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor.access$300(SecurityServerTransportInterceptor.java:53)
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$1.sendRequest(SecurityServerTransportInterceptor.java:129)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:661)
        at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:712)
        at org.elasticsearch.action.search.SearchTransportService.sendCanMatch(SearchTransportService.java:116)
        at org.elasticsearch.action.search.CanMatchPreFilterSearchPhase.executePhaseOnShard(CanMatchPreFilterSearchPhase.java:88)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:244)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction$$Lambda$5672/0x0000000801b04440.run(Unknown Source)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:279)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:215)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:365)
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:182)
        at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:685)
        at org.elasticsearch.action.search.TransportSearchAction.executeLocalSearch(TransportSearchAction.java:511)
        at org.elasticsearch.action.search.TransportSearchAction.lambda$executeRequest$3(TransportSearchAction.java:303)
        at org.elasticsearch.action.search.TransportSearchAction$$Lambda$5629/0x0000000801af8c40.accept(Unknown Source)
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63)
        at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:114)
        at org.elasticsearch.index.query.Rewriteable.rewriteAndFetch(Rewriteable.java:87)
        at org.elasticsearch.action.search.TransportSearchAction.executeRequest(TransportSearchAction.java:346)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:229)
        at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:105)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:179)

There was also a warning logged about spending too much time on the transport thread:

[2021-01-05T18:34:03,672][WARN ][o.e.t.InboundHandler     ] [node-0] handling inbound transport message [InboundMessage{Header{56}{7.10.1}{3369}{false}{false}{false}{false}{NO_ACTION_NAME_FOR_RESPONSES}}] took [5024ms] which is above the warn threshold of [5000ms]
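For anyone who wants to see this without attaching a profiler, the nodes hot threads API shows a similar picture while a slow request is in flight. A rough sketch, assuming the benchmark repo's make_query.sh and a local cluster on port 9200:

```bash
# Fire a slow 200-shard query in the background, then capture hot threads on
# all nodes while the search is running, to see where the time is being spent.
SHARD_COUNT=200 bash ./make_query.sh &
sleep 2
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5' > hot_threads.txt
wait
```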

I think it's worth reporting as a bug, thanks.

