How do we use ES?
Our team uses ES to index ads metadata. When we need to show a user ads, we query ES for a list of ads that satisfy certain criteria (e.g. geo, query terms, campaign date range, etc.). After fetching the list of candidate ad ids (UUIDs), we run various hydration steps on top of this list.
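For concreteness, the retrieval query has roughly this shape (field names here are illustrative, not our actual schema):

```json
POST /ads_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "geo": "US" } },
        { "range": { "campaign_end": { "gte": "now" } } }
      ]
    }
  },
  "_source": ["ad_id"],
  "size": 1000
}
```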
How many docs do we retrieve?
We have some popular queries that can match thousands of potential ads. For us, it is not possible to simply take, say, the top 100, because our post-processing has to run various checks and eventually an RTB auction on top of the candidate ads.
What are the challenges?
We have a cluster of 18 data nodes. Our index is around 13GB, with 1 primary shard and 17 replicas. We are not able to retrieve 1000 docs in under 1s. Even when we reduce our query to its simplest form (i.e. a single term filter on an ad keyword), we don't see any performance improvement. When we enable the slow query log, we can see that the fetch phase takes the majority of the time.
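For reference, the slow log thresholds we enabled look roughly like this (the threshold values here are illustrative):

```json
PUT /ads_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "500ms"
}
```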
We would like to know if there is any optimization we can perform to make the retrieval faster.
It looks like you are using the AWS Elasticsearch/OpenSearch Service. If so, the cluster has third-party plugins installed that could affect performance, and we do not necessarily know how it is configured.
Your nodes have enough RAM to keep the full index in the operating system file cache so retrieval of data from disk should not be an issue. CPU and networking throughput also seems plentiful, so I see no obvious bottlenecks. Your cluster actually seems quite oversized, although I assume it is serving a large number of queries per second.
Given that you have quite sizeable documents and want to return a large number of them, it is possible that serialisation of the result set for the response is slow, but I do not know how that would show up in stats or whether it could be impacted by third-party plugins. Returning 1000 results of 100kB each gives quite a large response (100MB if I calculate correctly), so I suspect that might be the bottleneck if your numbers are correct.
You may want to try to reduce the size of the documents if possible. Remove fields you are not querying or filtering on (if possible) and query this slimmed-down index. If you need the full document to pass back, you might be able to store it in a separate index and run a direct GET request once you have determined which one to return.
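A sketch of that two-step pattern, assuming a slim search index `ads_slim` and a full-document index `ads_full` (both names hypothetical): search the slim index for candidate ids, then fetch the winning document directly by id:

```json
GET /ads_slim/_search
{
  "query": { "term": { "geo": "US" } },
  "_source": ["ad_id"],
  "size": 1000
}

GET /ads_full/_doc/<winning_ad_id>
```

A direct GET by id skips the query phase entirely, so the cost of carrying the full document is paid only once, for the document you actually need.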
OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.
(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns.)
Got it. We do have a large number of plugins (> 1000) loaded.
Out of curiosity, given that our query does not return the entire _source (but only certain fields), would serialization still play a role in our performance issue? Or does serialization of the entire doc happen regardless of which fields we choose to return?
Whether the full document needs to be parsed or not depends on how you specify the fields to be returned and how they are mapped (see docs). This is something I believe has changed in more recent versions, and I do not remember exactly how 7.10 works, as it is quite old and EOL.
If you always return a fixed set of fields, you may want to try using stored fields.
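A minimal sketch of stored fields, using a keyword field as an example: mark the field `store: true` in the mapping, then request it via `stored_fields` so the fetch phase reads it directly instead of parsing the whole `_source`:

```json
PUT /test_index
{
  "mappings": {
    "properties": {
      "field1": { "type": "keyword", "store": true }
    }
  }
}

GET /test_index/_search
{
  "stored_fields": ["field1"],
  "_source": false,
  "query": { "term": { "field4": "foobar" } }
}
```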
After running some quick benchmarks, here are the results.
Indexed 2000 docs
Each doc has
field1 to field3: each a single random UUID keyword
field4: a fixed keyword, foobar
field5 to field8: each a list of 1000 random UUID keywords
Each query is a term filter on field4 with a size of 500
Fields to return: field1 to field3
Three test scenarios (in Kotlin). Note that SearchSourceBuilder is mutable, so each test builds a fresh instance instead of sharing one template (sharing one would let settings from an earlier test leak into the next):

private fun baseSourceBuilder(): SearchSourceBuilder = SearchSourceBuilder()
    .query(
        QueryBuilders.boolQuery()
            .filter(QueryBuilders.termQuery("field4", "foobar"))
    )
    .size(500)
    .timeout(TimeValue.timeValueMillis(2000))

// Scenario 1: source filtering - parse _source but return only field1-field3
private fun searchTest1(): SearchResponse {
    val sourceBuilder = baseSourceBuilder()
        .fetchSource(
            FetchSourceContext(
                true,
                arrayOf("field1", "field2", "field3"),
                emptyArray()
            )
        )
    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}

// Scenario 2: doc value fields - skip _source and read values from doc values
private fun searchTest2(): SearchResponse {
    val sourceBuilder = baseSourceBuilder()
        .docValueField("field1")
        .docValueField("field2")
        .docValueField("field3")
        .fetchSource(FetchSourceContext(false))
    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}

// Scenario 3: stored fields - read individually stored fields, skip _source
private fun searchTest3(): SearchResponse {
    val sourceBuilder = baseSourceBuilder()
        .storedFields(listOf("field1", "field2", "field3"))
        .fetchSource(FetchSourceContext(false))
    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}
We ran each test case 100 times and took the average latency:
searchTest1 - 967ms
searchTest2 - 768ms
searchTest3 - 883ms
Looks like we do see some performance improvement from both strategies. However, we have yet to get the latency down to something we can use in production.
Can you spin up a standard Elasticsearch 7.17 on a single node with enough RAM to fit the full test index into the operating system page cache and rerun the test there for comparison?
The typical search consists of two phases of remote calls - the query phase followed by the fetch phase.
If you set the ‘hits’ size to zero you eliminate the fetch phase.
If you add a ‘top_hits’ aggregation into the request a selection of top matches will be gathered in the query phase.
This approach may be worth experimenting with. It may read more results overall (some shards may return docs that miss the final cut) but it reduces the number of round trips.
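A sketch of that request shape, using the index and field names from the benchmark above: hits size zero plus a top_hits aggregation, so matches are gathered during the query phase:

```json
GET /test_index/_search
{
  "size": 0,
  "query": { "term": { "field4": "foobar" } },
  "aggs": {
    "selection": {
      "top_hits": {
        "size": 100,
        "_source": ["field1", "field2", "field3"]
      }
    }
  }
}
```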
I'll get back to the page cache testing later when I get a chance.
Regarding the query vs. fetch phases, our previous benchmarking (with a different index but a similar query pattern) showed that the fetch phase took the majority of the time.
In terms of using a top_hits aggregation: in our use case, given that we do simple term filtering and don't calculate relevance, this might not work for us. In addition, we need all matching docs, not just the "top" ones.
But given that I'm hitting the same docs over and over again in testing, I'm very curious to see how this query performs in production. Will report back. Thanks!
UPDATE - Unfortunately, we don't see much difference from any of these strategies (including the aggregation one) when testing in production. My guess is that the test data is somehow highly cacheable, which is not the case in prod.