Is Elasticsearch suitable for efficient retrieval of a large number of docs?

How do we use ES?
Our team is using ES to index ads metadata. When we need to show a user ads, we query ES for a list of ads that satisfy certain criteria (e.g. geo, query, campaign date range). After fetching the list of candidate ad ids (UUIDs), we perform various hydration steps on top of this list.
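
A simplified sketch of this retrieval step (the field names, index name and client wiring here are illustrative, not our exact schema):

val sourceBuilder = SearchSourceBuilder()
    .query(
        QueryBuilders.boolQuery()
            .filter(QueryBuilders.termQuery("geo", "US"))                  // illustrative filter
            .filter(QueryBuilders.rangeQuery("campaign_start").lte("now")) // illustrative filter
            .filter(QueryBuilders.rangeQuery("campaign_end").gte("now"))   // illustrative filter
    )
    .fetchSource(false)   // we only need the ad ids, not the full _source
    .size(1000)

val response = client.search(SearchRequest("ads_index").source(sourceBuilder), RequestOptions.DEFAULT)
val candidateAdIds: List<String> = response.hits.hits.map { it.id }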

How many docs do we retrieve?
Some popular queries can hit thousands of potential ads. It is not possible for us to simply choose, say, the top 100, because our post-processing has to run various checks and eventually an RTB auction on top of the candidate ads.

What are the challenges?
We have a cluster of 18 data nodes. Our index is around 13GB, with 1 primary shard + 17 replica shards. We are not able to retrieve 1000 docs in under 1s. Even when we reduce our query to its simplest form (i.e. a simple term filter query on an ad keyword), we don't see any performance improvement. When we enable the slow query log, we can see that the fetch phase takes the majority of the time.
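
For reference, the slow log can be enabled with index-level threshold settings along these lines (a sketch assuming the standard high-level REST client; the index name and thresholds are illustrative):

// Illustrative thresholds; "ads_index" stands in for our real index name.
val slowlogSettings = UpdateSettingsRequest("ads_index").settings(
    Settings.builder()
        .put("index.search.slowlog.threshold.query.warn", "500ms")
        .put("index.search.slowlog.threshold.fetch.warn", "500ms")
        .build()
)
client.indices().putSettings(slowlogSettings, RequestOptions.DEFAULT)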

We would like to know if there is any optimization we can perform to make the retrieval faster.

Which version of Elasticsearch are you using?

What is the specification of the nodes in terms of CPU, RAM, heap size and storage type?

Do you have any other data in the cluster that could affect performance?

How fast is your query if you retrieve a smaller number of results, e.g. 100?

How large are your documents on average?

Are the indices relatively static, or are you continuously indexing and updating data?

Version: 7.10
Nodes: r6g.8xlarge.search
Storage: EBS provisioned IOPS (SSD)

CPU utilization is < 10%. JVM memory pressure ranges between 10% (min) and 80% (max). System memory utilization is 20%.

The cluster only has this index. No other data should affect performance.

A small query size of 100 takes 200ms. (A size of 1000 actually takes > 1s; corrected in the original post.)

Document size is, I would guess, about 100-150KB. But our query does not return the source, only the ids.

Our indexing rate is minimal, less than 5 QPS.

Also, we realized that once the query size is larger than 1000, the search thread pool starts to queue up and performance basically slows to a crawl.
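
One way to watch that queuing (a sketch assuming the standard high-level REST client and its low-level client) is to poll the _cat thread pool stats:

// Polls active/queued/rejected counts for the search thread pool on each node.
val catRequest = Request("GET", "/_cat/thread_pool/search")
catRequest.addParameter("v", "true")
catRequest.addParameter("h", "node_name,active,queue,rejected")
val catResponse = client.lowLevelClient.performRequest(catRequest)
println(EntityUtils.toString(catResponse.entity))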

It looks like you are using the AWS Elasticsearch/OpenSearch Service. If that is the case, it has third-party plugins installed that could affect performance, and we do not necessarily know how the cluster is configured.

Your nodes have enough RAM to keep the full index in the operating system file cache, so retrieval of data from disk should not be an issue. CPU and networking throughput also seem plentiful, so I see no obvious bottlenecks. Your cluster actually seems quite oversized, although I assume it is serving a large number of queries per second.

Given that you have quite sizeable documents and want to return a large number of them, it is possible that serialisation of the result set for the response might be slow, but I do not know how that would show up in stats or whether it could be impacted by third-party plugins. Returning 1000 results of 100kB each gives quite a large response size (100MB if I calculate correctly), so I suspect that might be the bottleneck if your numbers are correct.

You may want to try to reduce the size of the documents if possible: remove fields you are not querying or filtering on and query this slimmed-down index. If you need the full document to pass back, you might be able to store it in a separate index and run a direct GET request once you have determined which ones to return.
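
As a very rough sketch of that last option (the second index name, the id list and the client wiring are placeholders for whatever your pipeline uses), the lookup after post-processing could be a multi-get:

// "ads_full" is a hypothetical index holding the complete documents;
// winningIds would come from your own post-processing / auction step.
val mget = MultiGetRequest()
winningIds.forEach { id -> mget.add("ads_full", id) }

val fullDocs = client.mget(mget, RequestOptions.DEFAULT)
    .responses
    .mapNotNull { item -> item.response?.takeIf { it.isExists }?.sourceAsMap }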

OpenSearch/OpenDistro are AWS-run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

Got it. We do have a large number of plugins (> 1000) loaded.

Out of curiosity, given that our query does not return the entire _source (but only certain fields), would serialization still play a role in our performance issue? Or does serialization of the entire doc happen regardless of which fields we choose to return?

Whether the full document needs to be parsed or not depends on how you specify the fields to be returned and how these are mapped (see docs). This is something I think has changed in more recent versions, and I do not remember exactly how 7.10 works as it is quite old and EOL.

If there is always a fixed set of fields you return, you may want to try using stored fields.

Great. Thanks for the pointers. Will look into stored fields.

One of the other approaches described in the docs may also work, but try to avoid having to deserialise the full source.

Please let us know how it goes.

After running some quick benchmarks, here are the results.

  • Indexed 2000 docs (a sketch of how this data set can be generated is included after this list)
  • Each doc has
    • field1 - field3: each a single random UUID keyword
    • field4: a fixed keyword foobar
    • field5 - field8: each a list of 1000 random UUID keywords
  • Each query is a term filter on field4 with a size of 500
  • Fields to return: field1 - field3
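
For completeness, a rough sketch of how this test data set can be generated and indexed (the bulk batching and client wiring are just illustrative):

// Builds the 2000 test docs described above and indexes them into test_index in bulk batches.
(0 until 2000).chunked(200).forEach { batch ->
    val bulk = BulkRequest()
    batch.forEach { _ ->
        val doc = mutableMapOf<String, Any>(
            "field1" to UUID.randomUUID().toString(),
            "field2" to UUID.randomUUID().toString(),
            "field3" to UUID.randomUUID().toString(),
            "field4" to "foobar"
        )
        listOf("field5", "field6", "field7", "field8").forEach { field ->
            doc[field] = List(1000) { UUID.randomUUID().toString() }
        }
        bulk.add(IndexRequest("test_index").source(doc))
    }
    client.bulk(bulk, RequestOptions.DEFAULT)
}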

Three test scenarios (in Kotlin):

// Shared base query: term filter on field4, size 500, 2s timeout. SearchSourceBuilder is
// mutable, so each test below adds its own fetch options to the instance passed in.
val sourceBuilderTemplate = SearchSourceBuilder()
        .query(
            QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("field4", "foobar"))
        )
        .size(500)
        .timeout(TimeValue.timeValueMillis(2000))

// Test 1: return field1 - field3 via _source filtering.
private fun searchTest1(sourceBuilderTemplate: SearchSourceBuilder): SearchResponse {
    val sourceBuilder = sourceBuilderTemplate
        .fetchSource(
            FetchSourceContext(
                true,
                arrayOf("field1", "field2", "field3"),
                emptyArray()
            )
        )

    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}

// Test 2: return field1 - field3 as doc value fields, with _source disabled.
private fun searchTest2(sourceBuilderTemplate: SearchSourceBuilder): SearchResponse {
    val sourceBuilder = sourceBuilderTemplate
        .docValueField("field1")
        .docValueField("field2")
        .docValueField("field3")
        .fetchSource(
            FetchSourceContext(
                false
            )
        )

    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}

// Test 3: return field1 - field3 as stored fields, with _source disabled.
private fun searchTest3(sourceBuilderTemplate: SearchSourceBuilder): SearchResponse {
    val sourceBuilder = sourceBuilderTemplate
        .storedFields(listOf("field1", "field2", "field3"))
        .fetchSource(
            FetchSourceContext(false)
        )

    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}

We ran each test case 100 times and took the average latency:

  1. searchTest1 - 967ms
  2. searchTest2 - 768ms
  3. searchTest3 - 883ms

It looks like we do see some performance improvement from both strategies. However, we have yet to get the latency down to something we can use in production.

That is surprising.

What is the mapping for the test index?

Can you spin up a standard Elasticsearch 7.17 on a single node with enough RAM to fit the full test index into the operating system page cache and rerun the test there for comparison?

The typical search consists of two phases of remote calls: the query phase followed by the fetch phase.
If you set the ‘hits’ size to zero, you eliminate the fetch phase.
If you add a ‘top_hits’ aggregation to the request, a selection of top matches will be gathered in the query phase.
This approach may be worth experimenting with. It may read more results overall (some shards may return docs that miss the final cut), but it reduces the number of round trips.

The mapping + settings:

{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": "17"
    },
    "mappings": {
        "dynamic": "false",
        "properties": {
            "field1": {
                "type": "keyword",
                "store": true
            },
            "field2": {
                "type": "keyword",
                "store": true
            },
            "field3": {
                "type": "keyword",
                "store": true
            },
            "field4": {
                "type": "keyword"
            },
            "field5": {
                "type": "keyword"
            },
            "field6": {
                "type": "keyword"
            },
            "field7": {
                "type": "keyword"
            },
            "field8": {
                "type": "keyword"
            }
        }
    }
}

I'll get back to the page cache testing later when I get a chance.

Regarding the query vs. fetch phases: our previous benchmarking (with a different index but a similar query pattern) showed that the fetch phase took the majority of the time.

As for the top_hits aggregation: in our use case, given that we are doing simple term filtering and don't calculate relevance, this might not work for us. In addition, we need all matching docs, not just the "top" ones.

Hit gathering is mostly the same as top_hits gathering but done in a different phase. Both can be used with or without relevance scoring.

Very interesting. Created a test case using top_hits:

// Test 4: size 0 skips the fetch phase; hits are gathered by a top_hits aggregation during the query phase.
private fun searchTest4(sourceBuilderTemplate: SearchSourceBuilder): SearchResponse {
    val sourceBuilder = sourceBuilderTemplate
        .size(0)
        .aggregation(
            AggregationBuilders
                .topHits("top_match")
                .fetchSource(
                    FetchSourceContext(
                        true,
                        arrayOf("field1", "field2", "field3"),
                        emptyArray()
                    )
                )
                .size(500)
        )
        .fetchSource(
            FetchSourceContext(false)
        )

    val request = SearchRequest("test_index").source(sourceBuilder)
    return client.search(request)
}

Able to get latency down to 194ms!

But given that I'm hitting the same hits over and over again, I'm very curious to see how this query performs in production. Will report back. Thanks!

UPDATE: Unfortunately, we don't see much difference with any of these strategies (including the aggregation one) when testing in production. My guess is that the test data is somehow highly cacheable, which is not the case in prod.
