'_source' filtering is slower than query without '_source' field

nadeem.akhter · May 16, 2023, 10:45am

I have an Elasticsearch instance with some data on it. When I run queries against it, a query that filters '_source' is slower than one that omits the '_source' key entirely. Is there a specific reason for this?

Profiling the queries shows that FetchSourcePhase takes much longer with source filtering, on the order of seconds, than without '_source'.

nadeem.akhter · May 22, 2023, 6:17am

@stephenb Could you take a look at this? I am using Elasticsearch 8.6.0.

Christian_Dahlqvist · May 22, 2023, 7:40am

This forum is manned by volunteers, even if they work at Elastic, so it is considered rude to ping people not already involved in the thread. You have also provided very little information to go on. Please provide the two queries you are comparing together with some information about the difference in latency, the nature and size of the data queried and information about the cluster itself. It would also be useful if you could post the profiling information from both queries.

nadeem.akhter · May 24, 2023, 7:25am

Thank you for the information regarding the thread.

As for the additional information, take a simple query with size set to 100:

{
    "query": {"bool": {"must": [{"match_all": {}}], "must_not": [], "should": []}},
    "from": 0,
    "size": 100,
    "sort": [],
    "aggs": {},
    "_source": [
        "field1",
        "field2",
        "field3",
        "field4",
        "field5",
        "field6",
        "field7",
        "field8",
        "field9",
        "field10",
        "field11",
        "field12",
        "field13",
        "field14",
        "field15",
        "field16"
    ]
}

Elasticsearch reports that this query takes more than a second, approximately 1200-1300 ms. The profile information for this query is:

{
    "shards": [
        {
            "id": "[shard_id][sample_index][0]",
            "searches": [
                {
                    "query": [
                        {
                            "type": "ConstantScoreQuery",
                            "description": "ConstantScore(FieldExistsQuery [field=_primary_term])",
                            "time_in_nanos": 194619,
                            "breakdown": {
                                "set_min_competitive_score_count": 0,
                                "match_count": 0,
                                "shallow_advance_count": 0,
                                "set_min_competitive_score": 0,
                                "next_doc": 85247,
                                "match": 0,
                                "next_doc_count": 558,
                                "score_count": 558,
                                "compute_max_score_count": 0,
                                "compute_max_score": 0,
                                "advance": 9136,
                                "advance_count": 10,
                                "score": 23042,
                                "build_scorer_count": 20,
                                "create_weight": 4883,
                                "shallow_advance": 0,
                                "create_weight_count": 1,
                                "build_scorer": 72311
                            },
                            "children": [
                                {
                                    "type": "FieldExistsQuery",
                                    "description": "FieldExistsQuery [field=_primary_term]",
                                    "time_in_nanos": 94503,
                                    "breakdown": {
                                        "set_min_competitive_score_count": 0,
                                        "match_count": 0,
                                        "shallow_advance_count": 0,
                                        "set_min_competitive_score": 0,
                                        "next_doc": 38538,
                                        "match": 0,
                                        "next_doc_count": 558,
                                        "score_count": 0,
                                        "compute_max_score_count": 0,
                                        "compute_max_score": 0,
                                        "advance": 8169,
                                        "advance_count": 10,
                                        "score": 0,
                                        "build_scorer_count": 20,
                                        "create_weight": 1730,
                                        "shallow_advance": 0,
                                        "create_weight_count": 1,
                                        "build_scorer": 46066
                                    }
                                }
                            ]
                        }
                    ],
                    "rewrite_time": 74439,
                    "collector": [
                        {
                            "name": "MultiCollector",
                            "reason": "search_multi",
                            "time_in_nanos": 297006,
                            "children": [
                                {
                                    "name": "SimpleTopScoreDocCollector",
                                    "reason": "search_top_hits",
                                    "time_in_nanos": 93227
                                },
                                {
                                    "name": "BucketCollectorWrapper: [BucketCollectorWrapper[bucketCollector=org.elasticsearch.search.aggregations.BucketCollector$1@ID]]",
                                    "reason": "aggregation",
                                    "time_in_nanos": 39868
                                }
                            ]
                        }
                    ]
                }
            ],
            "aggregations": [],
            "fetch": {
                "type": "fetch",
                "description": "",
                "time_in_nanos": 1468543241,
                "breakdown": {
                    "load_stored_fields": 335731591,
                    "load_source": 597153,
                    "load_stored_fields_count": 100,
                    "next_reader_count": 4,
                    "load_source_count": 100,
                    "next_reader": 475473
                },
                "debug": {
                    "stored_fields": [
                        "_id",
                        "_routing",
                        "_source"
                    ]
                },
                "children": [
                    {
                        "type": "FetchSourcePhase",
                        "description": "",
                        "time_in_nanos": 1128076983,
                        "breakdown": {
                            "process_count": 100,
                            "process": 1128072393,
                            "next_reader": 4590,
                            "next_reader_count": 4
                        },
                        "debug": {
                            "fast_path": 0
                        }
                    },
                    {
                        "type": "StoredFieldsPhase",
                        "description": "",
                        "time_in_nanos": 2392441,
                        "breakdown": {
                            "process_count": 100,
                            "process": 2380853,
                            "next_reader": 11588,
                            "next_reader_count": 4
                        }
                    }
                ]
            }
        }
    ]
}

However, if we remove source filtering:

{
    "query": {"bool": {"must": [{"match_all": {}}], "must_not": [], "should": []}},
    "from": 0,
    "size": 100,
    "sort": [],
    "aggs": {}
}

The query takes only around 280-300 ms. The profile information for the query is:

{
    "shards": [
        {
            "id": "[shard_id][sample_index][0]",
            "searches": [
                {
                    "query": [
                        {
                            "type": "ConstantScoreQuery",
                            "description": "ConstantScore(FieldExistsQuery [field=_primary_term])",
                            "time_in_nanos": 339290,
                            "breakdown": {
                                "set_min_competitive_score_count": 0,
                                "match_count": 0,
                                "shallow_advance_count": 0,
                                "set_min_competitive_score": 0,
                                "next_doc": 161142,
                                "match": 0,
                                "next_doc_count": 558,
                                "score_count": 558,
                                "compute_max_score_count": 0,
                                "compute_max_score": 0,
                                "advance": 10104,
                                "advance_count": 10,
                                "score": 34467,
                                "build_scorer_count": 20,
                                "create_weight": 4705,
                                "shallow_advance": 0,
                                "create_weight_count": 1,
                                "build_scorer": 128872
                            },
                            "children": [
                                {
                                    "type": "FieldExistsQuery",
                                    "description": "FieldExistsQuery [field=_primary_term]",
                                    "time_in_nanos": 164376,
                                    "breakdown": {
                                        "set_min_competitive_score_count": 0,
                                        "match_count": 0,
                                        "shallow_advance_count": 0,
                                        "set_min_competitive_score": 0,
                                        "next_doc": 92178,
                                        "match": 0,
                                        "next_doc_count": 558,
                                        "score_count": 0,
                                        "compute_max_score_count": 0,
                                        "compute_max_score": 0,
                                        "advance": 8662,
                                        "advance_count": 10,
                                        "score": 0,
                                        "build_scorer_count": 20,
                                        "create_weight": 1497,
                                        "shallow_advance": 0,
                                        "create_weight_count": 1,
                                        "build_scorer": 62039
                                    }
                                }
                            ]
                        }
                    ],
                    "rewrite_time": 63997,
                    "collector": [
                        {
                            "name": "MultiCollector",
                            "reason": "search_multi",
                            "time_in_nanos": 461468,
                            "children": [
                                {
                                    "name": "SimpleTopScoreDocCollector",
                                    "reason": "search_top_hits",
                                    "time_in_nanos": 144301
                                },
                                {
                                    "name": "BucketCollectorWrapper: [BucketCollectorWrapper[bucketCollector=org.elasticsearch.search.aggregations.BucketCollector$1@ID]]",
                                    "reason": "aggregation",
                                    "time_in_nanos": 79751
                                }
                            ]
                        }
                    ]
                }
            ],
            "aggregations": [],
            "fetch": {
                "type": "fetch",
                "description": "",
                "time_in_nanos": 284016497,
                "breakdown": {
                    "load_stored_fields": 282148220,
                    "load_source": 150140,
                    "load_stored_fields_count": 100,
                    "next_reader_count": 4,
                    "load_source_count": 100,
                    "next_reader": 435831
                },
                "debug": {
                    "stored_fields": [
                        "_id",
                        "_routing",
                        "_source"
                    ]
                },
                "children": [
                    {
                        "type": "FetchSourcePhase",
                        "description": "",
                        "time_in_nanos": 232649,
                        "breakdown": {
                            "process_count": 100,
                            "process": 228559,
                            "next_reader": 4090,
                            "next_reader_count": 4
                        },
                        "debug": {
                            "fast_path": 100
                        }
                    },
                    {
                        "type": "StoredFieldsPhase",
                        "description": "",
                        "time_in_nanos": 617691,
                        "breakdown": {
                            "process_count": 100,
                            "process": 607069,
                            "next_reader": 10622,
                            "next_reader_count": 4
                        }
                    }
                ]
            }
        }
    ]
}
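
To put the two profiles side by side, the fetch timings above can be converted from nanoseconds to milliseconds. The sketch below is only arithmetic on numbers copied from the two profiles; the dictionary layout and the `summarize` helper are illustrative names, not part of any Elasticsearch API.

```python
NANOS_PER_MILLI = 1_000_000

# Values copied from the "fetch" sections of the two profiles above (nanoseconds).
profiles = {
    "with _source filter": {
        "fetch": 1468543241,
        "load_stored_fields": 335731591,
        "FetchSourcePhase": 1128076983,
    },
    "without _source": {
        "fetch": 284016497,
        "load_stored_fields": 282148220,
        "FetchSourcePhase": 232649,
    },
}


def to_millis(nanos):
    """Convert a time_in_nanos value to milliseconds."""
    return nanos / NANOS_PER_MILLI


def summarize(profile, hits=100):
    """Summarize the fetch phase in milliseconds; hits is the query's size (100 here)."""
    source_ms = to_millis(profile["FetchSourcePhase"])
    return {
        "fetch_ms": round(to_millis(profile["fetch"]), 1),
        "load_stored_fields_ms": round(to_millis(profile["load_stored_fields"]), 1),
        "source_phase_ms": round(source_ms, 3),
        "source_phase_ms_per_hit": round(source_ms / hits, 3),
    }


for label, profile in profiles.items():
    print(label, summarize(profile))
```

The `load_stored_fields` time is similar in both runs, while `FetchSourcePhase` accounts for most of the difference. The `fast_path` debug counter is 0 with filtering and 100 without.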

I have been using Elasticsearch 8.6.0.

The number of documents in the index is 560 and each document is around 550 kilobytes. It has 1 shard and 1 replica.

The JVM heap size of the cluster is 4 GiB and the memory of the cluster is 8 GiB with 4 allocated processors, in a Linux environment (Ubuntu 20.04.5 LTS).

The data mapping consists of mostly text fields and a few vector fields. There is one nested data type with various text subfields and a few numeric types, (long and float).

If some more information is needed, please let me know.

Christian_Dahlqvist · May 24, 2023, 7:28am

The query that is faster does a lot less work, as it does not need to parse and extract fields from the source, so I would expect this to be faster. The first query needs to parse the documents and extract 16 fields, which, given that your documents are quite large, will require a lot of extra work. As you have a single primary shard, all of this work is done in a single thread, which is why I suspect you are seeing the difference in latency.

If you always want to retrieve the same set of fields you might want to look into using stored fields, which would avoid parsing the source. I am not sure it will be faster, but it could be worth testing.
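
As a rough sketch of the stored-fields idea: fields are marked with `"store": true` in the mapping and then requested with `stored_fields` in the search body. The helper names, the `text` field type, and the field list below are assumptions for illustration; the bodies are only built and printed here, not sent to a cluster. As far as I know, `store` has to be set when the field is created, so testing this would need a new index.

```python
import json


def build_stored_mapping(field_names, field_type="text"):
    """Index-creation body that stores each listed field separately from _source.

    field_type="text" is an assumption; use whatever type each field really has.
    """
    return {
        "mappings": {
            "properties": {
                name: {"type": field_type, "store": True} for name in field_names
            }
        }
    }


def build_stored_fields_query(field_names, size=100):
    """Search body that retrieves the listed stored fields."""
    return {
        "query": {"match_all": {}},
        "from": 0,
        "size": size,
        "stored_fields": list(field_names),
    }


wanted = [f"field{i}" for i in range(1, 17)]  # the 16 fields from the question
mapping_body = build_stored_mapping(wanted)
search_body = build_stored_fields_query(wanted)

print(json.dumps(mapping_body["mappings"]["properties"]["field1"]))
print(json.dumps(search_body, indent=4))
```

Whether this beats source filtering for documents of this size would have to be measured on the real index.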

nadeem.akhter · May 24, 2023, 7:41am

Thank you for your answer,

For another example, I have been using a single field in '_source'.

{
    "query": {"bool": {"must": [{"match_all": {}}], "must_not": [], "should": []}},
    "from": 0,
    "size": 100,
    "sort": [],
    "aggs": {},
    "_source": ["field1"]
}

This query takes more than 800 milliseconds to finish. Running the same query without '_source', as before:

{
    "query": {"bool": {"must": [{"match_all": {}}], "must_not": [], "should": []}},
    "from": 0,
    "size": 100,
    "sort": [],
    "aggs": {}
}

The time taken is the same as mentioned before, 280-300 milliseconds.

Is the reason for such a large difference in the 'took' times the same even when filtering on a single field?

Christian_Dahlqvist · May 24, 2023, 7:43am

The source is stored as a string and needs to be parsed before any field can be extracted. This naturally takes longer than just returning the string from disk, especially if your documents are large.
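
To make the cost difference concrete, here is a small stand-alone Python sketch, not Elasticsearch's actual implementation. It builds a synthetic JSON document of roughly 550 KB (the size mentioned in the question) and compares handing back the raw string with parsing it, keeping one field, and re-serializing. The document shape and all function names are made up for illustration.

```python
import json
import time


def make_document(n_fields=16, payload_bytes=550_000):
    """Build a synthetic JSON string of roughly payload_bytes (hypothetical shape)."""
    chunk = "x" * (payload_bytes // n_fields)
    return json.dumps({f"field{i}": chunk for i in range(1, n_fields + 1)})


def return_raw(raw):
    # No source filtering: the stored string is handed back untouched.
    return raw


def filter_source(raw, includes):
    # Source filtering: parse the whole document, keep the requested keys,
    # then serialize the reduced document again.
    parsed = json.loads(raw)
    return json.dumps({k: v for k, v in parsed.items() if k in includes})


def seconds_per_call(fn, *args, repeats=20):
    """Average wall-clock time of fn(*args) over a few repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


raw_doc = make_document()
filtered = filter_source(raw_doc, {"field1"})

raw_us = seconds_per_call(return_raw, raw_doc) * 1e6
filtered_us = seconds_per_call(filter_source, raw_doc, {"field1"}) * 1e6
print(f"raw:      {raw_us:.1f} us per document")
print(f"filtered: {filtered_us:.1f} us per document")
```

The absolute numbers depend on the machine and say nothing about Elasticsearch's own parser; the point is only that the filtered path has to touch the whole document even when a single field is requested.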

This is described in the docs I linked to in my earlier response.

stephenb · May 24, 2023, 2:17pm

What are you actually trying to accomplish? It's not clear to me.

Source filtering, as @Christian_Dahlqvist says, requires significant additional processing.

Are you aware of the fields filter, which is most likely much faster?
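
For reference, a minimal sketch of what a request using the `fields` option might look like, with `_source` switched off so the full source is not returned. The helper name is hypothetical and the body is only built and printed here, not sent to a cluster.

```python
import json


def build_fields_query(field_names, size=100):
    """Search body that requests values through `fields` instead of `_source` filtering."""
    return {
        "query": {"match_all": {}},
        "from": 0,
        "size": size,
        "_source": False,  # do not return the stored source at all
        "fields": list(field_names),
    }


body = build_fields_query(["field1"])
print(json.dumps(body, indent=4))
```

Values requested this way come back under each hit's `fields` key as arrays rather than under `_source`. Whether it is actually faster for these documents would need to be measured.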

system · June 21, 2023, 2:17pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
