Slow terms aggregation speed on ~130M documents

So you are using doc_values here, which means that the OS filesystem cache will mostly be used in that case.

It can obviously take some time to read the data the first time, but you will most likely benefit from the cache afterwards.
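For reference, the keyword sub-field you aggregate on has doc_values enabled by default, so there is nothing to change in the mapping for that; spelled out explicitly it would just be (field name taken from your mapping below):

"untouched": {
    "type": "keyword",
    "doc_values": true
}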

If I'm not mistaken, the cold OS-level cache is what slows down the search on the first try? So in a way, we are sacrificing the speed of the first search for the performance of subsequent searches? If that's the case, then I agree it's a good tradeoff.

Back to the main question, will preloading fielddata negate this first load problem or is setting up a CRON job the only viable solution?

Yes.

Again, you are not using fielddata here, right? If your question is "Will running some queries before the users do be helpful?", the answer is yes.
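For example, a scheduled job could simply run the same kind of terms aggregation your users will run, so the relevant data is already sitting in the filesystem cache. A minimal sketch (the index name and aggregation name are placeholders, the field path is taken from your mapping below):

POST my_index/_search
{
    "size": 0,
    "aggs": {
        "consignee_names": {
            "terms": {
                "field": "Consignee.Name.untouched"
            }
        }
    }
}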

What I mean is that if we add:

"fielddata": {
    "loading": "eager"
}

to the keywords we need suggestions on, would it help lessen the load times?

e.g.:

from

"Consignee":{  
    "properties":{
        "Name":{  
            "type":"text",
            "fields":{  
                "untouched":{  
                    "type":"keyword"
                }
            }
        }
    }
}

to

"Consignee":{  
    "properties":{
        "Name":{  
            "type":"text",
            "fields":{  
                "untouched":{  
                    "type":"keyword"
                },
                "fielddata": {
                    "loading": "eager"
                }
            }
        }
    }
}

You might want to look at https://www.elastic.co/guide/en/elasticsearch/reference/6.2/eager-global-ordinals.html
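In 6.x the eager loading setting goes on the keyword sub-field itself rather than under fielddata; if I read that page correctly, it would look something like:

"Consignee":{
    "properties":{
        "Name":{
            "type":"text",
            "fields":{
                "untouched":{
                    "type":"keyword",
                    "eager_global_ordinals": true
                }
            }
        }
    }
}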

This can slow down your indexing process but may make searches faster.

I'm sure @mvg might have a lot of other ideas... :smiley:

If you have a small set of results, try an execution hint of "map": https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint
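Something along these lines (the index name and aggregation name are placeholders, the field path is from the mapping above):

POST my_index/_search
{
    "size": 0,
    "aggs": {
        "consignee_names": {
            "terms": {
                "field": "Consignee.Name.untouched",
                "execution_hint": "map"
            }
        }
    }
}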


Hi Mark, I haven't done any proper benchmarking yet, but so far it has been faster for my queries. What is the reason for the speed increase, and what are the downsides? Looking at the docs, it seems that queries with many results will be affected by the map execution hint. However, some of my queries return around a million results with no visible degradation compared to the default global_ordinals execution. What is considered a high-result query?

Also, thank you @dadoonet. Talking to you was a great learning experience.

Global ordinals execution mode aggregates information using numbers that each represent unique strings in the index. This means the memory overhead for labelling buckets is smaller than the length of the strings they replace but there's an overhead in the conversion.

map execution tags the buckets being aggregated directly with the indexed value, which could be long in some cases (imagine URL strings). If your query matches a lot of docs, that's a lot of strings to hold in memory before they are pruned down to your top-N selection.

Can you say how much faster?

We shut down the ES servers after office hours, so I can't run queries right now to verify, but earlier the normal searches were taking a little over 4 seconds while the ones with the map execution hint took around 3.5 seconds, and some of the lower-result queries took 2 seconds or less. I'd have to perform a proper benchmark tomorrow and get back to you.

That makes a lot of sense, thank you for the explanation. If I understand correctly, @dadoonet's recommendation would then negate the global ordinal conversion overhead since it would eagerly build the global 'hashtable'?

If you are only using the map execution hint on the field, there is no need for global ordinals to be built, so any "eager" loading would be wasted effort.

Yes, but would using global ordinals with eager loading be the better choice rather than the map execution hint?

On a static index, yes; otherwise you are adding cost to every refresh when new content is added.


I see, it is indicated in the docs as well. We would like to prioritize search speed over indexing speed. If worst comes to worst, we could batch the document updates into a single bulk request when there is low user activity, so there would be no refreshes during the day.
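A rough sketch of what that off-hours bulk request could look like (the index name, type, IDs and field values are just placeholders):

POST _bulk
{ "update": { "_index": "my_index", "_type": "_doc", "_id": "1" } }
{ "doc": { "Consignee": { "Name": "Some Company" } } }
{ "update": { "_index": "my_index", "_type": "_doc", "_id": "2" } }
{ "doc": { "Consignee": { "Name": "Another Company" } } }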

Thank you very much for the insights; I think you have already covered most, if not all, of my concerns. It has been a pleasure talking to you guys. Best of luck to the Elasticsearch team.



If the field you are aggregating on is high cardinality, then the slow response may be due to the computation of global ordinals (as @Mark_Harwood mentioned above). For future reference, the following blog post gives an overview of the underlying cause of the slowness and several approaches to addressing it.