Slow terms aggregation speed on ~130M documents

So you are using doc_values here, which means that the OS filesystem cache will mostly be used in that case.

It can obviously take some time to read the data the first time, but you will most likely benefit from the cache afterwards.
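For reference, the keyword sub-field you aggregate on has doc_values enabled by default, so there is nothing to change in the mapping for that; spelled out explicitly it would just be (field name taken from your mapping below):

"untouched": {
    "type": "keyword",
    "doc_values": true
}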

If I'm not mistaken, the cold OS-level cache is what slows down the search on the first try? So in a way, we are sacrificing the speed of the first search for the performance of subsequent searches? If that's the case, then I agree it's a good tradeoff.

Back to the main question, will preloading fielddata negate this first load problem or is setting up a CRON job the only viable solution?

Yes.

Again, you are not using fielddata here, right? If your question is "Will running some queries before the users do be helpful?", the answer is yes.
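For example, a scheduled job could simply run the same kind of terms aggregation your users will run, so the relevant data is already sitting in the filesystem cache. A minimal sketch (the index name and aggregation name are placeholders, the field path is taken from your mapping below):

POST my_index/_search
{
    "size": 0,
    "aggs": {
        "consignee_names": {
            "terms": {
                "field": "Consignee.Name.untouched"
            }
        }
    }
}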

What I mean is that if we add:

"fielddata": {
    "loading": "eager"
}

to the keywords we need suggestions on, would it help lessen the load times?

e.g.:

from

"Consignee":{  
    "properties":{
        "Name":{  
            "type":"text",
            "fields":{  
                "untouched":{  
                    "type":"keyword"
                }
            }
        }
    }
}

to

"Consignee":{  
    "properties":{
        "Name":{  
            "type":"text",
            "fields":{  
                "untouched":{  
                    "type":"keyword"
                },
                "fielddata": {
                    "loading": "eager"
                }
            }
        }
    }
}

You might want to look at https://www.elastic.co/guide/en/elasticsearch/reference/6.2/eager-global-ordinals.html
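In 6.x the eager loading setting goes on the keyword sub-field itself rather than under fielddata; if I read that page correctly, it would look something like:

"Consignee":{
    "properties":{
        "Name":{
            "type":"text",
            "fields":{
                "untouched":{
                    "type":"keyword",
                    "eager_global_ordinals": true
                }
            }
        }
    }
}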

This can slow down your indexing process but may make searches faster.

I'm sure @mvg might have a lot of other ideas... :smiley:

If you have a small set of results, try an execution hint of "map": https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-execution-hint
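Something along these lines (the index name and aggregation name are placeholders, the field path is from the mapping above):

POST my_index/_search
{
    "size": 0,
    "aggs": {
        "consignee_names": {
            "terms": {
                "field": "Consignee.Name.untouched",
                "execution_hint": "map"
            }
        }
    }
}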


Hi Mark, I haven't done any proper benchmarking yet, but so far it has been faster for my queries. What is the reason for the speed increase, and what are the downsides? Looking at the docs, it seems that queries with many results will be affected by the map execution hint. However, some of my queries return around a million results with no visible degradation compared to the default global_ordinals execution. What is considered a high-result query?

Also, thank you @dadoonet. Talking to you was a great learning experience.

Global ordinals execution mode aggregates information using numbers that each represent unique strings in the index. This means the memory overhead for labelling buckets is smaller than the length of the strings they replace but there's an overhead in the conversion.

map execution tags the buckets being aggregated directly with the indexed value, which could be long in some cases (imagine URL strings). If your query matches a lot of docs, that's a lot of strings to hold in memory before they are pruned down to your top-N selection.

Can you say how much faster?

We shut down the ES servers after office hours, so I can't run queries right now to verify, but earlier the normal searches were taking a little over 4 seconds while the ones with the map execution hint took around 3.5 seconds, and some of the lower-result queries took 2 seconds or less. I'd have to perform a proper benchmark tomorrow and get back to you.

That makes a lot of sense, thank you for the explanation. If I understand correctly, @dadoonet's recommendation would then negate the global ordinal conversion overhead since it would eagerly build the global 'hashtable'?

If you are only using the map execution hint on the field, there is no need for global ordinals to be built, so any "eager" loading would be wasted effort.

Yes, but would using global ordinals with eager loading be the better choice rather than the map execution hint?

On a static index, yes; otherwise you are adding cost to every refresh when new content is added.


I see, it is indicated in the docs as well. We would like to prioritize search speed over indexing speed. If worst comes to worst, we could batch the document updates into a single bulk request when there is low user activity, so there would be no refreshes during the day.
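A rough sketch of what that off-hours bulk request could look like (the index name, type, IDs and field values are just placeholders):

POST _bulk
{ "update": { "_index": "my_index", "_type": "_doc", "_id": "1" } }
{ "doc": { "Consignee": { "Name": "Some Company" } } }
{ "update": { "_index": "my_index", "_type": "_doc", "_id": "2" } }
{ "doc": { "Consignee": { "Name": "Another Company" } } }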

Thank you very much for the insights; I think you have already covered most, if not all, of my concerns. It has been a pleasure talking to you guys. Best of luck to the Elasticsearch team.



If the field you are aggregating on is high cardinality, then the slow response may be due to the computation of global ordinals (as @Mark_Harwood mentioned above). For future reference, the following blog post gives an overview of the underlying cause of the slowness and several approaches to addressing it.