ICU sorting of terms aggregation with multi-valued fields

mhugo · February 22, 2022, 11:29am

Hi.
I struggle to sort buckets of a terms aggregation based on ICU collation to respect the ordering of accented words. And especially when the text field to sort is indexed with arrays.

I started with a simple term aggregation on a text field. Of course such fields cannot be used in aggregations, so I added a keyword as "sub-field" to the mapping. I also added the collation key from ICU to be able to correctly sort texts. I ended up with something like:

PUT /test_icu
{
    "mappings": {
        "properties": {
            "t": {
                "type": "text",
                "fields": {
                    "sort": {
                        "type": "icu_collation_keyword",
                        "index": false  // I had to add this to get non emtpy results
                    }
                }
            }
        }
    }
}

Then adding some single-valued documents with two simple strings "x" and "î" (i with circumflex accent which should be ordered before x)

POST /test_icu/_doc/1
{
    "t": "x"
}

#
POST /test_icu/_doc/2
{
    "t": "î"
}

Now the aggregation query:

GET /test_icu/_search
{
    "aggs": {
        "tt": {
            "terms": {
                "field": "t.sort"
            }
        }
    },
    "size": 0
}

... gives bucket names that are not human readable (collation keys):

{
"took":4,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}
"hits":{"total":{"value":2,"relation":"eq"},"max_score":null,"hits":[]}
"aggregations": {
  "tt":{
    "doc_count_error_upper_bound":0,
    "sum_other_doc_count":0,
    "buckets":[
        {"key":"ᴀ兣䀠怀","doc_count":1},
        {"key":"Ⰰ䅀₠\0\0","doc_count":1}]}}}

In order to get the corresponding original string from the icu collation key, I added a sub-aggregation on a "raw" (keyword) sub field (new mapping first, then the query)

PUT /test_icu
{
    "mappings": {
        "properties": {
            "t": {
                "type": "text",
                "fields": {
                    "sort": {
                        "type": "icu_collation_keyword",
                        "index": false
                    },
                    "raw": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

GET /test_icu/_search
{
    "aggs": {
        "tt": {
            "aggs": {
                "traw": {
                    "terms": {
                        "field": "t.raw"
                    }
                }
            },
            "terms": {
                "field": "t.sort"
            }
        }
    },
    "size": 0
}

This gives nice results where buckets have a sub bucket key made of the original string (I've replaced non printable characters with "<non_printable>"):

{
    "aggregations": {
        "tt": {
            "buckets": [
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "î"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable>"
                },
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "x"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable>"
                }
            ],
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": {
            "relation": "eq",
            "value": 2
        }
    },
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 1,
        "total": 1
    },
    "timed_out": false,
    "took": 2
}

So far, so good.
Now, what If a new document has a mutli value ? (array):

POST /test_icu/_doc/3
{
    "t": ["à", "z"]
}

... now the query gives multiple sub buckets and we cannot tell which original value correspond to which collation key (omitting the first two documents for clarity):

{
    "aggregations": {
        "tt": {
            "buckets": [
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "z"
                            },
                            {
                                "doc_count": 1,
                                "key": "à"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable1>"
                },
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "z"
                            },
                            {
                                "doc_count": 1,
                                "key": "à"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable2>"
                }
            ],
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": {
            "relation": "eq",
            "value": 1
        }
    },
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 1,
        "total": 1
    },
    "timed_out": false,
    "took": 3
}

I cannot find a proper solution to this.

I've tried to aggregate and sort on the concatenation of the icu collation and the original string, with something like this:

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i] + '||' + doc['t.raw'][i]);}"
        }
    },
    "size": 0,
    "aggs": {
        "tt": {
            "terms": {
                "field": "myt"
            }
        }
    }
}

However, results are not what I would expect (they are not sorted correctly). I think this is because there is no guarantee that doc['t.sort'][i] corresponds to doc['t.raw'][i], both arrays may be in different arbitrary orders ...

Is there any way to address this issue ?

Thanks in advance for your help.
Hugo

Tomo_M · February 22, 2022, 5:10pm

If you want to sort terms in terms aggregation, you have to specify "order" parameter.

How about

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i] + '||' + doc['t.raw'][i]);}"
        }
    },
    "size": 0,
    "aggs": {
        "tt": {
            "terms": {
                "field": "myt",
                "order": { "_key": "asc" }
            }
        }
    }
}

mhugo · February 23, 2022, 3:28pm

Sure, thanks.
But the problem comes from the fact that array values in a multi field cannot be accessed in a consistent order in scripts.
Let's simplify a bit to illustrate. With this simple query (I've added encodeBase64 so that we can print collation keys):

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i].encodeBase64() + '||' + doc['t.raw'][i]);}"
        }
    },
    "_source": {},
    "fields": [{"field": "myt"}]
}

... I get the following result in "fields":

                "fields": {
                    "myt": [
                        "KgFFwooBBgA=||z",
                        "XAEFAQUA||à"
                    ]
                },

... which would mean "à" has an collation key of "XAEFAQUA" (in base64). However, this is wrong, the collation key of "à" is "KgFFwooBBgA=" (confirmed manually with pyICU). So my conclusion is that this concatenation hack cannot work.
So I cannot find a way to sort term aggregations respecting the language ordering when array values are used with this technique.

The only alternative would be to do this concatenation at index time by the client ... or patch the analysis-icu aggregation somehow ...

Tomo_M · February 23, 2022, 3:46pm

What will happen if you use params['_source']['t.sort'] rather than doc['t.sort']?
This could raise performance issue, but the order should be strictly the same as the source you created.

Tomo_M · February 23, 2022, 3:49pm

Hm, it will not work because there could not be the field of t.sort nor t.raw field in the _source...

Tomo_M · February 23, 2022, 3:54pm

Anyway, why you need bucket sorting?

mhugo · February 24, 2022, 3:35pm

Thanks for your suggestion, but yes, it does not work, since t.sort and t.raw are not part of _source.

Anyway, why you need bucket sorting?

Well I just need to sort results of a term aggregations. But my use case involves an order that must respect the language aware order of texts. And on top of that, it should work with array values.

I am considering writing a plugin just for this use case, but it seems a bit overkill to me ...

system · March 24, 2022, 3:36pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Terms aggregation with ICU multi-field and arrays Elasticsearch runtime-fields	2	688	March 22, 2022
Unexpected Behavior with ICU Collation Keyword Sorting Elastic Search	1	21	December 9, 2024
Is case-insensitive sort on keyword aggregation possible ? (ES 6.4) Elasticsearch	6	2719	November 16, 2018
[Ann] ICU facet allows sorting based on ICU collations Elasticsearch	4	458	July 6, 2017
Icu_collation as keyword normalizer Elasticsearch	3	1123	March 13, 2017

ICU sorting of terms aggregation with multi-valued fields

Related topics