Terms aggregation with ICU multi-field and arrays

mhugo · February 22, 2022, 1:03pm

Hi.
I struggle to sort buckets of a terms aggregation based on ICU collation to respect the ordering of accented words. And especially when some the text field to sort is indexed as arrays.

I started with a simple term aggregation on a text field. Of course such fields cannot be used in aggregations, so I added a keyword as "sub-field" to the mapping. I also added the collation key from ICU to be able to correctly sort texts. I ended up with something like:

PUT /test_icu
{
    "mappings": {
        "properties": {
            "t": {
                "type": "text",
                "fields": {
                    "sort": {
                        "type": "icu_collation_keyword",
                        "index": false  // I had to add this to get non emtpy results
                    }
                }
            }
        }
    }
}

Then adding some simple-valued documents with two simple strings "x" and "î" (i with circumflex accent which should be ordered before x)

POST /test_icu/_doc/1
{
    "t": "x"
}

#
POST /test_icu/_doc/2
{
    "t": "î"
}

Now the aggregation query:

GET /test_icu/_search
{
    "aggs": {
        "tt": {
            "terms": {
                "field": "t.sort"
            }
        }
    },
    "size": 0
}

... gives bucket names that are not human readable (collation keys):

{
"took":4,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}
"hits":{"total":{"value":2,"relation":"eq"},"max_score":null,"hits":[]}
"aggregations": {
  "tt":{
    "doc_count_error_upper_bound":0,
    "sum_other_doc_count":0,
    "buckets":[
        {"key":"ᴀ兣䀠怀","doc_count":1},
        {"key":"Ⰰ䅀₠\0\0","doc_count":1}]}}}

In order to get the corresponding original string from the icu collation key, I added a sub-aggregation on a "raw" (keyword) sub field (new mapping first, then the query)

PUT /test_icu
{
    "mappings": {
        "properties": {
            "t": {
                "type": "text",
                "fields": {
                    "sort": {
                        "type": "icu_collation_keyword",
                        "index": false
                    },
                    "raw": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

GET /test_icu/_search
{
    "aggs": {
        "tt": {
            "aggs": {
                "traw": {
                    "terms": {
                        "field": "t.raw"
                    }
                }
            },
            "terms": {
                "field": "t.sort"
            }
        }
    },
    "size": 0
}

This gives nice results where buckets have a sub bucket key made of the original string (I've replaced non printable characters with "<non_printable>"):

{
    "aggregations": {
        "tt": {
            "buckets": [
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "î"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable>"
                },
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "x"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable>"
                }
            ],
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": {
            "relation": "eq",
            "value": 2
        }
    },
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 1,
        "total": 1
    },
    "timed_out": false,
    "took": 2
}

Now, what If a new document has a mutli value ? (array):

POST /test_icu/_doc/3
{
    "t": [ "à", "z"]
}

It will result in multiple sub buckets for each bucket where it becomes impossible to know which sub bucket key corresponds to the parent bucket key ...

{
    "aggregations": {
        "tt": {
            "buckets": [
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "z"
                            },
                            {
                                "doc_count": 1,
                                "key": "à"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable1>"
                },
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "z"
                            },
                            {
                                "doc_count": 1,
                                "key": "à"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable2>"
                }
            ],
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": {
            "relation": "eq",
            "value": 1
        }
    },
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 1,
        "total": 1
    },
    "timed_out": false,
    "took": 852
}

I did not find a proper solution to this.
I've tried a hack where the aggregation is done on the concatenation of the collation key and the raw string, thanks to a script:

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i] + '||' + doc['t.raw'][i]);}"
        }
    },
    "size": 0,
    "aggs": {
        "tt": {
            "terms": {
                "field": "myt"
            }
        }
    }
}

... but this does not work as I would expect, keys are not correctly ordered. It seems there is no guarantee that doc['t.sort'][i] correspond to doc['t.raw'][i]. The two arrays may be ordered differently ...

Is there a way to address the need of a language-aware sorting of bucket keys ?

Alternatively, is there a way to fix access to the different arrays of a multifield so that we get elements in a fixed order ?

Thanks in advance for your help.
Hugo

mhugo · February 22, 2022, 1:07pm

Sorry, duplicate of ICU sorting of terms aggregation with multi-valued fields. This one can be closed

system · March 22, 2022, 1:08pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
ICU sorting of terms aggregation with multi-valued fields Elasticsearch runtime-fields	7	700	March 24, 2022
Unexpected Behavior with ICU Collation Keyword Sorting Elastic Search	1	21	December 9, 2024
Is case-insensitive sort on keyword aggregation possible ? (ES 6.4) Elasticsearch	6	2719	November 16, 2018
Icu_collation as keyword normalizer Elasticsearch	3	1123	March 13, 2017
[Ann] ICU facet allows sorting based on ICU collations Elasticsearch	4	458	July 6, 2017

Terms aggregation with ICU multi-field and arrays

Related topics