ICU sorting of terms aggregation with multi-valued fields

Hi.
I struggle to sort buckets of a terms aggregation based on ICU collation to respect the ordering of accented words. And especially when the text field to sort is indexed with arrays.

I started with a simple term aggregation on a text field. Of course such fields cannot be used in aggregations, so I added a keyword as "sub-field" to the mapping. I also added the collation key from ICU to be able to correctly sort texts. I ended up with something like:

PUT /test_icu
{
    "mappings": {
        "properties": {
            "t": {
                "type": "text",
                "fields": {
                    "sort": {
                        "type": "icu_collation_keyword",
                        "index": false  // I had to add this to get non emtpy results
                    }
                }
            }
        }
    }
}

Then adding some single-valued documents with two simple strings "x" and "î" (i with circumflex accent which should be ordered before x)

POST /test_icu/_doc/1
{
    "t": "x"
}

#
POST /test_icu/_doc/2
{
    "t": "î"
}

Now the aggregation query:

GET /test_icu/_search
{
    "aggs": {
        "tt": {
            "terms": {
                "field": "t.sort"
            }
        }
    },
    "size": 0
}

... gives bucket names that are not human readable (collation keys):

{
"took":4,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}
"hits":{"total":{"value":2,"relation":"eq"},"max_score":null,"hits":[]}
"aggregations": {
  "tt":{
    "doc_count_error_upper_bound":0,
    "sum_other_doc_count":0,
    "buckets":[
        {"key":"ᴀ兣䀠怀","doc_count":1},
        {"key":"Ⰰ䅀₠\0\0","doc_count":1}]}}}

In order to get the corresponding original string from the icu collation key, I added a sub-aggregation on a "raw" (keyword) sub field (new mapping first, then the query)

PUT /test_icu
{
    "mappings": {
        "properties": {
            "t": {
                "type": "text",
                "fields": {
                    "sort": {
                        "type": "icu_collation_keyword",
                        "index": false
                    },
                    "raw": {
                        "type": "keyword"
                    }
                }
            }
        }
    }
}
GET /test_icu/_search
{
    "aggs": {
        "tt": {
            "aggs": {
                "traw": {
                    "terms": {
                        "field": "t.raw"
                    }
                }
            },
            "terms": {
                "field": "t.sort"
            }
        }
    },
    "size": 0
}

This gives nice results where buckets have a sub bucket key made of the original string (I've replaced non printable characters with "<non_printable>"):

{
    "aggregations": {
        "tt": {
            "buckets": [
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "î"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable>"
                },
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "x"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable>"
                }
            ],
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": {
            "relation": "eq",
            "value": 2
        }
    },
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 1,
        "total": 1
    },
    "timed_out": false,
    "took": 2
}

So far, so good.
Now, what If a new document has a mutli value ? (array):

POST /test_icu/_doc/3
{
    "t": ["à", "z"]
}

... now the query gives multiple sub buckets and we cannot tell which original value correspond to which collation key (omitting the first two documents for clarity):

{
    "aggregations": {
        "tt": {
            "buckets": [
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "z"
                            },
                            {
                                "doc_count": 1,
                                "key": "à"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable1>"
                },
                {
                    "traw": {
                        "buckets": [
                            {
                                "doc_count": 1,
                                "key": "z"
                            },
                            {
                                "doc_count": 1,
                                "key": "à"
                            }
                        ],
                        "sum_other_doc_count": 0,
                        "doc_count_error_upper_bound": 0
                    },
                    "doc_count": 1,
                    "key": "<non_printable2>"
                }
            ],
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0
        }
    },
    "hits": {
        "hits": [],
        "max_score": null,
        "total": {
            "relation": "eq",
            "value": 1
        }
    },
    "_shards": {
        "failed": 0,
        "skipped": 0,
        "successful": 1,
        "total": 1
    },
    "timed_out": false,
    "took": 3
}

I cannot find a proper solution to this.

I've tried to aggregate and sort on the concatenation of the icu collation and the original string, with something like this:

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i] + '||' + doc['t.raw'][i]);}"
        }
    },
    "size": 0,
    "aggs": {
        "tt": {
            "terms": {
                "field": "myt"
            }
        }
    }
}

However, results are not what I would expect (they are not sorted correctly). I think this is because there is no guarantee that doc['t.sort'][i] corresponds to doc['t.raw'][i], both arrays may be in different arbitrary orders ...

Is there any way to address this issue ?

Thanks in advance for your help.
Hugo

1 Like

If you want to sort terms in terms aggregation, you have to specify "order" parameter.

How about

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i] + '||' + doc['t.raw'][i]);}"
        }
    },
    "size": 0,
    "aggs": {
        "tt": {
            "terms": {
                "field": "myt",
                "order": { "_key": "asc" }
            }
        }
    }
}

Sure, thanks.
But the problem comes from the fact that array values in a multi field cannot be accessed in a consistent order in scripts.
Let's simplify a bit to illustrate. With this simple query (I've added encodeBase64 so that we can print collation keys):

GET /test_icu/_search
{
    "runtime_mappings": {
        "myt": {
            "type": "keyword",
            "script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i].encodeBase64() + '||' + doc['t.raw'][i]);}"
        }
    },
    "_source": {},
    "fields": [{"field": "myt"}]
}

... I get the following result in "fields":

                "fields": {
                    "myt": [
                        "KgFFwooBBgA=||z",
                        "XAEFAQUA||à"
                    ]
                },

... which would mean "à" has an collation key of "XAEFAQUA" (in base64). However, this is wrong, the collation key of "à" is "KgFFwooBBgA=" (confirmed manually with pyICU). So my conclusion is that this concatenation hack cannot work.
So I cannot find a way to sort term aggregations respecting the language ordering when array values are used with this technique.

The only alternative would be to do this concatenation at index time by the client ... or patch the analysis-icu aggregation somehow ...

What will happen if you use params['_source']['t.sort'] rather than doc['t.sort']?
This could raise performance issue, but the order should be strictly the same as the source you created.

Hm, it will not work because there could not be the field of t.sort nor t.raw field in the _source...

Anyway, why you need bucket sorting?

Thanks for your suggestion, but yes, it does not work, since t.sort and t.raw are not part of _source.

Anyway, why you need bucket sorting?

Well I just need to sort results of a term aggregations. But my use case involves an order that must respect the language aware order of texts. And on top of that, it should work with array values.

I am considering writing a plugin just for this use case, but it seems a bit overkill to me ...

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.