Hi.
I struggle to sort buckets of a terms aggregation based on ICU collation to respect the ordering of accented words. And especially when some the text field to sort is indexed as arrays.
I started with a simple term aggregation on a text field. Of course such fields cannot be used in aggregations, so I added a keyword as "sub-field" to the mapping. I also added the collation key from ICU to be able to correctly sort texts. I ended up with something like:
PUT /test_icu
{
"mappings": {
"properties": {
"t": {
"type": "text",
"fields": {
"sort": {
"type": "icu_collation_keyword",
"index": false // I had to add this to get non emtpy results
}
}
}
}
}
}
Then adding some simple-valued documents with two simple strings "x" and "î" (i with circumflex accent which should be ordered before x)
POST /test_icu/_doc/1
{
"t": "x"
}
#
POST /test_icu/_doc/2
{
"t": "î"
}
Now the aggregation query:
GET /test_icu/_search
{
"aggs": {
"tt": {
"terms": {
"field": "t.sort"
}
}
},
"size": 0
}
... gives bucket names that are not human readable (collation keys):
{
"took":4,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}
"hits":{"total":{"value":2,"relation":"eq"},"max_score":null,"hits":[]}
"aggregations": {
"tt":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{"key":"ᴀ兣䀠怀","doc_count":1},
{"key":"Ⰰ䅀₠\0\0","doc_count":1}]}}}
In order to get the corresponding original string from the icu collation key, I added a sub-aggregation on a "raw" (keyword) sub field (new mapping first, then the query)
PUT /test_icu
{
"mappings": {
"properties": {
"t": {
"type": "text",
"fields": {
"sort": {
"type": "icu_collation_keyword",
"index": false
},
"raw": {
"type": "keyword"
}
}
}
}
}
}
GET /test_icu/_search
{
"aggs": {
"tt": {
"aggs": {
"traw": {
"terms": {
"field": "t.raw"
}
}
},
"terms": {
"field": "t.sort"
}
}
},
"size": 0
}
This gives nice results where buckets have a sub bucket key made of the original string (I've replaced non printable characters with "<non_printable>"):
{
"aggregations": {
"tt": {
"buckets": [
{
"traw": {
"buckets": [
{
"doc_count": 1,
"key": "î"
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
},
"doc_count": 1,
"key": "<non_printable>"
},
{
"traw": {
"buckets": [
{
"doc_count": 1,
"key": "x"
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
},
"doc_count": 1,
"key": "<non_printable>"
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
}
},
"hits": {
"hits": [],
"max_score": null,
"total": {
"relation": "eq",
"value": 2
}
},
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 1,
"total": 1
},
"timed_out": false,
"took": 2
}
Now, what If a new document has a mutli value ? (array):
POST /test_icu/_doc/3
{
"t": [ "à", "z"]
}
It will result in multiple sub buckets for each bucket where it becomes impossible to know which sub bucket key corresponds to the parent bucket key ...
{
"aggregations": {
"tt": {
"buckets": [
{
"traw": {
"buckets": [
{
"doc_count": 1,
"key": "z"
},
{
"doc_count": 1,
"key": "à"
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
},
"doc_count": 1,
"key": "<non_printable1>"
},
{
"traw": {
"buckets": [
{
"doc_count": 1,
"key": "z"
},
{
"doc_count": 1,
"key": "à"
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
},
"doc_count": 1,
"key": "<non_printable2>"
}
],
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0
}
},
"hits": {
"hits": [],
"max_score": null,
"total": {
"relation": "eq",
"value": 1
}
},
"_shards": {
"failed": 0,
"skipped": 0,
"successful": 1,
"total": 1
},
"timed_out": false,
"took": 852
}
I did not find a proper solution to this.
I've tried a hack where the aggregation is done on the concatenation of the collation key and the raw string, thanks to a script:
GET /test_icu/_search
{
"runtime_mappings": {
"myt": {
"type": "keyword",
"script": "for (int i=0;i<doc['t.sort'].size();i++){emit(doc['t.sort'][i] + '||' + doc['t.raw'][i]);}"
}
},
"size": 0,
"aggs": {
"tt": {
"terms": {
"field": "myt"
}
}
}
}
... but this does not work as I would expect, keys are not correctly ordered. It seems there is no guarantee that doc['t.sort'][i]
correspond to doc['t.raw'][i]
. The two arrays may be ordered differently ...
Is there a way to address the need of a language-aware sorting of bucket keys ?
Alternatively, is there a way to fix access to the different arrays of a multifield so that we get elements in a fixed order ?
Thanks in advance for your help.
Hugo