Issue with elasticsearch-analysis-icu plugin

yesu · May 19, 2017, 9:37am

I am trying to use the elasticsearch-analysis-icu plugin to get case-insensitive sorting. The data is sorted correctly, but the sort values in query results and the bucket keys in aggregations come back in what looks like Chinese. Is there any way to get them in English?

yesu · May 22, 2017, 6:25am

If anyone knows a solution to this problem, please help.

dadoonet · May 22, 2017, 6:37am

But if you indexed Chinese, why would it come back as English?

You would have a better chance of getting help if you follow this guide and provide examples.

yesu · May 22, 2017, 8:38am

Thanks for the reply. The indexed data is in English only, and I am using the plugin as shown below. Am I doing anything wrong?

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "case_insensitive_sort": { 
          "type":     "icu_collation",
          "language": "en",
          "country":  "US"
        }
      },
      "analyzer": {
        "case_insensitive_sort": { 
          "tokenizer": "keyword",
          "filter":  [ "case_insensitive_sort" ]
        }
      }
    }
  }
}

PUT /my_index/_mapping/user
{
  "properties": {
    "name": {
      "type": "string",
      "fields": {
        "sort": {
          "type": "string",
          "analyzer": "case_insensitive_sort"
        }
      }
    }
  }
}

PUT /my_index/user/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

GET /my_index/user/_search?sort=name.sort
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": null,
      "hits": [
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "3",
            "_score": null,
            "_source": {
               "name": "bailey"
            },
            "sort": [
               "ᖔ乏昫တ倈⠀\u0001"
            ]
         },
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "1",
            "_score": null,
            "_source": {
               "name": "Boffey"
            },
            "sort": [
               "ᖢ䳌昫တ倎瀤\u0000\u0000"
            ]
         },
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "4",
            "_score": null,
            "_source": {
               "name": "Böhm"
            },
            "sort": [
               "ᖢ䷐ 䥠『瀠\u0000\u0000"
            ]
         },
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "2",
            "_score": null,
            "_source": {
               "name": "BROWN"
            },
            "sort": [
               "ᖥ兕⡠႐໦獳䆸\u0000\u0000"
            ]
         }
      ]
   }
}

In the sort field, the content is returned in Chinese.

dadoonet · May 22, 2017, 11:23am

Are you running on a Chinese computer? I mean, are you using UTF-8 or something else?

yesu · May 22, 2017, 11:43am

No, I am not running on a Chinese computer, and I am not explicitly setting UTF-8 or anything like that. In the _source field the data comes back in English as expected, but in the sort field it comes back in Chinese.

mattweber · May 22, 2017, 2:28pm

What you see is the result of the ICU collation, which will not be the
original text. It is meant for sorting purposes only. You would use
the value from the _source field for display.

Thanks,
Matt Weber
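
To illustrate the point above: a collation key is a derived value used only for ordering, while the text shown to the user comes from the original value. The sketch below is a rough Python analogy, not the ICU algorithm; `str.casefold` is a stand-in chosen for illustration, and `sort_key` is a hypothetical helper name.

```python
# Rough analogy only: casefold() is NOT the ICU collation algorithm.
# It just demonstrates "order by a derived key, display the original".
names = ["Boffey", "BROWN", "bailey", "Böhm"]


def sort_key(value: str) -> str:
    """Return a derived key used for ordering, never for display."""
    return value.casefold()


# The key decides the order; the original strings are what gets shown.
ordered = sorted(names, key=sort_key)
print(ordered)
```

This stand-in places Böhm after BROWN, whereas the ICU collation in the response above placed it before. Only the idea of sorting by a key and displaying the original carries over.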

yesu · May 23, 2017, 6:40am

With a bucket aggregation I cannot use the value from the _source field. Please look at the query and response below and suggest whether there is any way to get the aggregation result in English.

POST /my_index/user/_search
{
   "size": 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "desc"
            }
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "ᖥ兕⡠႐໦獳䆸\u0000\u0000",
               "doc_count": 1
            },
            {
               "key": "ᖢ䷐ 䥠『瀠\u0000\u0000",
               "doc_count": 1
            },
            {
               "key": "ᖢ䳌昫တ倎瀤\u0000\u0000",
               "doc_count": 1
            },
            {
               "key": "ᖔ乏昫တ倈⠀\u0001",
               "doc_count": 1
            }
         ]
      }
   }
}

dadoonet · May 29, 2017, 5:38am

According to the previous messages you pasted, I believe you have been adding subfields under the name field.

I thought using type: string was not allowed anymore, BTW.
Anyway, can you just create a subfield for aggs which does not use ICU? I mean, an agg is supposed to run on a field that is not analyzed (doc_values).

yesu · May 29, 2017, 6:52am

If I understood correctly, so far only a non-analyzed field has been used for the agg. We are trying to use ICU to get a case-insensitive sort (for display order in the UI) on the agg bucket values. If we do not use ICU, as you suggest, is there any other way to achieve it?

Expected result as below:

POST /my_index/user/_search
{
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "asc"
            }
         }
      }
   }
}

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "Böhm",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            }
         ]
      }
   }
}

dadoonet · May 29, 2017, 7:06am

Please format your code using </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

You can use normalizers for that. See https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-normalizers.html

yesu · May 29, 2017, 7:20am

Thanks for the formatting tip. All earlier posts have been updated.

yesu · May 30, 2017, 7:04am

Normalizers work fine for our requirement, but ES needs to be upgraded from 2.3.5 to 5.3 or above.

PUT my_index1
{
   "settings": {
      "analysis": {
         "char_filter": {
            "quote": {
               "type": "mapping",
               "mappings": [
                  "« => \"",
                  "» => \""
               ]
            }
         },
         "normalizer": {
            "my_normalizer": {
               "type": "custom",
               "char_filter": [
                  "quote"
               ],
               "filter": [
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings": {
      "type": {
         "properties": {
            "name": {
               "type": "string",
               "fields": {
                  "sort": {
                     "type": "keyword",
                     "normalizer": "my_normalizer"
                  }
               }
            }
         }
      }
   }
}

PUT /my_index1/type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

POST /my_index1/type/_search
{
   "size": 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "desc"
            }
         }
      }
   }
}

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "Bohm",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            }
         ]
      }
   }
}

Is there any way to achieve this in ES 2.3.5, or does ES have to be upgraded?

dadoonet · May 30, 2017, 7:34am

In 2.x series, you can define an analyzer and apply it to a subfield as you did with the normalizer.
That should work the same way.

        "name": {
           "type": "string",
           "fields": {
              "sort": {
                 "type": "string",
                 "analyzer": "my_analyzer"
              }
           }
        }

yesu · May 30, 2017, 8:57am

I was mistaken; it is not working as expected with either "normalizer" or "analyzer". It behaves as shown below.

PUT my_index1
{
   "settings": {
      "analysis": {
         "char_filter": {
            "quote": {
               "type": "mapping",
               "mappings": [
                  "« => \"",
                  "» => \""
               ]
            }
         },
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "char_filter": [
                  "quote"
               ],
               "filter": [
                  "asciifolding"
               ],
               "tokenizer":"keyword"
            }
         }
      }
   },
   "mappings": {
      "type": {
         "properties": {
            "name": {
               "type": "string",
               "fields": {
                  "sort": {
                     "type": "string",
                     "analyzer": "my_analyzer"
                  }
               }
            }
         }
      }
   }
}

PUT /my_index1/type/_bulk
{"index":{"_id":1}}
{"name":"Boffey"}
{"index":{"_id":2}}
{"name":"BROWN"}
{"index":{"_id":3}}
{"name":"bailey"}
{"index":{"_id":4}}
{"name":"Böhm"}
{"index":{"_id":5}}
{"name":"bp"}
{"index":{"_id":6}}
{"name":"animal"}
{"index":{"_id":7}}
{"name":"Ample"}
{"index":{"_id":8}}
{"name":"category"}

POST /my_index1/type/_search
{
   "size": 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "asc"
            }
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 8,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "Ample",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "Bohm",
               "doc_count": 1
            },
            {
               "key": "animal",
               "doc_count": 1
            },
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "bp",
               "doc_count": 1
            },
            {
               "key": "category",
               "doc_count": 1
            }
         ]
      }
   }
}

In the earlier posts I only checked words starting with "B" and "b", so I could not spot the issue. In this post it is clear.
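
The observed order is consistent with the terms being compared by their raw code points: asciifolding removes the diacritic from Böhm but leaves case untouched, and every uppercase ASCII letter sorts before every lowercase one. A small Python sketch of that comparison, assuming the bucket keys are ordered the same way Python orders plain strings:

```python
# Terms as they look after asciifolding: "Böhm" became "Bohm", case unchanged.
folded = ["Boffey", "BROWN", "bailey", "Bohm", "bp", "animal", "Ample", "category"]

# Plain string comparison goes by code point, so "A".."Z" come before "a".."z".
by_code_point = sorted(folded)

# Comparing on a lowercased key removes the uppercase/lowercase split.
by_lowercase = sorted(folded, key=str.lower)

print(by_code_point)
print(by_lowercase)
```

The first list reproduces the bucket order in the response above; the second is the case-insensitive order.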

dadoonet · May 30, 2017, 9:50am

I don't see what is wrong TBH. Which result do you expect actually?

yesu · May 30, 2017, 10:10am

Below is expected order:

"buckets": [
            {
               "key": "animal",
               "doc_count": 1
            },
            {
               "key": "Ample",
               "doc_count": 1
            },
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "bp",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            },
            {
               "key": "category",
               "doc_count": 1
            }
         ]

dadoonet · May 30, 2017, 11:33am

I see. So here is a trick.

Run a first level of terms agg on a lowercased field (like name.lower).
Then run a sub agg on the normal field (like name).

As a result you will get all buckets ordered by lowercased name but you will get the actual value from the sub aggregation result.
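
A rough local simulation of this two-level idea in Python, assuming plain `str.lower` approximates the lowercased field. `bucket_by_lowercase` is a hypothetical helper written for illustration, not an Elasticsearch API.

```python
from collections import defaultdict


def bucket_by_lowercase(names):
    """Group original values under their lowercased key, ordered by that key."""
    groups = defaultdict(list)
    for name in names:
        # Outer bucket key: the lowercased value (like a name.lower field).
        groups[name.lower()].append(name)
    # Inner bucket: the untouched original values (like the plain name field).
    return [(key, sorted(groups[key])) for key in sorted(groups)]


names = ["Boffey", "BROWN", "bailey", "Böhm", "bp", "animal", "Ample", "category"]
buckets = bucket_by_lowercase(names)
for key, originals in buckets:
    print(key, originals)
```

The outer key is used only for ordering, and the inner list carries the value to display.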

yesu · May 30, 2017, 11:58am

Got it, let us verify. Thanks a lot, hopefully it will solve our problem.

yesu · May 30, 2017, 12:18pm

It's working fine. Thanks, David. The first-level terms agg is on a lowercased field (name.sort), and the sub agg is on the non-analyzed field (name.not_ana).

PUT my_index1
{
   "settings": {
      "analysis": {
         "analyzer": {
            "analyzer_keyword": {
               "filter": "lowercase",
               "tokenizer": "keyword"
            }
         }
      }
   },
   "mappings": {
      "type": {
         "properties": {
            "name": {
               "type": "string",
               "fields": {
                  "sort": {
                     "type": "string",
                     "analyzer": "analyzer_keyword"
                  },
                  "not_ana": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}

PUT /my_index1/type/_bulk
{"index":{"_id":1}}
{"name":"Boffey"}
{"index":{"_id":2}}
{"name":"BROWN"}
{"index":{"_id":3}}
{"name":"bailey"}
{"index":{"_id":4}}
{"name":"Böhm"}
{"index":{"_id":5}}
{"name":"bp"}
{"index":{"_id":6}}
{"name":"animal"}
{"index":{"_id":7}}
{"name":"Ample"}
{"index":{"_id":8}}
{"name":"category"}

POST /my_index1/type/_search
{
   "size" : 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "asc"
            }
         },
         "aggs": { 
            "sub_result": { 
               "terms": {
                  "field": "name.not_ana" 
               }
            }
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 8,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "ample",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "Ample",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "animal",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "animal",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "bailey",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "bailey",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "boffey",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "Boffey",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "bp",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "bp",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "brown",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "BROWN",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "böhm",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "Böhm",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "category",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "category",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }
}
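
On the client side, the display values can then be read from the inner buckets, in the order of the outer ones. A minimal Python sketch, assuming the response has already been parsed into a dict shaped like the one above; `display_keys` is a hypothetical helper, and the sample is trimmed to three buckets.

```python
def display_keys(response, outer="result", inner="sub_result"):
    """Collect the original-case keys, keeping the outer bucket order."""
    keys = []
    for bucket in response["aggregations"][outer]["buckets"]:
        # Each outer bucket (lowercased key) holds the original values inside.
        for sub_bucket in bucket[inner]["buckets"]:
            keys.append(sub_bucket["key"])
    return keys


# Trimmed sample with the same shape as the response above.
sample = {
    "aggregations": {
        "result": {
            "buckets": [
                {"key": "ample", "doc_count": 1,
                 "sub_result": {"buckets": [{"key": "Ample", "doc_count": 1}]}},
                {"key": "brown", "doc_count": 1,
                 "sub_result": {"buckets": [{"key": "BROWN", "doc_count": 1}]}},
                {"key": "böhm", "doc_count": 1,
                 "sub_result": {"buckets": [{"key": "Böhm", "doc_count": 1}]}},
            ]
        }
    }
}

print(display_keys(sample))
```

If two documents differ only by case (for example "Brown" and "BROWN"), they share one outer bucket and appear as separate inner buckets, so the helper returns both.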
