Issue with elasticsearch-analysis-icu plugin


(Yesubabu B) #1

I am trying to use the elasticsearch-analysis-icu plugin to get case-insensitive sorting. The data is sorted correctly, but the sort field in query results and the aggregation bucket keys come back in what looks like Chinese characters. Is there any way to get them in English?


(Yesubabu B) #2

If anyone knows a solution for this problem, please help me.


(David Pilato) #3

But if you indexed Chinese, why would it come back as English?

You would have a better chance to get help if you follow this guide and provide examples.


(Yesubabu B) #4

Thanks for the reply. The indexed data is in English only, and I am using the plugin as shown below. Am I doing anything wrong?

PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "case_insensitive_sort": { 
          "type":     "icu_collation",
          "language": "en",
          "country":  "US"
        }
      },
      "analyzer": {
        "case_insensitive_sort": { 
          "tokenizer": "keyword",
          "filter":  [ "case_insensitive_sort" ]
        }
      }
    }
  }
}

PUT /my_index/_mapping/user
{
  "properties": {
    "name": {
      "type": "string",
      "fields": {
        "sort": {
          "type": "string",
          "analyzer": "case_insensitive_sort"
        }
      }
    }
  }
}

PUT /my_index/user/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

GET /my_index/user/_search?sort=name.sort
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": null,
      "hits": [
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "3",
            "_score": null,
            "_source": {
               "name": "bailey"
            },
            "sort": [
               "ᖔ乏昫တ倈⠀\u0001"
            ]
         },
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "1",
            "_score": null,
            "_source": {
               "name": "Boffey"
            },
            "sort": [
               "ᖢ䳌昫တ倎瀤\u0000\u0000"
            ]
         },
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "4",
            "_score": null,
            "_source": {
               "name": "Böhm"
            },
            "sort": [
               "ᖢ䷐ 䥠『瀠\u0000\u0000"
            ]
         },
         {
            "_index": "my_index",
            "_type": "user",
            "_id": "2",
            "_score": null,
            "_source": {
               "name": "BROWN"
            },
            "sort": [
               "ᖥ兕⡠႐໦獳䆸\u0000\u0000"
            ]
         }
      ]
   }
}

In the sort field, the content is returned in Chinese.


(David Pilato) #5

Are you running on a Chinese computer? I mean, are you using UTF8 or something else?


(Yesubabu B) #6

No, I am not running on a Chinese computer and I am not explicitly using UTF8 or anything like that. In the _source field the data comes back in English as expected, but in the sort field it comes back in Chinese.


(Matt Weber) #7

What you see is the result of the ICU collation, which will not be the
original text. It is meant for sorting purposes only. You would use
the value from the _source field for display.

Thanks,
Matt Weber
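
To illustrate the point above: a collation key is an opaque value built only so that comparing keys gives the desired order. The Python sketch below does not use ICU. `sort_key` is a hypothetical stand-in that case-folds the name and encodes it, which is enough to show the pattern of sorting by the key and displaying the original value.

```python
# Stand-in for a collation key (NOT the real ICU algorithm): an opaque
# value used only for comparison, never for display.
def sort_key(name: str) -> bytes:
    return name.casefold().encode("utf-8")

# Each hit keeps the original text (like _source) next to its sort key.
names = ["Boffey", "BROWN", "bailey"]
hits = [{"_source": {"name": n}, "sort": [sort_key(n)]} for n in names]

# Order by the key, but show the value from _source.
hits.sort(key=lambda hit: hit["sort"][0])
display = [hit["_source"]["name"] for hit in hits]
print(display)
```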


(Yesubabu B) #8

When I use a bucket aggregation, I cannot use the value from the _source field. Please look at the query and response below and suggest whether there is any way to get the aggregation result in English.

POST /my_index/user/_search
{
   "size": 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "desc"
            }
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "ᖥ兕⡠႐໦獳䆸\u0000\u0000",
               "doc_count": 1
            },
            {
               "key": "ᖢ䷐ 䥠『瀠\u0000\u0000",
               "doc_count": 1
            },
            {
               "key": "ᖢ䳌昫တ倎瀤\u0000\u0000",
               "doc_count": 1
            },
            {
               "key": "ᖔ乏昫တ倈⠀\u0001",
               "doc_count": 1
            }
         ]
      }
   }
}

(David Pilato) #9

According to the previous messages you pasted I believe you have been adding sub fields under name field.

I thought using type: string was not allowed anymore BTW.
Anyway, can you just create a subfield for aggs which does not use ICU? I mean an agg is not supposed to be analyzed (doc_values).


(Yesubabu B) #10

If I understood correctly, only a non-analyzed field has been used for aggs so far. We are trying to use ICU to get a case-insensitive sort (for display order in the UI) on the agg bucket values. If we do not use ICU, as you suggest, is there any other way to achieve it?

Expected result as below:

POST /my_index/user/_search
{
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "asc"
            }
         }
      }
   }
}

{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "Böhm",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            }
         ]
      }
   }
}

(David Pilato) #11

Please format your code using </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

You can use normalizers for that. See https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-normalizers.html
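
As a rough picture of what a normalizer with `lowercase` and `asciifolding` filters does to a keyword value before it is indexed, here is a Python approximation. It is only a sketch: `normalize_keyword` is a hypothetical helper, and stripping combining marks after NFKD decomposition covers fewer characters than the real `asciifolding` filter.

```python
import unicodedata

def normalize_keyword(value: str) -> str:
    """Approximate lowercase + asciifolding on a whole keyword value."""
    decomposed = unicodedata.normalize("NFKD", value.lower())
    # Drop combining marks such as the diaeresis in "ö".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Boffey", "BROWN", "bailey", "Böhm"]
normalized = sorted(normalize_keyword(n) for n in names)
print(normalized)
```

Note that the normalized form is what gets indexed, so a terms aggregation on such a field returns the normalized keys rather than the original spelling.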


(Yesubabu B) #12

Thanks for the formatting tip. All earlier posts have been updated.


(Yesubabu B) #13

Normalizers are working fine for our requirement, but ES needs to be upgraded from 2.3.5 to 5.3 or above.

PUT my_index1
{
   "settings": {
      "analysis": {
         "char_filter": {
            "quote": {
               "type": "mapping",
               "mappings": [
                  "« => \"",
                  "» => \""
               ]
            }
         },
         "normalizer": {
            "my_normalizer": {
               "type": "custom",
               "char_filter": [
                  "quote"
               ],
               "filter": [
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings": {
      "type": {
         "properties": {
            "name": {
               "type": "string",
               "fields": {
                  "sort": {
                     "type": "keyword",
                     "normalizer": "my_normalizer"
                  }
               }
            }
         }
      }
   }
}

PUT /my_index1/type/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

POST /my_index1/type/_search
{
   "size": 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "desc"
            }
         }
      }
   }
}

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 4,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "Bohm",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            }
         ]
      }
   }
}

Is there any way to achieve this in ES 2.3.5, or does ES have to be upgraded?


(David Pilato) #14

In the 2.x series, you can define an analyzer and apply it to a subfield as you did with the normalizer (using the string type, since keyword does not exist in 2.x).
That should work the same way.

        "name": {
           "type": "string",
           "fields": {
              "sort": {
                 "type": "string",
                 "analyzer": "my_analyzer"
              }
           }
        }

(Yesubabu B) #15

I was mistaken: it is not working as expected with either "normalizer" or "analyzer". It behaves as shown below.

PUT my_index1
{
   "settings": {
      "analysis": {
         "char_filter": {
            "quote": {
               "type": "mapping",
               "mappings": [
                  "« => \"",
                  "» => \""
               ]
            }
         },
         "analyzer": {
            "my_analyzer": {
               "type": "custom",
               "char_filter": [
                  "quote"
               ],
               "filter": [
                  "asciifolding"
               ],
               "tokenizer":"keyword"
            }
         }
      }
   },
   "mappings": {
      "type": {
         "properties": {
            "name": {
               "type": "string",
               "fields": {
                  "sort": {
                     "type": "string",
                     "analyzer": "my_analyzer"
                  }
               }
            }
         }
      }
   }
}

PUT /my_index1/type/_bulk
{"index":{"_id":1}}
{"name":"Boffey"}
{"index":{"_id":2}}
{"name":"BROWN"}
{"index":{"_id":3}}
{"name":"bailey"}
{"index":{"_id":4}}
{"name":"Böhm"}
{"index":{"_id":5}}
{"name":"bp"}
{"index":{"_id":6}}
{"name":"animal"}
{"index":{"_id":7}}
{"name":"Ample"}
{"index":{"_id":8}}
{"name":"category"}

POST /my_index1/type/_search
{
   "size": 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "asc"
            }
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 8,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "Ample",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "Bohm",
               "doc_count": 1
            },
            {
               "key": "animal",
               "doc_count": 1
            },
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "bp",
               "doc_count": 1
            },
            {
               "key": "category",
               "doc_count": 1
            }
         ]
      }
   }
}

In earlier posts I only checked words starting with "B" and "b", so I could not spot the issue. In this post it is clear.
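
The order in the response above is what plain term ordering gives: terms compare by character code, and every uppercase ASCII letter sorts before every lowercase one. A small Python sketch of the same effect, using the terms from that response (Python's default string sort compares code points, which gives the same order as comparing the UTF-8 bytes):

```python
terms = ["Boffey", "BROWN", "bailey", "Bohm", "bp", "animal", "Ample", "category"]

# Default sort compares code points: "A".."Z" (65-90) come before "a".."z" (97-122).
by_code_point = sorted(terms)
print(by_code_point)

# Case-insensitive order: compare lowercased copies, keep the original spelling.
case_insensitive = sorted(terms, key=str.lower)
print(case_insensitive)
```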


(David Pilato) #16

I don't see what is wrong TBH. Which result do you expect actually?


(Yesubabu B) #17

Below is the expected order:

"buckets": [
            {
               "key": "animal",
               "doc_count": 1
            },
            {
               "key": "Ample",
               "doc_count": 1
            },
            {
               "key": "bailey",
               "doc_count": 1
            },
            {
               "key": "Boffey",
               "doc_count": 1
            },
            {
               "key": "bp",
               "doc_count": 1
            },
            {
               "key": "BROWN",
               "doc_count": 1
            },
            {
               "key": "category",
               "doc_count": 1
            }
         ]

(David Pilato) #18

I see. So here is a trick.

Run a first level of terms agg on a lowercased field (like name.lower).
Then run a sub agg on the normal field (like name).

As a result you will get all buckets ordered by lowercased name but you will get the actual value from the sub aggregation result.
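
A way to picture the trick, sketched in plain Python rather than Elasticsearch: the outer grouping plays the role of the terms agg on the lowercased field, and the inner counter plays the role of the sub agg on the original field. The function name and data layout are made up for illustration.

```python
from collections import Counter, defaultdict

def two_level_terms(names):
    """Group by lowercased name, then count original spellings per group."""
    groups = defaultdict(Counter)
    for name in names:
        groups[name.lower()][name] += 1
    # Outer buckets are ordered by the lowercased key, ascending.
    return [(key, dict(groups[key])) for key in sorted(groups)]

names = ["Boffey", "BROWN", "bailey", "bp", "animal", "Ample", "category"]
buckets = two_level_terms(names)
for key, originals in buckets:
    print(key, originals)
```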


(Yesubabu B) #19

Got it, let us verify. Thanks a lot, hopefully it will solve our problem.


(Yesubabu B) #20

It's working fine, thanks David. The first level is a terms agg on a lowercased field (name.sort), with a sub agg on the non-analyzed field (name.not_ana).

PUT my_index1
{
   "settings": {
      "analysis": {
         "analyzer": {
            "analyzer_keyword": {
               "filter": "lowercase",
               "tokenizer": "keyword"
            }
         }
      }
   },
   "mappings": {
      "type": {
         "properties": {
            "name": {
               "type": "string",
               "fields": {
                  "sort": {
                     "type": "string",
                     "analyzer": "analyzer_keyword"
                  },
                  "not_ana": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}

PUT /my_index1/type/_bulk
{"index":{"_id":1}}
{"name":"Boffey"}
{"index":{"_id":2}}
{"name":"BROWN"}
{"index":{"_id":3}}
{"name":"bailey"}
{"index":{"_id":4}}
{"name":"Böhm"}
{"index":{"_id":5}}
{"name":"bp"}
{"index":{"_id":6}}
{"name":"animal"}
{"index":{"_id":7}}
{"name":"Ample"}
{"index":{"_id":8}}
{"name":"category"}

POST /my_index1/type/_search
{
   "size" : 0,
   "aggs": {
      "result": {
         "terms": {
            "field": "name.sort",
            "order": {
               "_term": "asc"
            }
         },
         "aggs": { 
            "sub_result": { 
               "terms": {
                  "field": "name.not_ana" 
               }
            }
         }
      }
   }
}

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 3,
      "successful": 3,
      "failed": 0
   },
   "hits": {
      "total": 8,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "result": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "ample",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "Ample",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "animal",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "animal",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "bailey",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "bailey",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "boffey",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "Boffey",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "bp",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "bp",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "brown",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "BROWN",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "böhm",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "Böhm",
                        "doc_count": 1
                     }
                  ]
               }
            },
            {
               "key": "category",
               "doc_count": 1,
               "sub_result": {
                  "doc_count_error_upper_bound": 0,
                  "sum_other_doc_count": 0,
                  "buckets": [
                     {
                        "key": "category",
                        "doc_count": 1
                     }
                  ]
               }
            }
         ]
      }
   }
}
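
For display, the nested response can be flattened on the client by walking the outer buckets in order and reading the keys of each `sub_result`. A minimal Python sketch, assuming the response has already been parsed into a dict with the shape shown above (the `display_names` helper is hypothetical, and the sample is trimmed to two buckets):

```python
def display_names(response: dict) -> list:
    """Return original-case names in the order of the outer (lowercased) buckets."""
    names = []
    for bucket in response["aggregations"]["result"]["buckets"]:
        for sub in bucket["sub_result"]["buckets"]:
            names.append(sub["key"])
    return names

sample = {
    "aggregations": {"result": {"buckets": [
        {"key": "ample", "doc_count": 1,
         "sub_result": {"buckets": [{"key": "Ample", "doc_count": 1}]}},
        {"key": "animal", "doc_count": 1,
         "sub_result": {"buckets": [{"key": "animal", "doc_count": 1}]}},
    ]}}
}
print(display_names(sample))
```

If several spellings share one lowercased key (say `Brown` and `BROWN`), the sub aggregation returns them as separate inner buckets, ordered by document count by default.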