Problem with icu_normalizer and acute accent

mconsoni · June 14, 2017, 7:57pm

Hi!

I'm having an issue with icu_normalizer and acute accents, and I don't know whether it's something I'm not seeing, a bug, a misconfiguration, or something else. I'm very new to ES.

I have an analyzer that uses a char_filter of type icu_normalizer with name nfkc.

When I test the analyzer, the acute accent is merged with the following character. As a result, I can search for the documents using the original text, but I cannot filter on it.

I tested it with ES 1.7 on Debian and ES 5.4.1 on OS X, both with Java 1.8.

I create a document with an attribute that contains an acute accent. When I run a match search, the document is returned:

{
        "query": {
                "match": {
                        "photo.location.exact": {
                                "query": "Li´ege"
                        }
                }
        }
}

But when I try to filter by the original string, it's not returned:

{
    "size": 1000,
    "from": 0,
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "minimum_should_match": 1,
                    "must": [
                        {
                            "terms": {
                                "photo.location.exact": [
                                    "NY",
                                    "París",
                                    "Li´ege"
                                ]
                            }
                        }
                    ]
                }
            }
        }
    }
}

If I replace Li´ege with the text returned by analyzing it, the document is returned.
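
The behaviour above is consistent with a match query running its text through the field's analyzer while a terms query compares its values against the indexed terms as given. A minimal sketch of a client-side workaround follows. It assumes that Python's standard unicodedata NFKC produces the same output as the icu_normalizer char filter with name nfkc (not verified against ICU here). The helper names normalize_exact and build_terms_query are hypothetical, and the query is simplified to a bare terms query instead of the full filtered/bool body.

```python
import unicodedata

FIELD = "photo.location.exact"
MAX_LENGTH = 1000  # mirrors the "length" of the truncate_field filter in the index settings


def normalize_exact(value):
    """Approximate exact_analyzer on the client: NFKC, then truncate.

    Assumption: unicodedata's NFKC matches the ICU "nfkc" char filter.
    """
    return unicodedata.normalize("NFKC", value)[:MAX_LENGTH]


def build_terms_query(values):
    """Hypothetical helper: build a terms query from pre-normalized values."""
    return {"query": {"terms": {FIELD: [normalize_exact(v) for v in values]}}}


# "\u00b4" is the spacing acute accent used in the original string.
body = build_terms_query(["NY", "Par\u00eds", "Li\u00b4ege"])
print(body)
```

With this approach the values sent in the terms query should already be in the form stored in the index, which is the same effect as pasting in the analyzed text by hand.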

My index:

{
  "documents": {
    "aliases": {},
    "mappings": {},
    "settings": {
      "index": {
        "number_of_shards": "5",
        "provided_name": "documents",
        "creation_date": "1497468479974",
        "analysis": {
          "filter": {
            "truncate_field": {
              "length": "1000",
              "type": "truncate"
            }
          },
          "analyzer": {
            "exact_analyzer": {
              "filter": [
                "truncate_field"
              ],
              "char_filter": "nfkc_normalizer",
              "type": "custom",
              "tokenizer": "keyword"
            }
          },
          "char_filter": {
            "nfkc_normalizer": {
              "name": "nfkc",
              "type": "icu_normalizer"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "eeDA54FiQJ2FGq3tSVRfnQ",
        "version": {
          "created": "5040199"
        }
      }
    }
  }
}

I wrote a Python script that creates the index, populates it with 3 documents, and runs the searches; it can be downloaded from here.

What am I doing wrong?

Thanks!!!

mconsoni · June 14, 2017, 7:59pm

One more thing: using nfc instead of nfkc, everything works correctly. Why?
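
One possible explanation, sketched with Python's standard unicodedata module rather than with ICU itself (assuming both implement the same Unicode normalization forms): the accent in Li´ege is the standalone ACUTE ACCENT (U+00B4), which has only a compatibility decomposition, to a space followed by COMBINING ACUTE ACCENT (U+0301). NFC applies canonical mappings only and leaves the character untouched, while NFKC rewrites it, so the indexed term no longer equals the original string. The helper name code_points is hypothetical.

```python
import unicodedata

original = "Li\u00b4ege"  # U+00B4 is the spacing ACUTE ACCENT from the post

nfc = unicodedata.normalize("NFC", original)
nfkc = unicodedata.normalize("NFKC", original)


def code_points(text):
    """Return the code points of text as U+XXXX strings."""
    return ["U+%04X" % ord(ch) for ch in text]


print("NFC :", code_points(nfc))
print("NFKC:", code_points(nfkc))
print("NFC unchanged: ", nfc == original)
print("NFKC unchanged:", nfkc == original)
```

If the index behaves the same way, an unanalyzed terms query for the original string would match the term stored under nfc but not the one stored under nfkc.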

mconsoni · June 22, 2017, 3:02pm

Any clues?

system · July 20, 2017, 3:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
