Issue with Greek language

olehd · June 22, 2017, 8:18pm

Hello,

I'm trying to resolve an issue with searching for names in Greek. If a Greek name in the index is uppercase and the search is done in lowercase, I don't receive any results. However, the same scenario works fine for English names. I tried using this custom analyzer:

    analyzer: {
        greek_analyzer: {
            type : custom,
            tokenizer : standard,
            filter : [my_greek_lowercase]
        }
    },
    filter : {
        my_greek_lowercase : {
            type : lowercase,
            name : greek
        }
    }

And I have this mapping:

        fullName:{type:string, analyzer:some_my_default_analyzer, 
                          fields: {raw:{type: String, index: not_analyzed}, 
                                       greek: { type: string, analyzer: greek_analyzer }}}

Do you have any idea what could be wrong with this?

Thank you!

dadoonet · June 22, 2017, 10:20pm

Could you provide a full recreation script as described in

It will help to better understand what you are doing.
Please, try to keep the example as simple as possible.

olehd · June 23, 2017, 1:17am

Here is the full mapping:

{
    settings: {
        index: {
            analysis: {
                analyzer: {
                    greek_analyzer: {
                        type: custom,
                        tokenizer: standard,
                        filter: [my_greek_lowercase]
                    },
                    skipVerbs: {
                        type: custom,
                        tokenizer: standard,
                        filter: [standard, lowercase]
                    }
                },
                filter: {
                    my_greek_lowercase: {
                        type: lowercase,
                        name: greek
                    }
                }
            }
        }
    },
    mappings: {
        user: {
            properties: {
                fullName: {
                    type: string, analyzer: skipVerbs,
                    fields: {
                        raw: {type: String, index: not_analyzed},
                        greek: {type: string, analyzer: greek_analyzer}
                    }
                }
            }
        }
    }
}

The value of fullName in the index could be 'ΝΙΚΟΣ', but the search for that value could be done with the lowercase 'νικος', which is the same name. With English values there is no problem: for example, the value in the index is JOHN, I search for john, and I find that name.

Could you give me your opinion about it?

Thank you!

olehd · June 23, 2017, 1:42am

This is the query I use for search:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "fullName": "*νικος*"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    }
  },
  "version": true
}

dadoonet · June 23, 2017, 7:01am

Could you provide a full script that I can just run on 5.4.2?

And yeah, use 5.x if possible.

olehd · June 23, 2017, 12:14pm

What do you mean by a full script? If I understand you correctly: we use Java to create and update the index. We define the mapping there and apply it using JestClient. I have to check with our devops about upgrading. Currently we are using version 2.3. Could it be too old?

dadoonet · June 23, 2017, 4:28pm

I mean a script I can paste in Kibana console and play with.

As explained in the very first post I linked to.

olehd · June 23, 2017, 6:51pm

I see.

This is what I use to index a Greek name in uppercase:

PUT index/type/1
{
  "fullName": "ΝΙΚΟΣ"
}

Here I try to get this name using lowercase:

POST index/type/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "fullName": "*νικος*"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    }
  },
  "version": true
}

Sorry, I have never used Kibana... Please let me know whether you need anything more to reproduce it.

Thanks a lot for the help!

dadoonet · June 26, 2017, 2:33pm

Since wildcard queries are not analyzed, νικος will certainly not match ΝΙΚΟΣ. BTW, I'm not sure what ΝΙΚΟΣ is rendered to when using a standard analyzer.

If you are using a specific analyzer and mapping, please provide them within the script.

A script is something like:

DELETE index
PUT index
{
  // Index settings and mapping here
}
PUT index/doc/1
{
  "foo": "bar"
}
GET index/_search
{
  "query": {
    "match": {
      "foo": "bar"
    }
  }
}

olehd · June 26, 2017, 3:01pm

Yes, I'm using a specific analyzer that was working well for English (it is called 'skipVerbs'). Once we needed to add Greek we hit the issue described above. I tried adding an additional analyzer for Greek, but had no luck with it.

PUT some_index
{
"settings":{
    "index": {
        "analysis" :{                    
            "analyzer": {
              "greek_analyzer": {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["my_greek_lowercase"]
              },
              "skipVerbs": {
                 "type" : "custom",
                 "tokenizer" : "standard",
                 "filter" : ["standard", "lowercase"]
                }
            },
            "filter" : { 
                "my_greek_lowercase" : {
                    "type" : "lowercase",
                    "name" : "greek"
                }
            }
        }
    }
},
"mappings" : {
    "user": {
       "properties":{
            "fullName":{
                "type": "string", 
                "analyzer":"skipVerbs", 
                "fields": {
                    "raw":{"type": "String", "index": "not_analyzed"}, 
                    "greek": { "type": "string", "analyzer": "greek_analyzer" }
                }
            }
       }
    }
}
 }

PUT some_index/some_type/1
{
  "fullName": "ΝΙΚΟΣ"
}

POST some_index/some_type/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "fullName": "*νικος*"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    }
  },
  "version": true
}

Thanks!

dadoonet · June 26, 2017, 4:26pm

If you use the _analyze API, it will tell you how your string is actually indexed.

DELETE some_index
PUT some_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "greek_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "my_greek_lowercase"
            ]
          },
          "skipVerbs": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase"
            ]
          }
        },
        "filter": {
          "my_greek_lowercase": {
            "type": "lowercase",
            "name": "greek"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "fullName": {
          "type": "text",
          "analyzer": "skipVerbs",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "greek": {
              "type": "text",
              "analyzer": "greek_analyzer"
            }
          }
        }
      }
    }
  }
}
POST some_index/_analyze
{
  "text": "ΝΙΚΟΣ",
  "analyzer": "greek_analyzer"
}

This gives:

{
  "tokens": [
    {
      "token": "νικοσ",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

νικοσ is not νικος (I can't comment on what it means in Greek though).

Which means that the following matches:

PUT some_index/doc/1
{
  "fullName": "ΝΙΚΟΣ"
}

POST some_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "fullName.greek": "*νικοσ*"
          }
        }
      ]
    }
  }
}
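
Because wildcard terms are not analyzed, the client has to send the term in the same form it was indexed in. Below is a minimal Java sketch of one way to do that. It assumes fullName.greek holds tokens lowercased one code point at a time, which is consistent with the _analyze output above, where Σ became σ. The class and method names are hypothetical and not part of any Elasticsearch client.

```java
/** Hypothetical helper; not part of any Elasticsearch client library. */
final class WildcardTerms {

    /**
     * Folds case one code point at a time, with no word-level context.
     * Uppercasing first maps a typed final sigma (ς) to Σ, which then
     * lowercases to the plain σ seen in the _analyze output.
     */
    static String foldCase(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .map(cp -> Character.toLowerCase(Character.toUpperCase(cp)))
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    /** Wraps the folded term in wildcards for a "contains" style search. */
    static String containsPattern(String userInput) {
        return "*" + foldCase(userInput) + "*";
    }

    public static void main(String[] args) {
        System.out.println(containsPattern("ΝΙΚΟΣ")); // *νικοσ*
        System.out.println(containsPattern("νικος")); // *νικοσ*
    }
}
```

With this, both the uppercase name and the lowercase name typed with a final ς end up as the same pattern, so either input can be used in the wildcard query on fullName.greek.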

olehd · June 26, 2017, 9:39pm

I see. I didn't know about the _analyze API... It seems that Java and Elasticsearch convert words to lowercase differently...

Thanks David!

dadoonet · June 28, 2017, 9:14am

Hmmm. Interesting (if I understood correctly what you meant).
I wonder if you have an encoding issue in your Java application. Are you using UTF-8?

olehd · June 28, 2017, 3:42pm

Yes, we use UTF-8.
You can try lowercasing the name 'ΝΙΚΟΣ' on your machine and compare the results. But yes, the issue is very interesting. It's a pity that I have no time to investigate it further...
If you have a chance and wish to look into it, please let me know the results.

Thanks!

dadoonet · June 28, 2017, 4:10pm

I looked at the Lucene source code.

The LowerCase filter does that: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/LowerCaseFilter.java#L42-L49

  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      CharacterUtils.toLowerCase(termAtt.buffer(), 0, termAtt.length());
      return true;
    } else
      return false;
  }

Which calls: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/CharacterUtils.java#L47-L62

  /**
   * Converts each unicode codepoint to lowerCase via {@link Character#toLowerCase(int)} starting 
   * at the given offset.
   * @param buffer the char buffer to lowercase
   * @param offset the offset to start at
   * @param limit the max char in the buffer to lower case
   */
  public static void toLowerCase(final char[] buffer, final int offset, final int limit) {
    assert buffer.length >= limit;
    assert offset <=0 && offset <= buffer.length;
    for (int i = offset; i < limit;) {
      i += Character.toChars(
              Character.toLowerCase(
                  Character.codePointAt(buffer, i, limit)), buffer, i);
     }
  }

Which in turn calls java.lang.Character.

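Indeed, "ΝΙΚΟΣ".toLowerCase() produces νικος, whereas lowercasing character by character gives a different result:

String text = "ΝΙΚΟΣ";
StringBuilder lowered = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    // Character.toLowerCase sees one char at a time, so Σ always maps to σ
    lowered.append(Character.toLowerCase(text.charAt(i)));
}
System.out.println(lowered);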

This produces νικοσ.

Can anyone who speaks Greek tell which form is correct?

dadoonet · June 28, 2017, 4:33pm

I spoke with @dliappis about this and actually the real lowercase form should be νίκος.

For Greek you should probably use the analysis-icu plugin instead.
Specifically the ICU transform token filter: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-transform.html

Apparently (I have never used it), you can do lowercase transformation, as explained in http://userguide.icu-project.org/transforms/general with Any-Lower.
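
As a rough client-side stand-in (this is not the analysis-icu plugin and does not use ICU), the accent and final sigma differences can also be folded away with the JDK alone. The sketch below assumes it is acceptable to treat νίκος, νικος and ΝΙΚΟΣ as the same search key. The class name and the choice of σ as the canonical sigma are assumptions made for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper; a JDK-only approximation, not the ICU transform. */
final class GreekFolding {

    /** Lowercases, strips combining marks such as the tonos, and unifies sigma. */
    static String fold(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        // NFD splits ί into ι plus a combining acute accent
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        // Drop every combining mark left behind by the decomposition
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        // Map final sigma (U+03C2) to plain sigma (U+03C3)
        return withoutMarks.replace('\u03C2', '\u03C3');
    }

    public static void main(String[] args) {
        System.out.println(fold("Νίκος"));
        System.out.println(fold("ΝΙΚΟΣ"));
        System.out.println(fold("νικος"));
    }
}
```

All three inputs print the same key, so the same function would have to be applied to both the stored value and the search term for the comparison to work.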

HTH.

olehd · June 29, 2017, 12:06am

I didn't play with the analyzers; I fixed it in Java instead. I introduced a new property in the index that holds the lowercase value, and I convert the value to lowercase in Java (no Elasticsearch analyzers needed). My fix should also resolve the issue for other similar languages in the future.
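
A minimal sketch of that approach, for anyone landing here later. The field name fullNameLower and the class are hypothetical, and the sketch assumes the extra field is mapped as not analyzed so that the wildcard is compared against the stored value as is. The important part is that the indexed value and the search term go through the same lowercasing function.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the "lowercase copy" workaround. */
final class LowercaseCopy {

    /** Single normalization used at both index time and query time. */
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    /** Builds the document source with an extra lowercase property. */
    static Map<String, Object> toDocument(String fullName) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("fullName", fullName);
        doc.put("fullNameLower", normalize(fullName)); // hypothetical field name
        return doc;
    }

    /** Builds the wildcard term to run against the lowercase property. */
    static String wildcardTerm(String userInput) {
        return "*" + normalize(userInput) + "*";
    }

    public static void main(String[] args) {
        System.out.println(toDocument("ΝΙΚΟΣ"));
        System.out.println(wildcardTerm("νικος"));
    }
}
```

One caveat: String.toLowerCase is context sensitive for Σ, so a partial term that ends in Σ (say the first three letters of a longer uppercase name) lowercases to ς and may not match a stored value where that letter sits mid-word as σ.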

Thanks a lot for the help!
I appreciate it a lot!

system · July 27, 2017, 12:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
