Issue with Greek language


(Oleh) #1

Hello,

I am trying to resolve an issue with searching for Greek names. If a Greek name in the index is uppercase and the search is done in lowercase, I don't get any results. However, the same scenario works fine for English names. I tried the following custom analyzer:

    analyzer: {
        greek_analyzer: {
            type : custom,
            tokenizer : standard,
            filter : [my_greek_lowercase]
        }
    },
    filter : {
        my_greek_lowercase : {
            type : lowercase,
            name : greek
        }
    }

And I have this mapping:

    fullName: {
        type: string,
        analyzer: some_my_default_analyzer,
        fields: {
            raw: { type: String, index: not_analyzed },
            greek: { type: string, analyzer: greek_analyzer }
        }
    }

Do you have any idea what could be wrong with this?

Thank you!


(David Pilato) #2

Could you provide a full recreation script as described in

It will help to better understand what you are doing.
Please, try to keep the example as simple as possible.


(Oleh) #3

Here is the full mapping:

{
settings:{
    index: {
        analysis :{                    
            analyzer: {
              greek_analyzer: {
                type : custom,
                tokenizer : standard,
                filter : [my_greek_lowercase]
              },
              skipVerbs: {
                type : custom,
                tokenizer : standard,
                filter : [standard, lowercase]
              }
            },
            filter : { 
                my_greek_lowercase : {
                    type : lowercase,
                    name : greek
                }
            }
        }
    }
},
mappings : {
    user: {
       properties:{
            fullName:{
                type:string, analyzer:skipVerbs, 
                fields: {
                    raw:{type: String, index: not_analyzed}, 
                    greek: { type: string, analyzer: greek_analyzer }
                }
            }
       }
    }
}
}

The value of fullName in the index could be 'ΝΙΚΟΣ', but the search for that value could be done in lowercase, 'νικος', which is the same name. With English values there is no problem: for example, the value in the index is JOHN, I search for john, and I find that name.

Could you give me your opinion about it?

Thank you!


(Oleh) #4

This is the query I use for the search:

{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "fullName": "*νικος*"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    }
  },
  "version": true
}

(David Pilato) #5

Could you provide a full script that I can just run on 5.4.2?

And yeah, use 5.x if possible.


(Oleh) #6

What do you mean by the full script? We use Java to create and update the index: we define the mapping there and apply it with JestClient. I have to check with our devops about upgrading. Currently we are using version 2.3. Could it be too old?


(David Pilato) #7

I mean a script I can paste in Kibana console and play with.

As explained in the very first post I linked to.


(Oleh) #8

I see.

This is what I use to index a Greek name in uppercase:

PUT index/type/1
{
  "fullName": "ΝΙΚΟΣ"
}

Here I try to get this name using lowercase:

POST index/type/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "fullName": "*νικος*"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    }
  },
  "version": true
}

Sorry, I have never used Kibana... Please let me know whether you need anything more to reproduce it.

Thanks a lot for the help!


(David Pilato) #9

Since wildcard queries are not analyzed, νικος will certainly not match ΝΙΚΟΣ. BTW, I'm not sure what ΝΙΚΟΣ is rendered to when using the standard analyzer.

If you are using a specific analyzer and mapping, please provide them within the script.

A script is something like:

DELETE index
PUT index
{
  // Index settings and mapping here
}
PUT index/doc/1
{
  "foo": "bar"
}
GET index/_search
{
  "query": {
    "match": {
      "foo": "bar"
    }
  }
}

(Oleh) #10

Yes, I'm using a specific analyzer, called 'skipVerbs', which was working well for English. Once we needed to add Greek, we hit the issue described above. I tried playing around with an additional analyzer for the Greek language, but had no luck with it.

PUT some_index
{
"settings":{
    "index": {
        "analysis" :{                    
            "analyzer": {
              "greek_analyzer": {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["my_greek_lowercase"]
              },
              "skipVerbs": {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["standard", "lowercase"]
              }
            },
            "filter" : { 
                "my_greek_lowercase" : {
                    "type" : "lowercase",
                    "name" : "greek"
                }
            }
        }
    }
},
"mappings" : {
    "user": {
       "properties":{
            "fullName":{
                "type": "string", 
                "analyzer":"skipVerbs", 
                "fields": {
                    "raw":{"type": "String", "index": "not_analyzed"}, 
                    "greek": { "type": "string", "analyzer": "greek_analyzer" }
                }
            }
       }
    }
}
}

PUT some_index/some_type/1
{
  "fullName": "ΝΙΚΟΣ"
}

POST some_index/some_type/_search
{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "fullName": "*νικος*"
              }
            }
          ],
          "minimum_should_match": 1
        }
      }
    }
  },
  "version": true
}

Thanks!


(David Pilato) #11

If you use the _analyze API, it will tell you how your string is actually indexed.

DELETE some_index
PUT some_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "greek_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "my_greek_lowercase"
            ]
          },
          "skipVerbs": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "standard",
              "lowercase"
            ]
          }
        },
        "filter": {
          "my_greek_lowercase": {
            "type": "lowercase",
            "name": "greek"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "fullName": {
          "type": "text",
          "analyzer": "skipVerbs",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "greek": {
              "type": "text",
              "analyzer": "greek_analyzer"
            }
          }
        }
      }
    }
  }
}
POST some_index/_analyze
{
  "text": "ΝΙΚΟΣ",
  "analyzer": "greek_analyzer"
}

This gives:

{
  "tokens": [
    {
      "token": "νικοσ",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

νικοσ is not νικος (I can't comment on what it means in Greek though). :smiley:
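
The two strings are easy to confuse by eye, so printing their code points can help. The following is a minimal sketch; the class and method names are made up for illustration, and the two strings (νικοσ and νικος) are written as `\u` escapes so the example does not depend on the source-file encoding:

```java
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Elasticsearch or Lucene.
class SigmaForms {
    // Token reported by _analyze: nu, iota, kappa, omicron, regular sigma (U+03C3).
    static final String INDEXED = "\u03BD\u03B9\u03BA\u03BF\u03C3";
    // Term typed in the query: same letters, but ending in final sigma (U+03C2).
    static final String QUERIED = "\u03BD\u03B9\u03BA\u03BF\u03C2";

    // Renders every code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(describe(INDEXED)); // U+03BD U+03B9 U+03BA U+03BF U+03C3
        System.out.println(describe(QUERIED)); // U+03BD U+03B9 U+03BA U+03BF U+03C2
        System.out.println(INDEXED.equals(QUERIED)); // false
    }
}
```

Only the last code point differs: the analyzer produced the regular sigma, while the query uses the word-final form.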

Which means that the following matches, because the unanalyzed wildcard term is written exactly as the indexed token νικοσ:

PUT some_index/doc/1
{
  "fullName": "ΝΙΚΟΣ"
}

POST some_index/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "fullName.greek": "*νικοσ*"
          }
        }
      ]
    }
  }
}

(Oleh) #12

I see. I didn't know about the _analyze API... It seems that Java and Elasticsearch convert words to lowercase differently...

Thanks David!


(David Pilato) #13

Hmmm. Interesting (if I understood correctly what you meant).
I wonder if you have an encoding issue in your Java application. Are you using UTF-8?


(Oleh) #14

Yes, we use UTF-8.
You can try lowercasing the name 'ΝΙΚΟΣ' on your machine and compare the results. But yes, the issue is very interesting. It's a pity that I have no time to investigate it further...
If you get a chance and want to take a look at it, please let me know the results :slight_smile:

Thanks!


(David Pilato) #15

I looked at the Lucene source code.

The LowerCaseFilter does this: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/LowerCaseFilter.java#L42-L49

  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      CharacterUtils.toLowerCase(termAtt.buffer(), 0, termAtt.length());
      return true;
    } else
      return false;
  }

Which calls: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/analysis/CharacterUtils.java#L47-L62

  /**
   * Converts each unicode codepoint to lowerCase via {@link Character#toLowerCase(int)} starting 
   * at the given offset.
   * @param buffer the char buffer to lowercase
   * @param offset the offset to start at
   * @param limit the max char in the buffer to lower case
   */
  public static void toLowerCase(final char[] buffer, final int offset, final int limit) {
    assert buffer.length >= limit;
    assert offset <=0 && offset <= buffer.length;
    for (int i = offset; i < limit;) {
      i += Character.toChars(
              Character.toLowerCase(
                  Character.codePointAt(buffer, i, limit)), buffer, i);
     }
  }

Which then calls the java.lang.Character methods.

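Indeed, "ΝΙΚΟΣ".toLowerCase() produces νικος, whereas lowercasing the string one character at a time gives a different result:

String text = "ΝΙΚΟΣ";
StringBuilder lowered = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    // Lowercase each char in isolation, as the Lucene code above does
    lowered.append(Character.toLowerCase(text.charAt(i)));
}
System.out.println(lowered);

This produces νικοσ.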
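
The two behaviours can be put side by side. This is a sketch, not Lucene code: `perCodePoint` is a hypothetical helper that mimics what `CharacterUtils.toLowerCase` does, and the input is ΝΙΚΟΣ written as `\u` escapes. `String.toLowerCase` sees the whole word and applies the Unicode final-sigma rule, while lowercasing one code point at a time has no context and always yields the regular sigma.

```java
import java.util.Locale;

// Hypothetical comparison helper; names are illustrative only.
class LowercaseComparison {
    // Greek capitals nu, iota, kappa, omicron, sigma: the indexed value from the thread.
    static final String UPPER = "\u039D\u0399\u039A\u039F\u03A3";

    // Whole-string lowercasing: context-aware, so a word-final capital sigma becomes U+03C2.
    static String wholeString(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Code point by code point, in the style of Lucene's CharacterUtils.toLowerCase:
    // no context is available, so capital sigma always becomes U+03C3.
    static String perCodePoint(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> out.appendCodePoint(Character.toLowerCase(cp)));
        return out.toString();
    }

    public static void main(String[] args) {
        String a = wholeString(UPPER);
        String b = perCodePoint(UPPER);
        System.out.println(a.equals(b)); // false: only the last character differs
        System.out.printf("U+%04X vs U+%04X%n",
                (int) a.charAt(a.length() - 1), (int) b.charAt(b.length() - 1));
    }
}
```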

Can anyone who speaks Greek tell which form is correct?


(David Pilato) #16

I spoke with @dliappis about this and actually the real lowercase form should be νίκος.
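
If users type the accented form νίκος, it will not match an unaccented token either, and Greek words in capitals such as ΝΙΚΟΣ are typically written without the accent. As a rough plain-JDK illustration (an assumption made for this sketch, not a description of any Elasticsearch filter), accents can be removed by decomposing the string and dropping the combining marks. `stripAccents` is a made-up name, and the input is written as `\u` escapes:

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not an Elasticsearch or Lucene API.
class AccentFolding {
    // Decomposes to NFD so an accented letter becomes base letter + combining mark,
    // then deletes all combining marks (Unicode general category M).
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // nu, iota with tonos (U+03AF), kappa, omicron, final sigma
        String accented = "\u03BD\u03AF\u03BA\u03BF\u03C2";
        // Same word with a plain iota (U+03B9)
        String plain = "\u03BD\u03B9\u03BA\u03BF\u03C2";
        System.out.println(stripAccents(accented).equals(plain)); // true
    }
}
```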

For the Greek language you should probably use the analysis-icu plugin instead, specifically the ICU transform token filter: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-transform.html

Apparently (I have never used it), you can do a lowercase transformation with Any-Lower, as explained in http://userguide.icu-project.org/transforms/general.

HTH.


(Oleh) #17

I didn't play with the analyzers; instead I made the fix in Java. I introduced a new property in the index that holds the lowercase value, and I convert the value to lowercase in Java, so no Elasticsearch analyzer is needed. My fix should also resolve the issue for other similar languages in the future.
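
A rough sketch of what such a fix could look like. It assumes a hypothetical extra property, called `fullNameLower` here (the thread does not give its real name), that is not analyzed and stores a lowercased copy of the name. The key point is that the stored value and the search term pass through the same Java method, so whichever sigma form `String.toLowerCase` picks is the same on both sides:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; class, method and field names are assumptions.
class LowercaseCopyField {
    // Single normalization routine shared by indexing and searching.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Builds the document source: the original name plus its lowercased copy.
    static Map<String, Object> toDocument(String fullName) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("fullName", fullName);
        doc.put("fullNameLower", normalize(fullName));
        return doc;
    }

    // Builds the wildcard pattern to run against fullNameLower from raw user input.
    static String toWildcardPattern(String userInput) {
        return "*" + normalize(userInput) + "*";
    }
}
```

One caveat with this approach is partial input: a fragment typed in capitals that ends in Σ is lowercased to the final form ς even when the stored name continues past that letter, so folding ς to σ on both sides may be worth considering.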

Thanks a lot for the help!
I appreciate it a lot!

