I am trying to resolve an issue with searching for names in Greek. If a Greek name in the index is uppercase and the search is done in lowercase, I don't receive any results. However, in the same scenario with English names, the search works well. I tried to use this custom analyzer:
analyzer: {
    greek_analyzer: {
        type : custom,
        tokenizer : standard,
        filter : [my_greek_lowercase]
    }
},
filter : {
    my_greek_lowercase : {
        type : lowercase
        name : greek
    }
}
}
The value of fullName in the index could be 'ΝΙΚΟΣ', but the search for that value could be done in lowercase, 'νικος', which is the same name. With English values there is no problem: e.g. the value in the index is JOHN, but I search for john and I can find that name.
What do you mean by the full script? If I understand you correctly: we are using Java to create and update the index, we define the mapping there, and we update the index according to that mapping using JestClient. I have to check with our DevOps about upgrading. Currently we are using version 2.3. Could it be too old?
Since wildcard queries are not analyzed, νικος certainly does not match ΝΙΚΟΣ. By the way, I'm not sure what ΝΙΚΟΣ is rendered to when using the standard analyzer.
If you are using a specific analyzer and mapping, please provide them within the script.
A script is something like:
DELETE index
PUT index
{
  // Index settings and mapping here
}
PUT index/doc/1
{
  "foo": "bar"
}
GET index/_search
{
  "query": {
    "match": {
      "foo": "bar"
    }
  }
}
Yes, I'm using a specific analyzer that was working well for English (it is called 'skipVerbs'). Once we needed to add Greek, we hit the issue mentioned above. I tried to play around with an additional analyzer for the Greek language, but had no luck with it.
Yes, we use UTF-8.
You can try to lowercase the name 'ΝΙΚΟΣ' on your machine and compare the results. But yes, the issue is very interesting. It's a pity that I have no time to investigate it more...
If you have a chance and wish to take a look into it, please let me know the results.
/**
 * Converts each unicode codepoint to lowerCase via {@link Character#toLowerCase(int)} starting
 * at the given offset.
 * @param buffer the char buffer to lowercase
 * @param offset the offset to start at
 * @param limit the max char in the buffer to lower case
 */
public static void toLowerCase(final char[] buffer, final int offset, final int limit) {
  assert buffer.length >= limit;
  assert offset <=0 && offset <= buffer.length;
  for (int i = offset; i < limit;) {
    i += Character.toChars(
        Character.toLowerCase(
            Character.codePointAt(buffer, i, limit)), buffer, i);
  }
}
Which then calls the java.lang.Character methods.
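Indeed, "ΝΙΚΟΣ".toLowerCase() produces νικος, while lowercasing character by character produces a different result:
String text = "ΝΙΚΟΣ";
StringBuilder lowered = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
    // No word context here, so Σ always maps to σ, never to final ς
    lowered.append(Character.toLowerCase(text.charAt(i)));
}
System.out.println(lowered);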
This produces νικοσ.
Can anyone who speaks Greek tell which form is correct?
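In standard Greek orthography the word-final form of sigma is ς, so νικος is the conventional spelling. For matching purposes, though, what matters is that the indexed term and the query term end up identical. The following is only a rough sketch of one way to get there in plain Java: lowercase per code point (as the quoted loop does) and then fold ς (U+03C2) into σ (U+03C3). The class name GreekSigmaDemo and the helper foldSigma are made up for illustration and are not part of Lucene or Elasticsearch.

```java
import java.util.Locale;

final class GreekSigmaDemo {

    // Lowercases one code point at a time, like the loop quoted above.
    // There is no word context, so capital sigma always becomes σ.
    static String lowerPerCodePoint(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Hypothetical helper: lowercase, then fold final sigma ς (U+03C2)
    // into regular sigma σ (U+03C3) so both spellings compare equal.
    static String foldSigma(String text) {
        return lowerPerCodePoint(text).replace('\u03C2', '\u03C3');
    }

    public static void main(String[] args) {
        String indexed = "ΝΙΚΟΣ";
        String query = "νικος";

        System.out.println(lowerPerCodePoint(indexed));        // νικοσ
        System.out.println(indexed.toLowerCase(Locale.ROOT));  // νικος
        System.out.println(foldSigma(indexed).equals(foldSigma(query))); // true
    }
}
```

The folding step is an assumption about what is acceptable for search: it deliberately discards the distinction between σ and ς, which is fine for matching but should not be used for text that is displayed to users.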
I didn't play with analyzers any further, but I fixed it in Java. I introduced a new property in the index that holds the lowercase value, and I convert the value to lowercase in Java (no need to use Elasticsearch analyzers). My fix should also resolve the issue for other similar languages in the future.
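For anyone landing here later, the approach could look roughly like the sketch below. It is not the actual code from this project: the class name LowercasedNameField and the field name fullNameLower are hypothetical, and it assumes the extra field is mapped so that Elasticsearch does not transform it differently from the query term. The key point is that the same Java normalization is applied when indexing and when searching.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class LowercasedNameField {

    // Hypothetical field names; the real mapping is not shown in this thread.
    static final String FULL_NAME = "fullName";
    static final String FULL_NAME_LOWER = "fullNameLower";

    // One normalization shared by indexing and searching.
    // String.toLowerCase is context-aware, so a word-final Σ becomes ς.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Builds the document source with both the original and the lowercased value.
    static Map<String, Object> buildDocument(String fullName) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(FULL_NAME, fullName);
        doc.put(FULL_NAME_LOWER, normalize(fullName));
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = buildDocument("ΝΙΚΟΣ");
        String queryTerm = normalize("νικος");

        // The stored lowercase value and the normalized query term are identical.
        System.out.println(doc.get(FULL_NAME_LOWER).equals(queryTerm)); // true
    }
}
```

Note that a user who types νικοσ (with a regular sigma at the end) would still not match under this scheme; whether that case matters depends on the data and the users.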