Thanks for your answer Clint, some comments:
If you change the "fuzziness" factor to 0.5, it will probably work.
Not really, actually: a factor of 0.7 should already be enough to match
words at an edit distance of 1.
I don't understand exactly what that number represents, so can't give you
more than a trial-and-error approach
Just for the sake of providing info about this topic (this is what I
know so far; most likely Kimchy or some other Lucene expert will know
the right answer):
The 'fuzziness' factor refers to the 'minimumSimilarity' parameter of
a Lucene FuzzyQuery (docs.lucene.apache.org/core/3_2_0/api/all/org/apache/lucene/search/Query.html):
for a minimumSimilarity of 0.7, a term of the same length as the query
term is considered similar to the query term if the edit distance
between both terms is less than
length(term)*(1-0.7)
where the distance value is based on an implementation of the
'Levenshtein Distance' algorithm (http://www.merriampark.com/ld.htm).
Thus, the LD between "bateria" and "batería" is 1 (just one char change)
and length('batería')*0.3 = 2.1 > 1, so "bateria" should count as similar.
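
To make the arithmetic above concrete, here is a minimal Python sketch. It is only an illustration of the rule as I described it for terms of equal length, not Lucene's actual code; the function names `levenshtein` and `is_similar` are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_similar(query_term: str, candidate: str, min_similarity: float) -> bool:
    # Assumption: the rule from this post for equal-length terms,
    # i.e. distance < length(term) * (1 - minimumSimilarity).
    # This is a sketch of that rule, not Lucene's implementation.
    threshold = len(query_term) * (1 - min_similarity)
    return levenshtein(query_term, candidate) < threshold


# "bater\u00eda" is "batería" written with an explicit escape for the accent.
print(levenshtein("bater\u00eda", "bateria"))      # edit distance
print(is_similar("bater\u00eda", "bateria", 0.7))  # check against the 0.7 threshold
```

Under this rule a stricter factor such as 0.9 would reject the same pair, since 7*(1-0.9) is below 1.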
That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.
Totally agree; it's just that in my case I have to work with a system
that is already in production with 50M docs indexed, so I cannot
recreate the index just to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs, and
switch the searches over to the new field.
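
As a rough illustration of why a folded field would make both spellings hit the same docs, here is a small Python sketch. It only approximates the effect of accent folding with the standard `unicodedata` module; it is not the Lucene 'asciifolding' filter, which covers many more characters, and `fold_accents` is a hypothetical helper name.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining accents, approximating what an ASCII-folding step does."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings fold to the same token, so an exact match is enough.
print(fold_accents("bater\u00eda") == fold_accents("bateria"))
```

With the folding done once at index time (and again on the query text), no fuzzy expansion is needed at search time.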
Thanks for your great support,
Frederic
On 17 Jan, 08:45, Clinton Gormley cl...@traveljury.com wrote:
I've been performing some searches (ES 0.18.5) on text with accents
(in Spanish) but, instead of using the 'asciifolding' filter at
indexing time, I'd like to get similar results using fuzzy queries
(apart from also getting results for similar words).
I want to search docs using one or more words, based on a free text
field called 'title', and I'd like to get the same results for both
"bateria" and "batería", for instance. The query is:
"query" : {
    "text" : {
        "title" : {
            "query" : "batería",
            "type" : "boolean",
            "operator" : "AND",
            "fuzziness" : "0.7",
            "max_expansions" : 3
        }
    }
}
AFAIK a word with one accented letter is at distance '1' from the same
word with no accent, is this correct? If so, the query should consider
all docs that contain, in this case, "bateria" and "batería", right?
If you change the "fuzziness" factor to 0.5, it will probably work. I
don't understand exactly what that number represents, so can't give you
more than a trial-and-error approach
That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.
clint