Problem searching queries with accents

Felipe_Hummel · January 16, 2012, 4:24pm

Hi, I'm indexing Brazilian Portuguese text that contains accents. To remove
them I used the asciifolding filter. My "test" index settings are as follows:

{
    "test": {
        "settings": {
            "index.analysis.analyzer.default.filter.0": "standard",
            "index.analysis.analyzer.default.tokenizer": "standard",
            "index.analysis.analyzer.default.filter.1": "lowercase",
            "index.analysis.analyzer.default.filter.2": "stop",
            "index.analysis.analyzer.default.filter.3": "asciifolding",
            "index.number_of_shards": "1",
            "index.number_of_replicas": "0"
        }
    }
}

I indexed the word "não". When I search "nao" (no accent) the document is
retrieved. If I search for "não" no document is retrieved.

Something wrong with my configuration?

I'm using Curl to query elasticsearch.

Thanks

Felipe Hummel

Clinton_Gormley · January 16, 2012, 6:18pm

I indexed the word "não". When I search "nao" (no accent) the document
is retrieved. If I search for "não" no document is retrieved.

How are you searching? I bet you're using a 'term' query, which isn't
analyzed. Change that to a 'text' query, and it should work.

clint
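
To illustrate the difference Clint describes, here is a minimal Python sketch. It is not Elasticsearch code: `fold` only approximates the asciifolding filter by stripping combining marks, and `analyze` is a hypothetical stand-in for the lowercase + asciifolding chain from the settings above (the stop filter is left out).

```python
import re
import unicodedata


def fold(token):
    """Approximate asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Tokenize on word characters, lowercase, then fold accents."""
    return [fold(tok.lower()) for tok in re.findall(r"\w+", text)]


word = "n\u00e3o"  # "não"

# Index time: the analyzer stores the folded token "nao".
indexed_terms = set(analyze(word))

# A term query looks up the raw string, so the accented form misses.
term_hit = word in indexed_terms

# An analyzed query is folded first, so the accented form hits.
analyzed_hit = all(tok in indexed_terms for tok in analyze(word))

print(indexed_terms, term_hit, analyzed_hit)
```

The unanalyzed lookup fails because the index only ever contains "nao", while the analyzed query passes through the same folding step as the indexed text.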

Felipe_Hummel · January 16, 2012, 6:59pm

That is right!

Actually I was also testing with the form:

http://localhost:9200/test/test1/_search?q=não

I suppose it just gets converted to a TermQuery. Because the following
query:

http://localhost:9200/test/teste1/_search?q=não+something

yields the right results.

Thanks

Felipe Hummel

Clinton_Gormley · January 16, 2012, 7:05pm

Actually I was also testing with the form:

http://localhost:9200/test/test1/_search?q=não

Actually, that gets converted to a query_string query against the _all
field, which should have worked.

I wonder if it was a problem with your encoding.

Does this work?

curl -XGET 'http://127.0.0.1:9200/test/test1/_search?pretty=1&q=n%C3%A3o'

clint

Felipe_Hummel · January 16, 2012, 7:09pm

You're right, it must be some encoding problem. The URL-encoded version
works as expected.

Felipe Hummel
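
For anyone hitting the same encoding problem, a short Python sketch of what "URL-encoded" means here. It uses only the standard library; the Latin-1 variant is an assumption about what a misconfigured terminal might send, not something confirmed in this thread.

```python
from urllib.parse import quote, unquote

word = "n\u00e3o"  # "não"

# Percent-encode the UTF-8 bytes of the word (the default for quote()).
utf8_encoded = quote(word)

# The same word encoded as Latin-1 yields a different byte sequence,
# which a server expecting UTF-8 would not decode back to "não".
latin1_encoded = quote(word, encoding="latin-1")

url = "http://127.0.0.1:9200/test/test1/_search?pretty=1&q=" + utf8_encoded

print(utf8_encoded)    # n%C3%A3o
print(latin1_encoded)  # n%E3o
print(url)
```

Passing the UTF-8 percent-encoded form in the query string avoids depending on how the shell or terminal encodes the raw character.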

Frederic · January 16, 2012, 8:25pm

Hi Clint, I'd like to take advantage of this thread to ask a quite
similar question:

I've been performing some searches (ES 0.18.5) using accents as well
(in Spanish) but, instead of using the 'asciifolding' filter at
indexing time, I'd like to get similar results using fuzzy queries
(apart from also getting results for similar words).

I want to search docs using one or more words, based on a free text
field, called 'title', and I'd like to get the same results for both,
for instance, "bateria" and "batería" words. The query is:

  "query" : {
    "text" : {
      "title" : {
        "query" : "batería",
        "type" : "boolean",
        "operator" : "AND",
        "fuzziness" : "0.7",
        "max_expansions" : 3
      }
    }
  }

AFAIK a word with one accented letter is at distance '1' from the same
word with no accent, is this correct? If so, the query should consider
all docs that contain, in this case, "bateria" and "batería", right?

Right now I'm getting a different number of results for each word, and
I'm not sure what the reason could be.

Thanks in advance

Frederic

Clinton_Gormley · January 17, 2012, 11:45am

I've been performing some searches (ES 0.18.5) using accents as well
(in Spanish) but, instead of using the 'asciifolding' filter at
indexing time, I'd like to get similar results using fuzzy queries
(apart from also getting results for similar words).

I want to search docs using one or more words, based on a free text
field, called 'title', and I'd like to get the same results for both,
for instance, "bateria" and "batería" words. The query is:

  "query" : {
    "text" : {
      "title" : {
        "query" : "batería",
        "type" : "boolean",
        "operator" : "AND",
        "fuzziness" : "0.7",
        "max_expansions" : 3
      }
    }
  }

AFAIK a word with one accented letter is at distance '1' from the same
word with no accent, is this correct? If so, the query should consider
all docs that contain, in this case, "bateria" and "batería", right?

If you change the "fuzziness" factor to 0.5, it will probably work. I
don't understand exactly what that number represents, so can't give you
more than a trial-and-error approach

That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.

clint

Frederic · January 17, 2012, 4:05pm

Thanks for your answer Clint, some comments:

If you change the "fuzziness" factor to 0.5, it will probably work.

Not really, actually: a factor of 0.7 should be enough to match
words at a distance of 1.

I don't understand exactly what that number represents, so can't give you
more than a trial-and-error approach

Just for the sake of providing info on this topic (this is what I
know so far; most likely Kimchy or some other Lucene expert will know
the right answer):

The 'fuzziness' factor refers to the 'minimumSimilarity' parameter of
a Lucene FuzzyQuery (docs.lucene.apache.org/core/3_2_0/api/all/org/
apache/lucene/search/Query.html): for a minimumSimilarity of 0.7, a
term of the same length as the query term is considered similar to the
query term if the edit distance between both terms is less than
length(term)*(1-0.7).

The distance value is based on an implementation of the
'Levenshtein Distance' algorithm (http://www.merriampark.com/ld.htm).

Thus, LD between "bateria" and "batería" is 1 (just one char change)
and length('batería')*0.3 = 2.1 > 1
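
The arithmetic above can be checked with a small Python sketch. This follows the rule exactly as stated in this post for terms of equal length; it is not Lucene's actual FuzzyQuery implementation, and `is_similar` is a hypothetical helper name.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_similar(query, candidate, min_similarity):
    """Rule from the post: distance < length(term) * (1 - minimumSimilarity)."""
    return levenshtein(query, candidate) < len(query) * (1 - min_similarity)


query, candidate = "bater\u00eda", "bateria"  # "batería" vs "bateria"

print(levenshtein(query, candidate))      # 1
print(is_similar(query, candidate, 0.7))  # True
```

Under this rule a similarity of 0.7 tolerates one substitution in a seven-letter word, so the differing result counts would have to come from something else in the query.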

That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.

Totally agree, it's just that in my case I need to work with a system
that is already in production with 50M docs indexed, so I cannot
recreate the index to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs and
switch the searches over to the new field.

Thanks for your great support,

Frederic

Clinton_Gormley · January 17, 2012, 4:26pm

Totally agree, it's just that in my case I need to work with a system
that is already in production with 50M docs indexed, so I cannot
recreate the index to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs and
switch the searches over to the new field.

You may want to take a look at multi-fields:

clint
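
As a rough illustration of the multi-field idea, the Python sketch below builds a mapping body and prints it as JSON. It assumes the `multi_field` mapping type offered by Elasticsearch releases of that era. The type name `doc`, the sub-field name `folded` and the analyzer name `folding` are all hypothetical, and the analyzer would have to be defined in the index settings with an asciifolding filter.

```python
import json

# Hypothetical mapping: keep the original "title" field as it is and add
# a second, accent-folded view of the same text under "title.folded".
mapping = {
    "doc": {  # hypothetical type name
        "properties": {
            "title": {
                "type": "multi_field",
                "fields": {
                    # Default sub-field, named after the field itself.
                    "title": {"type": "string"},
                    # Extra sub-field using a hypothetical folding analyzer.
                    "folded": {"type": "string", "analyzer": "folding"},
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

Searches would then target `title.folded` instead of `title`. As Frederic notes, existing documents would still need to be re-indexed before the new sub-field contains anything.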

Frederic · January 17, 2012, 6:37pm

That's exactly what I need. Thanks a lot

Fred