Problem searching queries with accents


(Felipe Hummel) #1

Hi, I'm indexing Brazilian Portuguese text that contains accents. To remove
them I used the asciifolding filter. My "test" index settings are as follows:

{
  "test": {
    "settings": {
      "index.analysis.analyzer.default.filter.0": "standard",
      "index.analysis.analyzer.default.tokenizer": "standard",
      "index.analysis.analyzer.default.filter.1": "lowercase",
      "index.analysis.analyzer.default.filter.2": "stop",
      "index.analysis.analyzer.default.filter.3": "asciifolding",
      "index.number_of_shards": "1",
      "index.number_of_replicas": "0"
    }
  }
}

I indexed the word "não". When I search "nao" (no accent) the document is
retrieved. If I search for "não" no document is retrieved.

Is something wrong with my configuration?

I'm using curl to query Elasticsearch.

Thanks

Felipe Hummel


(Clinton Gormley) #2

I indexed the word "não". When I search "nao" (no accent) the document
is retrieved. If I search for "não" no document is retrieved.

How are you searching? I bet you're using a 'term' query, which isn't
analyzed. Change that to a 'text' query, and it should work.
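
To illustrate the difference, here is a minimal Python sketch of a toy inverted index. It is not Elasticsearch code: the `analyze` function is only a rough stand-in for the lowercase + asciifolding chain above (it strips combining marks, skips the stop filter, and does less than the real asciifolding filter), and `term_search` / `text_search` are hypothetical names for an unanalyzed and an analyzed lookup.

```python
import unicodedata


def analyze(text):
    """Rough stand-in for the analyzer: whitespace split, lowercase, strip accents."""
    tokens = []
    for word in text.lower().split():
        decomposed = unicodedata.normalize("NFKD", word)
        folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        tokens.append(folded)
    return tokens


# Toy inverted index: folded term -> set of document ids.
index = {}


def index_doc(doc_id, text):
    """Index time: the text always goes through the analyzer."""
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)


def term_search(term):
    """Unanalyzed lookup: the term is compared to the index exactly as typed."""
    return index.get(term, set())


def text_search(query):
    """Analyzed lookup: the query goes through the same analyzer as the documents."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits


index_doc(1, "não")
print(term_search("nao"))  # {1}: matches the folded term stored in the index
print(term_search("não"))  # set(): the accented form was never stored
print(text_search("não"))  # {1}: the query is folded to "nao" before lookup
```

The index only ever contains "nao", so an unanalyzed search for "não" cannot match, while an analyzed search folds the query first and finds the document.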

clint


(Felipe Hummel) #3

That is right!

Actually I was also testing with the form:

http://localhost:9200/test/test1/_search?q=não

I suppose it just gets converted to a TermQuery. Because the following
query:

http://localhost:9200/test/test1/_search?q=não+something

yields the right results.

Thanks

Felipe Hummel


(Clinton Gormley) #4

Actually I was also testing with the form:

http://localhost:9200/test/test1/_search?q=não

Actually, that gets converted to a query_string query against the _all
field, which should have worked.

I wonder if it was a problem with your encoding.

Does this work?

curl -XGET 'http://127.0.0.1:9200/test/test1/_search?pretty=1&q=n%C3%A3o'

clint


(Felipe Hummel) #5

You're right, it must be some encoding problem. The url encoded version
works as expected.
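
For reference, a small Python sketch of what "URL encoded" means here. The thread does not say which encoding the shell was actually sending, so the Latin-1 case below is only an assumed example of how the bytes could have gone wrong.

```python
from urllib.parse import quote, unquote

term = "não"

# Percent-encode the UTF-8 bytes of the term (quote() defaults to UTF-8).
utf8_encoded = quote(term)
print(utf8_encoded)  # n%C3%A3o

# Assumed failure mode: a Latin-1 terminal sends a single byte for "ã".
latin1_encoded = quote(term, encoding="latin-1")
print(latin1_encoded)  # n%E3o

# Build the search URL from the thread with the UTF-8 form.
url = "http://127.0.0.1:9200/test/test1/_search?pretty=1&q=" + utf8_encoded
print(url)

# A server that decodes the Latin-1 bytes as UTF-8 does not get "não" back.
misdecoded = unquote(latin1_encoded, encoding="utf-8", errors="replace")
print(misdecoded == term)  # False
```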

Felipe Hummel


(Frederic) #6

Hi Clint, I'm taking advantage of this thread to ask a closely related
question:

I've been running some searches (ES 0.18.5) on accented text as well
(in Spanish), but instead of using the 'asciifolding' filter at
index time, I'd like to get similar results using fuzzy queries
(which would also return results for similar words).

I want to search docs by one or more words in a free-text field
called 'title', and I'd like to get the same results for both
"bateria" and "batería", for instance. The query is:

  "query" : {
    "text" : {
      "title" : {
        "query" : "batería",
        "type" : "boolean",
        "operator" : "AND",
        "fuzziness" : "0.7",
        "max_expansions" : 3
      }
    }
  }

AFAIK a word with one accented letter is at distance 1 from the same
word with no accent. Is this correct? If so, the query should match
all docs that contain either "bateria" or "batería", right?

Right now I'm getting a different number of results for each, and I'm
not sure what the reason could be.

Thanks in advance

Frederic



(Clinton Gormley) #7

I've been running some searches (ES 0.18.5) on accented text as well
(in Spanish), but instead of using the 'asciifolding' filter at
index time, I'd like to get similar results using fuzzy queries
(which would also return results for similar words).

I want to search docs by one or more words in a free-text field
called 'title', and I'd like to get the same results for both
"bateria" and "batería", for instance. The query is:

  "query" : {
    "text" : {
      "title" : {
        "query" : "batería",
        "type" : "boolean",
        "operator" : "AND",
        "fuzziness" : "0.7",
        "max_expansions" : 3
      }
    }
  }

AFAIK a word with one accented letter is at distance 1 from the same
word with no accent. Is this correct? If so, the query should match
all docs that contain either "bateria" or "batería", right?

If you change the "fuzziness" factor to 0.5, it will probably work. I
don't understand exactly what that number represents, so I can't give you
more than a trial-and-error approach :)

That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.

clint


(Frederic) #8

Thanks for your answer, Clint. Some comments:

If you change the "fuzziness" factor to 0.5, it will probably work.

Not really, actually: a factor of 0.7 should be enough to match
words at a distance of 1.

I don't understand exactly what that number represents, so I can't give you
more than a trial-and-error approach :)

Just for the sake of providing info on this topic (this is what I
know so far; most likely Kimchy or some other Lucene expert will know
the right answer):

The 'fuzziness' factor refers to the 'minimumSimilarity' parameter of
a Lucene FuzzyQuery
(http://lucene.apache.org/java/3_2_0/api/all/org/apache/lucene/search/Query.html):
for a minimumSimilarity of 0.7, a term of the same length as the query
term is considered similar to the query term if the edit distance
between both terms is less than length(term)*(1-0.7).

The distance value is based on an implementation of the 'Levenshtein
Distance' algorithm (http://www.merriampark.com/ld.htm).

Thus, the LD between "bateria" and "batería" is 1 (just one character
changed), and length('batería')*0.3 = 2.1 > 1, so the two terms should match.
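
The arithmetic above can be checked with a short Python sketch. It implements the rule exactly as described in this post (edit distance less than length(term)*(1-minimumSimilarity)); it is not Lucene's actual code, which may differ in details such as prefix handling, and the function names are hypothetical.

```python
import unicodedata


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # Normalize so an accented letter counts as one character, not letter + mark.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_similar(query_term, indexed_term, min_similarity):
    """Rule as stated in the post: distance < length(term) * (1 - min_similarity)."""
    query_term = unicodedata.normalize("NFC", query_term)
    allowed = len(query_term) * (1 - min_similarity)
    return levenshtein(query_term, indexed_term) < allowed


print(levenshtein("bateria", "batería"))      # 1
print(is_similar("batería", "bateria", 0.7))  # True: 1 < 2.1
print(is_similar("sol", "sal", 0.7))          # False: 1 is not < 0.9
```

Under this rule the allowed distance shrinks with the word length, so at 0.7 a three-letter word tolerates no edits at all.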

That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.

Totally agree. It's just that in my case I'm working with a system
already in production with 50M docs indexed, so I cannot recreate the
index to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs, and
switch the searches to the new field.

Thanks for your great support,

Frederic



(Clinton Gormley) #9

Totally agree. It's just that in my case I'm working with a system
already in production with 50M docs indexed, so I cannot recreate the
index to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs, and
switch the searches to the new field.

You may want to take a look at multi-fields:

http://www.elasticsearch.org/guide/reference/mapping/multi-field-type.html
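
As a rough idea of what such a mapping might look like, here is a Python sketch that builds the JSON body. The shape follows the multi_field type of that era as best recalled, so check it against the page linked above before using it. The type name "doc", the sub-field name "folded", and the analyzer name "folded_text" are all hypothetical, and the analyzer is assumed to be already defined in the index settings with the asciifolding filter.

```python
import json

# Hypothetical names: type "doc", sub-field "folded", analyzer "folded_text".
mapping = {
    "doc": {
        "properties": {
            "title": {
                "type": "multi_field",
                "fields": {
                    # Default sub-field: keeps the existing analysis of 'title'.
                    "title": {"type": "string"},
                    # Extra sub-field analyzed with accent folding.
                    "folded": {"type": "string", "analyzer": "folded_text"},
                },
            }
        }
    }
}

# Print the body that would be sent to the put-mapping endpoint.
print(json.dumps(mapping, indent=2))
```

Searches would then target the "title.folded" field. Documents indexed before the mapping change would presumably still need to be re-indexed for the new sub-field to be populated.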

clint


(Frederic) #10

That's exactly what I need. Thanks a lot

Fred

