Search UTF-8 text for both diacritics and non-diacritics variants

Julian · July 31, 2011, 10:37am

Hi,

I am using Elastic Search to index a lot of data in Romanian,
containing specific characters in UTF-8 encoding: ș, ț, î, ă, â, Ă, Â,
Ț, Ș, Î.
The indexing works fine and so does the searching - I'm using PHP with
the elastica client.

Now I'm trying to run some searches, and I want Elastic Search to match
text both with and without diacritics.
Let me give you an example: when I search for "Bucuresti" (which is
Romanian for Bucharest), Elastic Search correctly returns the documents
containing "Bucuresti". However, I would also like to get the documents
containing "București" (the correct form, with diacritics) for that
same query.

For example, try to search for "Bucuresti" with Google - it will
return results for both "Bucuresti" and "București".

How can I do this with Elastic Search?

Thanks,
Julian.

kimchy · July 31, 2011, 3:01pm

Have you tried using the ASCII folding token filter (asciifolding) in a
custom analyzer on the relevant fields?
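
A rough sketch of what that could look like, written in Python only for illustration. The analyzer name "folding" is made up, and the exact way to attach the analyzer to a field in the mapping depends on your Elasticsearch version, so treat this as a starting point rather than a tested configuration. The fold() helper is not part of Elasticsearch; it only approximates locally what folding does to Romanian text, so you can see why "Bucuresti" and "București" end up as the same token.

```python
import json
import unicodedata

# Index settings defining a custom analyzer that lowercases tokens and then
# folds accented characters to ASCII. The analyzer name "folding" is a
# hypothetical choice; "standard", "lowercase" and "asciifolding" are
# built-in Elasticsearch components.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "folding": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    }
}


def fold(text: str) -> str:
    """Local approximation of ASCII folding (not the Elasticsearch filter).

    Decomposes each character (NFD) and drops the combining marks, so that
    e.g. 'ș' (s + combining comma below) becomes plain 's'.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # JSON body to send when creating the index (e.g. through your client).
    print(json.dumps(INDEX_SETTINGS, indent=2))
    # Both spellings reduce to the same form once folded.
    print(fold("București") == fold("Bucuresti"))
```

If the same analyzer is used at index time and at search time, both the stored text and the query are folded, so a search for either spelling should match documents containing either one. Existing documents would need to be reindexed after changing the analysis settings.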
