Phonetic search

Hi,

I am quite new to search so this might be a very naive question:

Scenario:
Imagine a source of international names (first names + last names): British
names, German names (including umlauts), and even Arabic and Russian names
(entirely different character sets), etc. I want to search all of those
names using only an English character set. The German ö should be
searchable by writing oe or simply o, and Cyrillic names should be
searchable via a phonetic search. Example: Michail Sergejewitsch Gorbatschow
= Михаил Сергеевич Горбачёв

My best bet so far is to use a phonetic analysis like the one provided by
this plugin: https://github.com/elasticsearch/elasticsearch-analysis-phonetic .
At least that's what I think; maybe there is another solution for this.

The point is: I am new to all this and I feel pretty lost. Is this even
possible with Elasticsearch alone, or do I need to transform the data at the
application level before indexing it (converting Cyrillic to Western
characters before indexing them via Elasticsearch)? Any ideas on how to do
that?

Thanks in advance!

Cheers,

Hannes

--

Character conversion is done via an ASCIIFoldingFilter in Lucene:

Not sure if it supports Cyrillic. Since it is a Lucene-supplied filter, you
can read its documentation as well:
http://lucidworks.lucidimagination.com/display/solr/Filter+Descriptions#FilterDescriptions-ASCIIFoldingFilter
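
To make the folding idea concrete, here is a minimal Python sketch of what such a character-folding step does. It is not the Lucene filter itself, only an illustration using the standard library; the helper names (fold_simple, fold_german, search_keys) and the expansion table are assumptions made up for this example. It produces both the "o" and the "oe" form of an umlaut, so either spelling can be indexed as a search key.

```python
import unicodedata

# Hypothetical expansion table for German umlauts (illustrative only,
# not taken from Lucene or Elasticsearch).
GERMAN_EXPANSIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_simple(text):
    """Strip combining marks after canonical decomposition: 'ö' -> 'o'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_german(text):
    """Expand umlauts first ('ö' -> 'oe'), then strip any remaining marks."""
    composed = unicodedata.normalize("NFC", text)
    expanded = "".join(GERMAN_EXPANSIONS.get(ch, ch) for ch in composed)
    return fold_simple(expanded)


def search_keys(name):
    """Return the lowercased folded variants under which a name could be indexed."""
    return {fold_simple(name).lower(), fold_german(name).lower()}


if __name__ == "__main__":
    print(search_keys("Müller"))
    print(fold_simple("Горбачёв"))
```

Note that stripping marks this way only removes the diaeresis from ё; the result is still Cyrillic text, so folding alone does not make Cyrillic names searchable with Latin characters.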

There is also a phonetic filter, which is provided as a plugin:

Never used it, so I do not know how it differs from existing Lucene
phonetic filters such as
http://lucidworks.lucidimagination.com/display/solr/Phonetic+Matching
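
For readers new to phonetic matching, the sketch below shows the general idea with a plain Python implementation of classic American Soundex. This is an illustration only: it is not the code used by the plugin or by Lucene, and Soundex is chosen here simply because it is short. Names that sound alike are reduced to the same four-character key, and the key is what gets compared at search time.

```python
# Letter groups of classic American Soundex; vowels, h, w and y carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex key for a name written in a-z letters."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for name in ("Meyer", "Meier", "Gorbatschow", "Gorbachev"):
        print(name, soundex(name))
```

With this sketch, "Meyer" and "Meier" share the key M600, while "Gorbatschow" (G613) and "Gorbachev" (G612) do not, so an English-oriented encoder will not necessarily bridge different transliteration conventions.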

Cheers,

Ivan


--

Hi Hannes,

there are different challenges:

  • named entity detection (person names and the corresponding person)

  • phonetic encoding of person names

  • transliteration of person names (for example cyrillic/latin)

To tackle these challenges, you may choose some of the following approaches:

a) Named entity matching, maybe against OpenCalais or Freebase.

b) Language detection (hard when only entity names of unknown origin are
given!), followed by language-specific phonetic encoding. Note that German
person names are almost unusable when encoded by Double Metaphone, so I
donated the Haase Phonetic Encoder (enhanced Kölner Phonetik) to
elasticsearch-analysis-phonetic. It gave the best results in my case.

c) Script transliteration of Cyrillic to Latin, maybe with
elasticsearch-analysis-icu. I hope the transliteration is exposed via the
ICU normalization API. See
also http://userguide.icu-project.org/transforms/general for ICU
transliteration.
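
As a rough illustration of approach c), script transliteration can be sketched in Python with a hand-written table. This is not ICU and not the ICU plugin: the mapping below is a simplified, English-style romanization of Russian letters assumed for this example only, and the names CYRILLIC_TO_LATIN and transliterate are hypothetical.

```python
import unicodedata

# Simplified, English-style romanization of the Russian alphabet
# (illustrative assumption, not an official standard and not ICU's rules).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Replace Russian Cyrillic letters with Latin ones; keep everything else."""
    out = []
    # NFC ensures letters such as 'ё' arrive as a single code point.
    for ch in unicodedata.normalize("NFC", text):
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)  # not a mapped Cyrillic letter, pass through
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("Михаил Сергеевич Горбачёв"))
```

With this table the example name becomes "Mikhail Sergeevich Gorbachyov", which differs from the German spelling "Michail Sergejewitsch Gorbatschow". That gap is why transliteration usually has to be combined with a phonetic or fuzzy matching step.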

Searching for preferred forms of person names together with their related
forms takes a lot of effort: you have to group transliterated and
phonetically equivalent forms, which may even overlap, and you need a good
threshold to reduce the noise. This is outside the scope of Elasticsearch.

In the library domain, your problem of identifying and matching person
names of various origins is very common in catalog building, when looking
for book authors. Preferred forms of author names have been manually
collected and established over many decades. Large national authority
files exist and have recently been joined together at an international
level in a project called VIAF, http://viaf.org/, which has been opened to
the public. There are Open Data releases: http://viaf.org/viaf/data/
Note that these are author names only; Freebase has many more person names:
http://www.freebase.com/
Downloads are at http://wiki.freebase.com/wiki/Data_dumps

Hope this helps.

Best regards,

Jörg

--

Thank you both for your help!

@Jörg Prante

Unfortunately this is not a list of people that can be found elsewhere
(like Freebase) so I'll have to go the hard route I guess. At least there
doesn't have to be any exact entity matching. A plain list of results
matching the query more or less is enough.

Thanks again,

Hannes

On Thursday, January 3, 2013 7:11:39 PM UTC+1, Haensel wrote:

--