I am quite new to search so this might be a very naive question:
Scenario:
Imagine a source of international names (first names + last names): British
names, German names (including umlauts), and even Arabic and Russian names
(a totally different character set), etc. I want to search all of those
names using just the English character set: so the German ö should be
searchable by writing oe or simply o, and Cyrillic names should be
searchable via a phonetic search. Example: Michail Sergejewitsch
Gorbatschow = Михаил Сергеевич Горбачёв. My best bet so far is to use
phonetic analysis as provided by this plugin
here: https://github.com/elasticsearch/elasticsearch-analysis-phonetic . At
least that's what I think; maybe there is another solution for this.
The point is: I am new to all this and I feel pretty lost. Is this even
possible with Elasticsearch alone, or do I need to transform the data at the
application level before indexing it (converting Cyrillic to Western
characters before indexing via Elasticsearch)? Any ideas on how to do
that?
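For the umlaut part alone, Elasticsearch's built-in asciifolding token filter (or the folding in the ICU plugin) already covers the ö → o mapping at index time; the ö → oe spelling is a German-specific convention that needs its own mapping. As a rough application-level sketch of both foldings (the function names are mine, not from any plugin):

```python
import unicodedata

# German umlauts have a conventional ASCII spelling ("ö" -> "oe");
# plain Unicode folding only strips the diacritic ("ö" -> "o").
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
                     "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def fold_ascii(text: str) -> str:
    """Strip diacritics: 'ö' -> 'o', 'é' -> 'e', etc."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def fold_german(text: str) -> str:
    """Apply the German 'oe' convention, then strip remaining diacritics."""
    for src, dst in GERMAN_EXPANSIONS.items():
        text = text.replace(src, dst)
    return fold_ascii(text)
```

Indexing both variants (e.g. via multi-fields, one analyzed with each folding) would let both "Jorg" and "Joerg" find "Jörg".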
named entity detection (person names and the corresponding person)
phonetic encoding of person names
transliteration of person names (for example Cyrillic/Latin)
To tackle these challenges, you may choose some of the following approaches:
a) Named entity matching, maybe against OpenCalais or Freebase
b) Language detection (hard when only entity names of unknown origin are
given!), followed by language-specific phonetic encoding. Note that German
person names are almost unusable when encoded by Double Metaphone, so I
donated the Haase phonetic encoder (enhanced Kölner Phonetik) to
elasticsearch-analysis-phonetic; it gave the best results in my case.
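To illustrate what a phonetic encoder does (Kölner Phonetik and Double Metaphone use far more elaborate rule sets than this), here is a simplified classic Soundex sketch, which maps similar-sounding consonants to the same digit so that spelling variants collide on one code:

```python
def soundex(name: str) -> str:
    """Simplified Soundex: first letter + three digits.

    Simplification vs. the standard algorithm: 'h', 'w', and 'y' are
    treated like vowels (as separators) instead of being ignored.
    """
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.lower()
    # Vowels and other unmapped letters become the placeholder "0".
    digits = [codes.get(ch, "0") for ch in name if ch.isalpha()]
    # Collapse runs of the same digit (doubled consonants code once).
    collapsed = [d for i, d in enumerate(digits)
                 if i == 0 or d != digits[i - 1]]
    # Drop the vowel placeholders, keep the first letter, pad to length 4.
    tail = [d for d in collapsed[1:] if d != "0"]
    return (name[0].upper() + "".join(tail) + "000")[:4]
```

For example, "Robert" and "Rupert" both encode to R163, which is exactly the kind of collision a phonetic search exploits.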
c) Script transliteration of Cyrillic to Latin, maybe with
elasticsearch-analysis-icu; I hope the transliteration is exposed via the
ICU normalization API. See
also http://userguide.icu-project.org/transforms/general for ICU
transliteration.
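Should the ICU transform turn out not to be exposed, a crude application-level fallback is a hand-written transliteration table. This sketch uses an incomplete, roughly BGN/PCGN-style Russian mapping and is no substitute for ICU's Cyrillic-Latin transform:

```python
# Minimal Russian Cyrillic-to-Latin table (incomplete, BGN/PCGN-like;
# prefer ICU's Cyrillic-Latin transform in a real system).
CYR2LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "e", "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts",
    "ч": "ch", "ш": "sh", "щ": "shch", "ъ": "", "ы": "y", "ь": "",
    "э": "e", "ю": "yu", "я": "ya",
}

def translit(text: str) -> str:
    """Transliterate character by character, preserving capitalization."""
    out = []
    for ch in text:
        mapped = CYR2LAT.get(ch.lower(), ch)
        if ch.isupper() and mapped:
            mapped = mapped[0].upper() + mapped[1:]
        out.append(mapped)
    return "".join(out)
```

With this table, Горбачёв comes out as "Gorbachev" and Михаил as "Mikhail"; note this is only one of several competing romanization conventions (the question itself uses the German spelling "Gorbatschow"), which is why the phonetic encoding step is still needed on top.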
Searching for preferred forms of person names together with their related
forms is a heavy effort: you have to group transliterated and phonetically
equivalent forms, which may even overlap, and you need a good threshold to
reduce the noise. This is out of scope for Elasticsearch.
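Such grouping could be sketched at the application level as a greedy clustering over a string-similarity score with a tunable threshold (the 0.8 cutoff below is arbitrary, and a real system would compare normalized/encoded forms, not raw spellings):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two name spellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_variants(names, threshold=0.8):
    """Greedy single-pass grouping: each name joins the first existing
    group whose representative (first member) is similar enough,
    otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

For example, "Gorbachev" and "Gorbachov" land in one group while "Putin" starts another; tightening or loosening the threshold trades missed variants against noise, which is exactly the tuning problem described above.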
In the library domain, your problem of identifying and matching person
names of various origins is very common in catalog building, when looking
for book authors. Preferred author name spellings have been manually
collected and established for many decades. Large national authority
files exist that have recently been joined together at an international
level, called VIAF (http://viaf.org/), and they have been opened to the
public. There are Open Data releases: http://viaf.org/viaf/data/ Note that
these are only author names; Freebase has many more person
names: http://www.freebase.com/ with downloads
at http://wiki.freebase.com/wiki/Data_dumps
Unfortunately this is not a list of people that can be found elsewhere
(like Freebase), so I'll have to go the hard route, I guess. At least there
doesn't have to be any exact entity matching; a plain list of results more
or less matching the query is enough.
Thanks again,
Hannes
On Thursday, January 3, 2013 7:11:39 PM UTC+1, Haensel wrote: