Best practices for indexing documents with alternate names in many languages


(Ævar Arnfjörð Bjarmason) #1

I have a dataset that includes a lot of records that I'm currently
re-indexing for each language. E.g. each of these will be separate
documents:

{
    "name": "London",
    "lang": "en",
    *data*
}

{
    "name": "Londra",
    "lang": "it",
    *same data*
}

I'd like to eliminate this redundancy and have something like:

{
    "names": [
        {
            "name": "London",
            "lang": "en"
        },
        {
            "name": "Londra",
            "lang": "it"
        }
    ],
    *data*
}

However, as I understand it, the inner objects will be flattened
internally into parallel multi-valued fields, losing the association
between each name and its language, effectively:

{
    "names.name": ["London", "Londra"],
    "names.lang": ["en", "it"],
    *data*
}

One alternative schema would be:

{
    "name_en": "London",
    "name_it": "Londra",
    *same data*
}
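A mapping for that flat schema might look something like the sketch
below (the "place" type name is just a placeholder, and I'm assuming
the built-in "english" / "italian" / "german" language analyzers are
the right tools here; I haven't tested this):

{
    "mappings": {
        "place": {
            "properties": {
                "name_en": { "type": "string", "analyzer": "english" },
                "name_it": { "type": "string", "analyzer": "italian" },
                "name_de": { "type": "string", "analyzer": "german" }
            }
        }
    }
}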

Or using the initial suggestion but with the nested type
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html).
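If the nested type does apply here, I imagine the mapping would be a
sketch along these lines (the "place" type and the field layout are my
guesses, not something I've tested):

{
    "mappings": {
        "place": {
            "properties": {
                "names": {
                    "type": "nested",
                    "properties": {
                        "name": { "type": "string" },
                        "lang": { "type": "string", "index": "not_analyzed" }
                    }
                }
            }
        }
    }
}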

What I'm actually trying to accomplish here is:

  • Reduce noise in my search queries: currently if I search for
    e.g. "London" I'll end up getting documents back for all the
    places that happen to be called "London" in their native
    language (which is quite a lot).

    I de-dupe these myself but would like to have ElasticSearch do that
    for me.

  • Not to have a bias towards entries that happen to have more / fewer
    translations. E.g. if I have 4 translations of London I might end
    up with a string like:

    "London Londra Lundúnir London"
    

    And I don't want London artificially inflated / deflated in score
    just because it has more duplicates. Whereas e.g. I might have no
    translations for Tripoli at all (or, more plausibly, for a city
    called "London" in the U.S.).

  • I'd like to use the language-specific indexing features of Lucene
    to e.g. apply German stemming to the German version of the names,
    and perhaps use a different tokenizer for Asian languages.

    I can't see how to do that with nested types:
    http://www.elasticsearch.org/guide/reference/mapping/nested-type.html

    However I could do that just by creating a lot of fields like:

    name_en, name_it, name_de, ...

    But then I'd have to come up with a query that searches through all
    of those, but doesn't have a bias towards documents that match the
    name in more than one language. I.e. it would try them all and just
    pick the best score.
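For that last point, what I have in mind is something like a dis_max
query, which (with "tie_breaker" left at 0) scores each document by
its single best-matching sub-query rather than summing across
languages. A sketch, assuming per-language fields as above and a
version recent enough to have the match query:

{
    "query": {
        "dis_max": {
            "tie_breaker": 0.0,
            "queries": [
                { "match": { "name_en": "London" } },
                { "match": { "name_it": "London" } },
                { "match": { "name_de": "London" } }
            ]
        }
    }
}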

Maybe there's a much better way to do this that I've missed. But
hopefully I've given enough information for the list to either
recommend one of the above, or tell me that I'm being completely stupid
and recommend something else.

Thanks in advance for any help you can give.
