Best practices for indexing documents with alternate names in many languages


(Ævar Arnfjörð Bjarmason) #1

I have a dataset that includes a lot of records that I'm currently
re-indexing for each language. E.g. each of these will be separate
documents:

{
    "name": "London",
    "lang": "en",
    *data*
}

{
    "name": "Londra",
    "lang": "it",
    *same data*
}

I'd like to eliminate this redundancy and have something like:

{
    "names": [
        {
            "name": "London",
            "lang": "en"
        },
        {
            "name": "Londra",
            "lang": "it"
        }
    ],
    *data*
}

However, as I understand it, the inner objects will be flattened
internally into parallel multi-valued fields, losing the association
between each name and its language, effectively:

{
    "names.name": ["London", "Londra"],
    "names.lang": ["en", "it"],
    *data*
}

One alternative schema would be:

{
    "name_en": "London",
    "name_it": "Londra",
    *same data*
}
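A mapping for that flat schema might look something like the sketch
below (the "place" type name is just a placeholder, and I'm assuming
the built-in "english" / "italian" / "german" language analyzers are
the right tools here; I haven't tested this):

{
    "mappings": {
        "place": {
            "properties": {
                "name_en": { "type": "string", "analyzer": "english" },
                "name_it": { "type": "string", "analyzer": "italian" },
                "name_de": { "type": "string", "analyzer": "german" }
            }
        }
    }
}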

Or using the initial suggestion but with the nested type
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html).
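If the nested type does apply here, I imagine the mapping would be a
sketch along these lines (the "place" type and the field layout are my
guesses, not something I've tested):

{
    "mappings": {
        "place": {
            "properties": {
                "names": {
                    "type": "nested",
                    "properties": {
                        "name": { "type": "string" },
                        "lang": { "type": "string", "index": "not_analyzed" }
                    }
                }
            }
        }
    }
}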

What I'm actually trying to accomplish here is:

  • Reduce noise in my search queries: currently if I search for
    e.g. "London" I'll end up getting documents back for all the
    places that happen to be called "London" in their native
    language (which is quite a lot).

    I de-dupe these myself but would like to have ElasticSearch do that
    for me.

  • Not to have a bias towards entries that happen to have more / fewer
    translations. E.g. if I have 4 translations of London I might end
    up with a string like:

    "London Londra Lundúnir London"
    

    And I don't want London artificially inflated / deflated in score
    just because it has more duplicates. Whereas e.g. I might have no
    translations for Tripoli at all (or, more plausibly, for a city
    called "London" in the U.S.).

  • I'd like to use the language-specific indexing features of Lucene
    to e.g. apply German stemming to the German version of the names,
    and perhaps use a different tokenizer for Asian languages.

    I can't see how to do that with nested types:
    http://www.elasticsearch.org/guide/reference/mapping/nested-type.html

    However I could do that just by creating a lot of fields like:

    name_en, name_it, name_de, ...

    But then I'd have to come up with a query that searches through all
    of those, but doesn't have a bias towards documents that match the
    name in more than one language. I.e. it would try them all and just
    pick the best score.
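For that last point, what I have in mind is something like a dis_max
query, which (with "tie_breaker" left at 0) scores each document by
its single best-matching sub-query rather than summing across
languages. A sketch, assuming per-language fields as above and a
version recent enough to have the match query:

{
    "query": {
        "dis_max": {
            "tie_breaker": 0.0,
            "queries": [
                { "match": { "name_en": "London" } },
                { "match": { "name_it": "London" } },
                { "match": { "name_de": "London" } }
            ]
        }
    }
}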

Maybe there's a much better way to do this that I've missed. But
hopefully I've given enough information for the list to either
recommend one of the above, or tell me that I'm being completely stupid
and recommend something else.

Thanks in advance for any help you can give.
