I have a dataset that includes a lot of records that I'm currently
re-indexing for each language. E.g. each of these will be separate
documents:
{
  "name": "London",
  "lang": "en",
  *data*
}
{
  "name": "Londra",
  "lang": "it",
  *same data*
}
I'd like to eliminate this redundancy and have something like:
{
  "names": [
    {
      "name": "London",
      "lang": "en"
    },
    {
      "name": "Londra",
      "lang": "it"
    }
  ],
  *data*
}
However as I understand it that'll be flattened to this internally:
{
  "names": [
    {
      "name": "London Londra",
      "lang": "en it"
    }
  ],
  *data*
}
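To make the flattening concrete, here is a small plain-Python sketch (not ElasticSearch code; the helper name flatten_object_array is made up for illustration). As far as I understand it, the default object type stores the values as multi-valued fields rather than as literally concatenated strings, but the effect is the same: the link between each name and its lang is lost.

```python
def flatten_object_array(field, objects):
    """Collapse a list of inner objects into one list of values per sub-field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            # Every inner object contributes to the same "field.key" list,
            # so nothing records which name belonged to which lang.
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


names = [
    {"name": "London", "lang": "en"},
    {"name": "Londra", "lang": "it"},
]
flat = flatten_object_array("names", names)
print(flat)
# {'names.name': ['London', 'Londra'], 'names.lang': ['en', 'it']}
```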
One alternative schema would be:
{
  "name_en": "London",
  "name_it": "Londra",
  *same data*
}
Another would be to use the initial suggestion, but with the nested type
(http://www.elasticsearch.org/guide/reference/mapping/nested-type.html).
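A rough sketch of how I imagine the nested variant would look, written as Python dicts that would be serialized to JSON. The type name "place" and the helper nested_name_query are my own assumptions, and "string" / "not_analyzed" are the older field settings (newer ElasticSearch versions use "text" and "keyword" instead), so treat this as an illustration rather than a tested setup.

```python
import json

# Hypothetical mapping for a document type called "place" (name assumed).
# Only "type": "nested" matters here; it keeps each inner object separate.
nested_mapping = {
    "place": {
        "properties": {
            "names": {
                "type": "nested",
                "properties": {
                    "name": {"type": "string"},
                    "lang": {"type": "string", "index": "not_analyzed"},
                },
            }
        }
    }
}


def nested_name_query(text, lang):
    """Build a nested query: `text` must match the name of the inner object for `lang`."""
    return {
        "nested": {
            "path": "names",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"names.name": text}},
                        {"term": {"names.lang": lang}},
                    ]
                }
            },
        }
    }


print(json.dumps(nested_name_query("London", "en"), indent=2))
```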
What I'm actually trying to accomplish here is:
- Reduce noise in my search queries. Currently, if I search for e.g.
  "London", I get back a separate document for every language in
  which the city happens to be called "London" (which is quite a
  lot). I de-dupe these myself, but would like to have ElasticSearch
  do that for me.
- Avoid a bias towards entries that happen to have more or fewer
  translations. E.g. if I have 4 translations of London I might end
  up with a string like:
  "London Londra Lundúnir London"
  I don't want London artificially inflated or deflated in score
  just because it has more duplicates, whereas I might have no
  translations at all for Tripoli (or, more plausibly, for a city
  called "London" in the U.S.).
- Use the language-specific indexing features of Lucene to e.g.
  apply German stemming to the German version of the names, and
  perhaps have a different tokenizer for Asian languages. I can't
  see how to do that with nested types:
  http://www.elasticsearch.org/guide/reference/mapping/nested-type.html
  However, I could do that just by creating a lot of columns like:
  name_en, name_it, name_de, ...
  But then I'd have to come up with a query that searches through all
  of those without a bias towards documents that match the name in
  more than one language. I.e. it would try them all and just pick
  the best score.
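For the per-language columns, this is roughly what I have in mind, again as Python dicts to be serialized to JSON. It is only a sketch under assumptions: the language list and the helper names are made up, "english", "italian" and "german" are the built-in language analyzers as far as I know, and I'm guessing that a dis_max query with tie_breaker set to 0 is the "try them all and keep the best score" behaviour I'm after.

```python
import json

# Assumed language code -> analyzer name pairs (built-in language analyzers).
LANG_ANALYZERS = {"en": "english", "it": "italian", "de": "german"}


def name_field_mappings(analyzers):
    """One name_<lang> field per language, each with its own analyzer."""
    # "string" is the older field type; newer versions would use "text".
    return {
        f"name_{lang}": {"type": "string", "analyzer": analyzer}
        for lang, analyzer in analyzers.items()
    }


def best_name_query(text, langs):
    """dis_max over all name_<lang> fields: the score is the best single match."""
    return {
        "dis_max": {
            # 0.0 means other matching fields add nothing to the score,
            # so extra translations should not inflate a document.
            "tie_breaker": 0.0,
            "queries": [{"match": {f"name_{lang}": text}} for lang in langs],
        }
    }


mappings = name_field_mappings(LANG_ANALYZERS)
query = best_name_query("London", LANG_ANALYZERS)
print(json.dumps({"query": query}, indent=2))
```

If I understand dis_max correctly, a document matching in both name_en and name_de would score the same as one matching only in its best field, which is the behaviour I described above.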
Maybe there's a much better way to do this that I've missed. But
hopefully I've given enough information for the list to either
recommend one of the above, or tell me that I'm being completely stupid
and recommend something else.
Thanks in advance for any help you can give.