Multiple languages documents

Gerardo_Zenobi · July 18, 2022, 10:57am

Hi there

Context

Within a particular engine I will have documents in multiple languages, e.g.:

Document 1 -> english
Document 2 -> french
Document 3 -> slovenian
...

At the same time, each of those documents could be translated into multiple languages:

Document 1 translated to: spanish and swedish
Document 2 translated to: english and vietnamese
Document 3 translated to: english and polish
...

An example document outline for document 1 with search fields being name and description:

{
  "_id": "62d52bd7e82",
  "author":"Gerardo Zenobi",

  "lang":"en",
  "name":"A Course in English",
  "description":"<p><span style=\"color: #000000;\">Welcome to our organization! Enjoy your training.</span></p>",

  "translations":[
    {
      "lang":"es",
      "name":"Un curso en Español",
      "description":"<p>Bienvenidos a nuestra organizaci&oacute;n!&nbsp; Disfruten su entrenamiento.</p>"
    },
    {
      "lang":"sv",
      "description":"<p>V&auml;lkommen till v&aring;r organisation! Njut av din utbildning.</p>",
      "name":"En kurs i svenska"
    }
  ],
  "skills":[ "Onboarding" ]
}

Question

Users will search in different languages (their mother tongue, for example), and I would like this content to be surfaced accordingly. How should I model and configure this?

Carlos_D · July 18, 2022, 11:08am

Hi @Gerardo_Zenobi !

The usual way to deal with multiple languages is:

1. Create a separate Engine for each language, with the corresponding language optimization.
2. Ingest each document translation into the corresponding language Engine.
3. Use Meta Engines to create a new Engine with all the previous engines as source engines.
4. Perform searching using the newly created Meta Engine.

That should give you both flexibility, since there is a separate Engine per language (and users could select a specific language / Engine if need be), and the best search experience for different languages, since each Engine will be tailored to a specific language's settings.
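
A minimal sketch of the setup described above, in Python. It only builds the request bodies and sends nothing. The engine names and the list of languages are hypothetical, and the body shapes (`name` and `language` for an engine, `type: "meta"` with `source_engines` for a Meta Engine) are an assumption based on the App Search engines API, so check them against the docs for your version.

```python
import json


def engine_payload(base: str, lang: str) -> dict:
    """Body for creating one language-optimised engine (hypothetical naming scheme)."""
    return {"name": f"{base}-{lang}", "language": lang}


def meta_engine_payload(name: str, source_engines: list) -> dict:
    """Body for creating a Meta Engine on top of the per-language engines."""
    return {"name": name, "type": "meta", "source_engines": list(source_engines)}


# Assumption: these three languages stand in for whatever you need to support.
languages = ["en", "es", "fr"]
engines = [engine_payload("courses", lang) for lang in languages]
meta = meta_engine_payload("multi-language-courses", [e["name"] for e in engines])

if __name__ == "__main__":
    print(json.dumps({"engines": engines, "meta_engine": meta}, indent=2))
```

Searches would then go to the Meta Engine only, while each translation is indexed into its own source engine.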

Gerardo_Zenobi · July 18, 2022, 1:47pm

Thanks for the rapid response @Carlos_D !

I see some potential problems with this approach:

- Maintaining that solution could be somewhat complex (dynamically adding engines via the API, modifying the meta engine, keeping relevance tuning up to date in all engines, ...)
  - maybe creating all engines beforehand and configuring tuning only on the meta engine would alleviate this pain ?
- Possibly less performant ? (Instead of just querying 1 engine ?)
- We have to support more languages than those supported by the language-optimized engines.
  - do we need to put all the non supported documents and/or translations in a universal engine that would also be part of the meta-engine ?

Would there be another approach that would produce good results ? (see my last paragraph about only having 1 universal engine)

In any case, we always need to treat the translations as first-class documents, correct (we can't search in nested content in a different language)? E.g. if I were to only use 1 universal engine, I would have to duplicate each document per each translation it has, is that correct ? (Document 1 from the example above would have to be indexed as 3 documents: English, Spanish and Swedish.)

Carlos_D · July 18, 2022, 4:56pm

if I were to only use 1 universal engine, I would have to duplicate each document per each translation it has, is that correct ?

That is one way of doing it; another would be to create a separate field in each document for each translation (that is, content_en, content_es, content_de, etc.) and avoid duplication. That way, duplicate results will not be possible when searching for content that is the same in different languages (like location names, part numbers, or any other language-agnostic content).

We tend to recommend the duplication and specific engine approach for multi-language documents, as it is the most flexible and tuneable of the two.

maybe creating all engines beforehand and configuring tuning only on the meta engine would alleviate this pain ?

do we need to put all the non supported documents and/or translations in a universal engine that would also be part of the meta-engine ?

Definitely! You could create separate engines beforehand for the language-optimised ones, and have another "catch-all" engine that uses universal language where docs for other languages are stored.
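
One way the routing could look, sketched in Python. The `OPTIMISED` set is a placeholder and not the real list of languages App Search optimises for, and both the engine naming scheme and the catch-all engine name are hypothetical.

```python
# Assumption: placeholder set; replace with the languages your deployment optimises for.
OPTIMISED = {"en", "es", "fr"}

# Hypothetical name of the universal "catch-all" engine.
CATCH_ALL = "courses-universal"


def target_engine(lang: str, base: str = "courses") -> str:
    """Pick the source engine a document in `lang` should be indexed into."""
    if lang in OPTIMISED:
        return f"{base}-{lang}"
    return CATCH_ALL


if __name__ == "__main__":
    for lang in ("en", "sl", "vi"):
        print(lang, "->", target_engine(lang))
```

Because all engines exist up front, adding a document in a new language never requires creating an engine at indexing time.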

Keep in mind that Meta Engines cannot be individually tuned; result settings, curations and relevance tuning apply to all its source engines.

possibly less performant ? (Instead of just querying 1 engine ?)

Elasticsearch can deal with that; it should not involve a heavy performance penalty.

In the end, the best solution to be used will depend on your content and use cases. What I would suggest is to take a look and do some proofs of concept with the following approaches:

1. A single, universal engine with duplicate documents (one doc for each language).
2. A single, universal engine without duplicate documents (each doc will have translations in different fields).
3. Multiple, language-specific engines with documents specific to the language (and another, default engine for non-language-optimized documents).
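
To make the two document shapes concrete, here is a rough Python sketch that reshapes the example document from the question. The helper names are made up for illustration, the descriptions are shortened, and the `name_es`-style field naming is just one possible convention.

```python
FIELDS = ("name", "description")  # the searchable fields from the example


def _versions(doc: dict) -> list:
    """Original-language version followed by each translation."""
    return [doc, *doc.get("translations", [])]


def _shared(doc: dict) -> dict:
    """Language-agnostic fields (id, author, skills, ...)."""
    skip = {"translations", "lang", *FIELDS}
    return {k: v for k, v in doc.items() if k not in skip}


def one_doc_per_language(doc: dict) -> list:
    """Duplicate-document shape: one document per language, all sharing the same id."""
    return [
        {**_shared(doc), "lang": v["lang"], **{f: v.get(f) for f in FIELDS}}
        for v in _versions(doc)
    ]


def one_doc_with_language_fields(doc: dict) -> dict:
    """Single-document shape: name_en, name_es, description_en, ..."""
    out = _shared(doc)
    for v in _versions(doc):
        for f in FIELDS:
            out[f"{f}_{v['lang']}"] = v.get(f)
    return out


course = {
    "_id": "62d52bd7e82",
    "author": "Gerardo Zenobi",
    "lang": "en",
    "name": "A Course in English",
    "description": "Welcome to our organization!",
    "translations": [
        {"lang": "es", "name": "Un curso en Español", "description": "Bienvenidos a nuestra organización!"},
        {"lang": "sv", "name": "En kurs i svenska", "description": "Välkommen till vår organisation!"},
    ],
    "skills": ["Onboarding"],
}

per_language = one_doc_per_language(course)
merged = one_doc_with_language_fields(course)
```

The per-language documents can go either into one universal engine or into language-specific engines; the merged document only makes sense in a single engine.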

Gerardo_Zenobi · July 19, 2022, 8:28am

Thank you @Carlos_D for the thorough answer ! Really helpful.

That way, duplicate results will not be possible when searching for content that is the same in different languages

That's a very good point to keep in mind; I had not thought about it.

Keep in mind that Meta Engines cannot be individually tuned; result settings, curations and relevance tuning apply to all its source engines.

I had noticed that we can tune a meta engine. Did you mean we can't tune a source engine at the level of a meta engine ?

Note that if going with the meta-engine approach, my idea was to leave source engines without any tuning (as they would be used only through the meta-engine) and only do the tuning in the meta engine (affecting all of its source engines; all schemas would be the same) for simpler maintainability. Is this a valid approach regarding tuning ?

We tend to recommend the duplication and specific engine approach for multi-language documents, as it is the most flexible and tuneable of the two.

With this approach we would still be at risk (depending on our fields and their values/translations) of having duplicated results, right (cf. "... when searching for content that is the same in different languages...") ?

Carlos_D · July 20, 2022, 12:35pm

You are totally correct. Meta Engines can be tuned; I was referring to the possibility of individually tuning each of its source engines. Sorry for not being clear!

With this approach we would still be at risk...of having duplicated results

Yes. A way of removing that risk is to use grouping to collapse multi language documents on a document id.

This approach needs every duplicated document to share the same id, and to have different fields for each language.

Gerardo_Zenobi · July 20, 2022, 12:58pm

Thanks once again @Carlos_D !

Yes. A way of removing that risk is to use grouping to collapse multi language documents on a document id.

In my previous response I was thinking that it would be cool if such a mechanism existed (merging by id), but it seemed a bit far out and I didn't even mention/suggest it: glad it already exists!

This approach needs every duplicated document to share the same id and to have different fields for each language.

Sorry, I got lost here. All my source engines, representing the same object but in a different language, would share the same schema/fields. What did you mean by "to have different fields for each language" ? See the example below for a meta engine courses with source engines:

Source engine courses_spanish:

{
  "_id": "123",
  "name": "Pan de masa madre 101",
  "description": "Aqui aprenderas a cocinar panes caseros de manera simple y eficaz"
}

Source engine courses_english:

{
  "_id": "123",
  "name": "Sourdough 101",
  "description": "Here you will learn how to bake sourdough in a simple and effective way."
}

Carlos_D · July 21, 2022, 4:57pm

This approach needs every duplicated document to share the same id and to have different fields for each language .

My bad, that is not necessary. I was thinking of identifying the matching documents in the group once you got the results.

Give it a try and we can check the results together!

Gerardo_Zenobi · July 26, 2022, 8:31am

Here is a sample result (so that other readers can check it out if interested).

Having grouped on:

    "group": {
        "field": "id"
    },

I got:

{
  "meta": {
    "alerts": [],
    "warnings": [],
    "precision": 2,
    "engine": {
      "name": "multi-language-courses",
      "type": "meta"
    },
    "page": {
      "current": 1,
      "total_pages": 312,
      "total_results": 6234,
      "size": 20
    },
    "request_id": "BjIgtVnfdfsfsfd"
  },
  "results": [
    {
      "skills": {
        "raw": [],
        "snippet": null
      },
      "lang": {
        "raw": "en",
        "snippet": "en"
      },
      "name": {
        "raw": "SFDC demo Integration",
        "snippet": "SFDC <em>demo</em> <em>Integration</em>"
      },
      "description": {
        "raw": "<p>In this course, you will learn :&nbsp;</p>\n<ul>\n<li>What is the SFDC integration is about</li>\n<li>How to pitch and demo it</li>\n<li>What resources you can leverage</li>\n</ul>",
        "snippet": "&lt;p&gt;In this course, you will learn :&amp;nbsp;&lt;&#x2F;p&gt;\n&lt;ul&gt;\n&lt;li&gt;What is the SFDC <em>integration</em> is about&lt;&#x2F;li"
      },
      "_meta": {
        "id": "document-id-123",
        "engine": "courses",
        "score": 199.45448
      },
      "id": {
        "raw": "courses|document-id-123",
        "snippet": null
      },
      "_group": [
        {
          "skills": {
            "raw": [],
            "snippet": null
          },
          "lang": {
            "raw": "fr",
            "snippet": "fr"
          },
          "name": {
            "raw": "SFDC demo Integration",
            "snippet": "SFDC <em>demo</em> <em>Integration</em>"
          },
          "description": {
            "raw": "<p>Dans ce module, tu apprendras :&nbsp;</p>\n<ul>\n<li>Les principes de notre int&eacute;gration avec Salesforce</li>\n<li>Comment la pitcher et faire la d&eacute;monstration</li>\n<li>Quelles ressources tu peux utiliser</li>\n</ul>",
            "snippet": "&lt;p&gt;Dans ce module, tu apprendras :&amp;nbsp;&lt;/p&gt;\n&lt;ul&gt;\n&lt;li&gt;Les principes de notre int&amp;eacute;gration"
          },
          "_meta": {
            "id": "document-id-123",
            "engine": "courses-fr",
            "score": 67.24097
          },
          "id": {
            "raw": "courses-fr|document-id-123",
            "snippet": null
          }
        }
      ],
      "_group_key": "document-id-123"
    }
  ],

  // ...

}
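
For readers who want to act on a grouped response like the one above, here is a small Python sketch of picking the version in the user's language from each group. The function name and the fallback rule (use the top hit when the preferred language is missing) are illustrative choices, not something App Search does for you; the sample is a trimmed copy of the result above.

```python
def pick_language(result: dict, preferred: str) -> dict:
    """Return the hit written in `preferred` from a grouped result, else the top hit."""
    candidates = [result, *result.get("_group", [])]
    for hit in candidates:
        if hit.get("lang", {}).get("raw") == preferred:
            return hit
    return result


# Trimmed copy of the grouped result shown above.
grouped_result = {
    "lang": {"raw": "en", "snippet": "en"},
    "name": {"raw": "SFDC demo Integration"},
    "_meta": {"id": "document-id-123", "engine": "courses", "score": 199.45448},
    "_group": [
        {
            "lang": {"raw": "fr", "snippet": "fr"},
            "name": {"raw": "SFDC demo Integration"},
            "_meta": {"id": "document-id-123", "engine": "courses-fr", "score": 67.24097},
        }
    ],
    "_group_key": "document-id-123",
}

if __name__ == "__main__":
    for lang in ("fr", "de"):
        hit = pick_language(grouped_result, lang)
        print(lang, "->", hit["_meta"]["engine"])
```

Applied to every entry in `results`, this gives one hit per `_group_key`, shown in the user's language when a translation exists.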

