Seeking advice on our index schema design

We are building a search platform that will provide indexing and searching for multiple sites (possibly 100+), with multi-language support (about 50 languages).
Some basic requirements:

  1. All sites will share some common index fields, but each site may also have site-specific fields.
  2. We may show mixed search results from multiple sites.
  3. When a new site is added, the indices of existing sites should not need to be updated.

Here are the options we are considering so far.

  1. Build a separate index for each site
    Cons:

    1. Hard to control the relevance of cross-site search results, since different sites are in different indices.
    2. Hard to maintain: 100+ sites means 100+ indices, which seems like too many.
    3. Too many indices might cause performance issues.
  2. Build one big index shared by all sites, with a different type per site under that index
    Cons:

    1. When a site needs to update its site-specific mapping, there is no efficient way to re-index only the docs of that type. In ES 1.x it is possible to delete the old type and add a new one, but in ES 2.x a type can no longer be deleted, so if one type needs a mapping update, the only way seems to be to re-create the whole index and re-index everything.

Both 1 and 2 require language-specific fields for every field that needs a language analyzer. For example, for the "title" field we need to define "title_en", "title_de", "title_fr", etc.
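To illustrate the per-language fields, here is a minimal sketch in Python that generates the field definitions instead of writing them by hand. The function name `build_language_fields` and the small language-to-analyzer table are assumptions for illustration only; the `"string"` field type matches ES 1.x/2.x mappings (later versions use `"text"`), and languages without an entry in the table fall back to the `standard` analyzer.

```python
import json


def build_language_fields(base_field, languages):
    """Return one analyzed field definition per language, e.g. title_en, title_de."""
    # Hypothetical lookup table: language code -> built-in Elasticsearch analyzer.
    # A real deployment would need an entry for each supported language.
    analyzers = {"en": "english", "de": "german", "fr": "french"}
    fields = {}
    for lang in languages:
        fields[f"{base_field}_{lang}"] = {
            "type": "string",  # ES 1.x/2.x field type; assumed from the versions discussed
            "analyzer": analyzers.get(lang, "standard"),
        }
    return fields


if __name__ == "__main__":
    # Print the "properties" fragment that would go into a mapping.
    print(json.dumps(build_language_fields("title", ["en", "de", "fr"]), indent=2))
```

The resulting dictionary would go under the `properties` key of a mapping, so adding a language means adding one entry to the table rather than editing every field by hand.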

Neither of the two options above is perfect.

Does anyone have a better solution?

I was in a similar situation a few years ago and ended up doing the index-per-site option. I disagree that it's harder to control relevance this way. In fact, I would say it's easier and contains fewer surprises. Also, it's not harder to maintain those indexes if you have some basic tools to handle migrating aliases, templates, etc. And you aren't likely going to run into performance issues with < 1000 indexes, even if you run it on a laptop.
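As a rough sketch of what a cross-site query can look like with one index per site, the Python below builds the request path and body for a search over several indices at once. The `site-<name>` index naming, the `title_<lang>` and `body_<lang>` field names, and the function itself are assumptions for illustration, not how any particular project does it. The object form of `indices_boost` used here is the one from ES 1.x/2.x; newer versions expect a list of single-entry objects instead.

```python
def build_cross_site_search(sites, text, language, boosts=None):
    """Build (path, body) for one search request spanning several per-site indices."""
    # Hypothetical naming convention: each site lives in an index called "site-<name>".
    indices = [f"site-{site}" for site in sites]
    body = {
        "query": {
            "multi_match": {
                "query": text,
                # Hypothetical language-specific fields shared by all sites.
                "fields": [f"title_{language}", f"body_{language}"],
            }
        }
    }
    if boosts:
        # Per-index boost, so one site's hits can be weighted above another's.
        body["indices_boost"] = {f"site-{site}": boost for site, boost in boosts.items()}
    # Elasticsearch accepts a comma-separated list of indices in the search URL.
    path = "/" + ",".join(indices) + "/_search"
    return path, body


if __name__ == "__main__":
    path, body = build_cross_site_search(["news", "blog"], "budget", "en", {"news": 1.5})
    print(path)
    print(body)
```

Because the boost is set per request, relevance across sites can be tuned without touching any index.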

The work I did is open-source and may actually be useful to you as is: https://github.com/GSA/i14y . It handles multiple languages, index templates, zero-downtime reindexing via aliasing, and one of the use cases is to search across multiple indexes instead of just one.
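For readers unfamiliar with zero-downtime reindexing via aliasing, the general idea can be sketched as follows: searches go through an alias, a new index is built with the updated mapping, and the alias is then moved in a single atomic request. This Python sketch only builds the body for the `_aliases` endpoint; the index and alias names are hypothetical, and this is not a description of how i14y itself implements it.

```python
def build_alias_swap(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints an alias in one atomic step."""
    return {
        "actions": [
            # Both actions are applied together, so the alias never points nowhere.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Hypothetical names: "site-news" is the alias that clients search,
    # "site-news-v1" / "site-news-v2" are the concrete versioned indices.
    print(build_alias_swap("site-news", "site-news-v1", "site-news-v2"))
```

With this pattern, changing one site's mapping only requires re-indexing that site's documents into a new index; the other sites' indices are untouched.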

The documentation around it is minimal, as it was really more "coding in the open" than "open source software", but I would be happy to help you get going with it if you like.

Many thanks, Ioren, for sharing!

Could you please share some of your capacity data?

  1. How many indices are on your cluster?
  2. Total size of your indices?

Sorry, I wrote the software ~2 years ago and am no longer involved with its operation.

Can any other experts share their comments?