One large index vs split indexes


We have a multitenant transactional SaaS application. I am busy building a prototype implementation on Elasticsearch. The idea is to index our transactional data and then use it for fast search and analytical aggregations. The current transactional collection has 10m+ documents (500 GB+). Each tenant has a uniqueId, and searches/aggregations will mostly be filtered by this id (there is some internal reporting that will be done across all tenants, but this is not the main focus of the solution).

The problem

Our system allows tenants to create custom properties. These properties are essential to each tenant and used in reporting and searching. These properties should be isolated per tenant. The custom properties should be part of the mapping as I use the mapping to dynamically present the available filter fields during runtime.

The Approach

  • Option A

This is the current way I am doing it.

I have a single large index containing all transactions for all tenants. At search time I add the uniqueId as a filter. At index time, I flatten the custom properties and add the uniqueId as a sub-object, so the fields are indexed like "CustomPropertyName.UniqueCustomerId.CustomPropertyValue". This way I can supply the uniqueId and 'build' the search field at runtime, and the field is in the mapping as required. I read in an article that a single large index is more efficient than many smaller indexes, but as that article is quite old, I am not sure whether this is still relevant.
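To make Option A concrete, here is a sketch of what a document and query might look like. The property name `Color`, tenant id `t-123`, and leaf field name `value` are hypothetical stand-ins for the "CustomPropertyName.UniqueCustomerId.CustomPropertyValue" pattern described above:

```json
// Indexed document (Option A): custom property nested under the tenant id
{
  "uniqueId": "t-123",
  "amount": 42.5,
  "Color": { "t-123": { "value": "red" } }
}

// Runtime query: filter by tenant, then search the tenant-scoped field
// built by string-concatenating the uniqueId into the field path
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "uniqueId": "t-123" } },
        { "term": { "Color.t-123.value": "red" } }
      ]
    }
  }
}
```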

  • Option B

I create a separate index for each tenant. This way I don't need to map the uniqueId as a sub-object, and the custom properties will still be isolated.
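With Option B the mapping and the queries both get simpler, since the tenant boundary is the index itself. A sketch (the index name `transactions-t-123` and property `Color` are hypothetical):

```json
// Option B: one index per tenant, custom properties mapped directly
PUT transactions-t-123
{
  "mappings": {
    "properties": {
      "Color": { "type": "keyword" }
    }
  }
}

// Search needs no tenant filter and no runtime-composed field names
GET transactions-t-123/_search
{
  "query": { "term": { "Color": "red" } }
}
```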

The question I have: is searching/aggregating on a smaller index more efficient/quicker than on a single large index? And is there a better way to deal with the custom properties? I have read a few forum posts on this topic, but none seem to answer the question.

Very much appreciate your time.

Searching, as a general concept, is always faster on a single shard. But unless you are dealing with massive amounts of data, e.g. TBs per user, then in reality it may be a difference of a few hundred milliseconds. If that's super important to you, then you should test it and see what is best for you. I'd suggest it's not something you need to be overly worried about at your scale.

Why not create an ILM pattern per user? It'll cover your mapping requirements, and if this is transactional data then you should expect growth, which ILM will also help with.
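One way the per-user ILM suggestion could look: an ILM policy that rolls indices over as they grow, attached via an index template that matches the per-tenant index pattern. All names below (`tenant-transactions-policy`, `transactions-*`, the rollover thresholds) are hypothetical, and each tenant would need its own write alias for rollover to work:

```json
// Minimal ILM policy sketch: roll a tenant's index over when it gets large or old
PUT _ilm/policy/tenant-transactions-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "30d" }
        }
      }
    }
  }
}

// Index template applying the policy to per-tenant indices
PUT _index_template/tenant-transactions
{
  "index_patterns": ["transactions-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "tenant-transactions-policy",
      "index.lifecycle.rollover_alias": "transactions"
    }
  }
}
```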

