Breaking an index into small indices with duplicate items across indices

I am facing a problem that looks pretty common from what I saw on the forum, but every case I found differs from mine.

I currently have one big index (actually one big alias split into several ILM-managed indices) which contains data available to all my clients. Some clients are specifically interested in certain items, and I'd like to let those clients (1) retrieve them and (2) run queries only on those items. Currently no item is tagged as "interesting for client X"; this is what I want to achieve, and I have thought of two approaches.

The first is to create one index per client and insert into it every document that is potentially interesting for that client. Some items are interesting for many clients and will therefore be duplicated across many of those indices. When a client wants to search his items of interest, I will query his specific index. This approach appeals to me because the index will be relatively small, so I expect the query to run much faster than on the big index, but I'm hesitant because of the duplicated data and the overhead that so many indices could generate.
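To make the duplication in this first approach concrete, here is a minimal sketch of how one document could fan out into several indices at ingest time. The index names (`items`, `items-client-<id>`) and the helper names are hypothetical and not from my actual setup; the dictionaries follow the action shape that the bulk helper of the Python Elasticsearch client accepts, but nothing here talks to a cluster.

```python
from typing import Any, Iterator


def client_index(client_id: str) -> str:
    # Hypothetical naming scheme for the per-client indices.
    return f"items-client-{client_id}"


def bulk_actions(
    doc_id: str,
    doc: dict[str, Any],
    interested_clients: list[str],
    global_index: str = "items",  # hypothetical name of the global write alias
) -> Iterator[dict[str, Any]]:
    """Yield one bulk action for the global index, plus one per interested client."""
    yield {"_index": global_index, "_id": doc_id, "_source": doc}
    for cid in interested_clients:
        # The same document body is stored again in each client's index.
        yield {"_index": client_index(cid), "_id": doc_id, "_source": doc}


actions = list(bulk_actions("1", {"title": "example item"}, ["a", "b"]))
for action in actions:
    print(action["_index"])
```

An item interesting to N clients is stored N + 1 times under this scheme, which is the storage cost I am worried about.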

The second approach is to add a field to every item that contains the list of ids of the clients potentially interested in that item. When a client wants to search his items of interest, I only append a filter on this field to the query. This approach sounds good to me because it involves no duplication of items and no creation of many indices. But I don't know how fast it will be compared with the first approach.
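As a rough sketch of what "appending a filter" could look like, the snippet below wraps an arbitrary query in a `bool` query with a `term` clause in filter context. The field name `interested_client_ids` is a placeholder I made up, and I am assuming it would be mapped as a `keyword` field; a `term` query on a field holding a list matches when any one of the values equals the given id. The code only builds the request body and does not contact a cluster.

```python
from typing import Any

# Hypothetical field name; assumed to be mapped as `keyword`.
CLIENT_FIELD = "interested_client_ids"


def restrict_to_client(query: dict[str, Any], client_id: str) -> dict[str, Any]:
    """Wrap `query` so that only documents tagged with `client_id` can match."""
    return {
        "bool": {
            "must": [query],  # the client's original query, scored as usual
            "filter": [{"term": {CLIENT_FIELD: client_id}}],  # not scored
        }
    }


user_query = {"match": {"title": "solar panels"}}
body = {"query": restrict_to_client(user_query, "client-42")}
print(body)
```

Leaving the wrapper out gives the unrestricted query against the global data, so the same index serves both cases.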

I assume the number of items interesting for many organizations is not negligible. Both approaches allow users to keep querying the global data (including items that are not potentially interesting to them), which is one of my constraints.

As I'm not an Elasticsearch expert myself, I don't know the consequences of either solution. I'd be glad to hear what you think the good and bad points of each approach are, as well as any other interesting approaches.

Both have benefits and costs, and at the end of the day they will likely perform about the same. TL;DR: use whichever you think will work best and that you can implement and maintain.

Can you explain in more detail what the benefits and costs of each approach are? I'm particularly worried about the impact on scaling.

One index per client means

  • you duplicate data
  • but it's easier to manage permissions
  • you may end up with heaps of shards which can lead to cluster inefficiencies. ILM can help there though
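To put a rough number on the shard concern, here is a back-of-the-envelope sketch. The client count is made up for illustration; the defaults of one primary and one replica per index match what recent Elasticsearch versions use when nothing else is configured. Each shard is a separate Lucene index with its own overhead, which is why the total matters.

```python
def total_shards(num_indices: int, primaries_per_index: int = 1, replicas: int = 1) -> int:
    """Total shard copies the cluster has to hold for a set of identical indices."""
    # Each primary shard has `replicas` additional copies.
    return num_indices * primaries_per_index * (1 + replicas)


# Hypothetical figure: 500 clients, one index each, default settings.
print(total_shards(500))  # 1000
```

If the per-client indices are also rolled over by ILM, `num_indices` becomes clients times retained generations, so the count grows faster than the number of clients alone suggests.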

Array of client ids in document means

  • one set of indices which is easier to manage
  • you can end up with very large arrays, and updating them is costly because changing the array means reindexing the whole document
  • filters are efficient so queries shouldn't be slowed

Personally I would go for the index per client

I don't intend to update the items. They are inserted once with the array of interested companies (I don't know yet how long it will be), and this array will not be modified afterwards.

Can you tell me more about the "heaps of shards" problem?
