Breaking index into small indices with duplicate items across indices

x41lakazam · November 16, 2022, 3:35pm

I am facing a problem that looks pretty common from what I saw on the forum, but the cases I found were always different from mine.

I currently have one big index (actually one big alias split into several ILM indices) which contains data available to all my clients. Some clients may be specifically interested in certain items, and I'd like to let those clients (1) retrieve them and (2) run queries on only those items. Currently no item is tagged as "interesting for client X"; this is what I want to achieve, and I have thought of two approaches.

The first is to create one index per client and insert into it every document that is potentially interesting for that client. Some items are interesting for many clients and will therefore be duplicated across many of those indices. When a client wants to search their items of interest, I will query their specific index. This approach appeals to me because, as the index will be relatively small, the query should run much faster than on the big index, but I'm hesitating because of the duplicated data and the overhead so many indices could generate.
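A minimal sketch of what the first approach implies at ingest time. The global alias name `items`, the per-client naming scheme `items-client-<id>` and the helper `fan_out` are all hypothetical; the output is just (index, document) pairs that could be handed to whatever bulk indexing code is already in place.

```python
from typing import Iterable, Iterator, Tuple

GLOBAL_INDEX = "items"  # hypothetical name for the existing big alias


def client_index(client_id: str) -> str:
    """Hypothetical naming scheme: one index per client."""
    return f"items-client-{client_id}"


def fan_out(doc: dict, interested_clients: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield one (index, document) pair for the global alias,
    plus one pair per client interested in the document."""
    yield GLOBAL_INDEX, doc
    for client_id in interested_clients:
        yield client_index(client_id), doc


pairs = list(fan_out({"title": "some item"}, ["acme", "globex"]))
# The same document is now written three times:
# once globally and once for each interested client.
```

The number of copies of an item grows with the number of clients interested in it, which is the duplication cost mentioned above.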

The second approach is to add a field to every item that contains the ids of the clients potentially interested in that item (the field holds a list of ids). When a client wants to search their items of interest, I only need to append a filter on this field to the query. This approach sounds good to me because no item is duplicated and no large number of indices is created. But I don't know how fast it will be compared with the first approach.
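A minimal sketch of the second approach, assuming the list of client ids lives in a field called `interested_client_ids` (a hypothetical name). It only builds the query body as a plain dictionary; sending it to the cluster is left to whatever client is already in use.

```python
import json

CLIENT_FIELD = "interested_client_ids"  # hypothetical field name


def restrict_to_client(query: dict, client_id: str) -> dict:
    """Wrap any query in a bool query with a non-scoring filter on the client field."""
    return {
        "bool": {
            "must": [query],
            # A term query on a multi-valued field matches when any value equals client_id.
            "filter": [{"term": {CLIENT_FIELD: client_id}}],
        }
    }


user_query = {"match": {"title": "quarterly report"}}
body = {"query": restrict_to_client(user_query, "acme")}
print(json.dumps(body, indent=2))
```

Queries on the global data simply skip the wrapper and send `user_query` as-is.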

I assume the number of items interesting for many organizations is not negligible. Both approaches still allow users to query the global data (including items that are not flagged as interesting for them), which is one of my constraints.

As I'm not an Elasticsearch expert myself, I don't know what the consequences of each solution are. I'd be glad to hear what you think the good and bad points of each approach are, as well as any other approaches worth considering.

warkolm · November 16, 2022, 11:07pm

Both have benefits and costs, and at the end of the day they will likely be about the same. TL;DR: use whichever you think will work best and that you can implement and maintain.

x41lakazam · November 17, 2022, 8:17am

Can you explain in more detail what the benefits and costs of each approach are? I'm particularly worried about the impact on scaling.

warkolm · November 17, 2022, 8:21am

One index per client means

- you duplicate data
- but it's easier to manage permissions
- you may end up with heaps of shards, which can lead to cluster inefficiencies. ILM can help there though
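To put rough numbers on the shard point: every index contributes its primary shards plus one copy of each per replica. The sketch below is plain arithmetic with made-up figures (500 clients, 1 primary and 1 replica per index, a shared index with 5 primaries); none of these come from the thread.

```python
def total_shards(indices: int, primaries: int = 1, replicas: int = 1) -> int:
    """Each index has `primaries` primary shards, and each primary has `replicas` copies."""
    return indices * primaries * (1 + replicas)


# Hypothetical: 500 clients, one index each, 1 primary + 1 replica.
per_client = total_shards(indices=500)
# Hypothetical: a single shared index with 5 primaries + 1 replica.
shared = total_shards(indices=1, primaries=5)
print(per_client, shared)  # 1000 10
```

If the per-client indices also roll over, the count is multiplied again by the number of retained generations per client.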

Array of client ids in document means

- one set of indices, which is easier to manage
- you can end up with some pretty serious arrays, and updating those will be very hard
- filters are efficient, so queries shouldn't be slowed
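For the array option, a sketch of what the mapping and a document might look like. The field name `interested_client_ids` and the `title` field are hypothetical. Elasticsearch has no dedicated array type, so a `keyword` field can hold one id or many without any extra mapping. The `visible_to` helper is only a local stand-in for what a term filter on that field would match.

```python
# Hypothetical index body: ids are exact-match values, so `keyword` is used.
mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "interested_client_ids": {"type": "keyword"},
        }
    }
}

# One stored copy of the item, tagged with every interested client.
doc = {"title": "some item", "interested_client_ids": ["acme", "globex"]}


def visible_to(document: dict, client_id: str) -> bool:
    """Local stand-in for a term filter: true when client_id is one of the values."""
    return client_id in document.get("interested_client_ids", [])
```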

Personally I would go for the index per client

x41lakazam · November 17, 2022, 8:33am

I don't intend to update the items: they are inserted once with the array of interested companies (I don't know how long it will be), and this array will not be modified afterwards.

Can you tell me more about the "heaps of shards" issue?

system · December 15, 2022, 8:34am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
