Scaling index/shard model: Message conversations

I run a messaging app. My past teammate set up our elasticsearch service for users to search through all their messages - where each document in ES is a message object.

Our staging server cluster has red health and I found that the database is set up for a separate index for each conversation chat. Hence, we have a total of 23,000 shards and most cannot be assigned between the 2 nodes. We use 1 primary shard and 1 replica per index. We only have 0.08Gb in our staging server...

This seems to be a simple issue of too many indices (and therefore too many shards) unnecessarily and I need to refactor it.

The basic functionality we need is to be able to search through each conversation. So if a user has 100 conversations we need to search all of them and show relevant messages from each, grouped by conversation.

Which of these options is a reasonable way to structure this to scale:
(1) Only one index, providing conversation_id as a key in each message doc to query for messages that have one of the 100 conversation_ids.
(2) Randomly assigned indices: Take the conversation_id, hash it and mod 997, assign that as the index. Each conversation then gets randomly assigned one of 997 indices. Querying would require then hashing the 100 conversation ids to find the relevant indices.
(3) Some other option

I'm not familiar enough with ElasticSearch to know the right approach. Is there a problem with having one index? Is there a benefit to splitting up indices? Will I be able to perform a query taking a list similar to SQL function conversation_id IN(conversation_ID1,conversation_ID2,....,conversation_ID100)

For that matter, can ElasticSearch perform any sort of relational queries? For example if the user wants to search for messages "from:Randy" does the name "Randy" have to be in the message document or is it okay if the message document just says "from:1104030" where 1104030 is Randy's ID corresponding to a separate user doc with his name.

Hi @Tyler_Weitzman,

sounds like you diagnosed the issue and resolution correctly.

Partitioning data on the user id might be a good idea if all or most searches include a specific user id. However, for searches across user ids, it will not benefit. Definitely worth careful thought. From your description, I would not use conversation id to partition by, since you only know the user id (or name) when searching.

Rather than manually partitioning the id space, you can use routing. This allows you to have a single index with a number of (primary) shards and let Elasticsearch handle the partitioning and routing by simply specifying _routing=<user_id> on every index, get and search request.

But it all depends on how much data you have and your response time requirements. I assume you have more on prod than staging? If you only have 0.08Gb and you do not have very low response time requirements, I would assume just having the data in one index, one shard would be fine.

In either case you should have at least one replica for redundancy.

In addition to that, it might be advisable to move to a 3-node cluster to allow master failures as well.

Elasticsearch does not have joins. Instead you can choose to either include the user name into the messages (which is fine, Elasticsearch will use limited space on that, since there is a lot of duplication) or store it in another index. For the latter, you will have to do two look ups, first to find the user-id based on input criteria and then the message search. The choice depends on how much data is associated with a user and how complicated the "from:" style queries are. From the info available here, I think I would go for just including the from-username into the message.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.