I run a messaging app. A former teammate set up our Elasticsearch service so users can search through all their messages, where each document in ES is a single message object.
Our staging cluster has red health, and I found that it was set up with a separate index for each conversation. With 1 primary shard and 1 replica per index, that comes to roughly 23,000 shards in total, most of which cannot be assigned across our 2 nodes. And we only have about 0.08 GB of data on the staging server...
This looks like a straightforward case of having too many indices (and therefore too many shards) unnecessarily, and I need to refactor it.
The basic functionality we need is to be able to search each of a user's conversations. So if a user has 100 conversations, we need to search all of them and show relevant messages from each, grouped by conversation.
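To make that concrete, here's roughly the result shape I'm after, sketched with the official Python client (assuming, purely for illustration, a single index named `messages` with a keyword `conversation_id` field and a text `body` field; all names here are placeholders, not our real schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Sketch: full-text search across one user's messages, with hits
# grouped per conversation via a terms aggregation + top_hits.
# (A real query would also filter to that user's conversation_ids.)
resp = es.search(
    index="messages",  # placeholder index name
    size=0,  # skip the flat hit list; we only want per-conversation buckets
    query={"match": {"body": "dinner plans"}},  # placeholder search text
    aggs={
        "by_conversation": {
            "terms": {"field": "conversation_id", "size": 100},
            "aggs": {"top_messages": {"top_hits": {"size": 3}}},
        }
    },
)

for bucket in resp["aggregations"]["by_conversation"]["buckets"]:
    print(bucket["key"], bucket["top_messages"]["hits"]["hits"])
```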
Which of these options is a reasonable way to structure this so that it scales:
(1) A single index, with conversation_id stored as a field in each message doc, so we can query for messages whose conversation_id is one of the user's 100 conversation_ids.
(2) Bucketed indices: take the conversation_id, hash it, and mod by 997; use the result as the index name. Each conversation then gets a stable, effectively random assignment to one of 997 indices. Querying would then require hashing the 100 conversation_ids to find the relevant indices (sketched in code after this list).
(3) Some other option
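For option 2, I imagine the index assignment working roughly like this (a sketch; the 997 bucket count comes from my proposal above, and the `messages-NNN` naming scheme is just something I made up):

```python
import hashlib

NUM_INDICES = 997  # prime bucket count from option 2 above

def index_for(conversation_id: str) -> str:
    """Stable hash of a conversation_id -> one of 997 fixed index names."""
    # Python's built-in hash() is salted per process, so use hashlib
    # to keep the assignment stable across restarts and machines.
    digest = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_INDICES
    return f"messages-{bucket:03d}"  # made-up naming scheme

def indices_for(conversation_ids: list[str]) -> list[str]:
    """Distinct indices to search for one user's conversations."""
    return sorted({index_for(cid) for cid in conversation_ids})
```

A search would then target the comma-separated list of index names returned by `indices_for`.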
I'm not familiar enough with Elasticsearch to know the right approach. Is there a problem with having just one index? Is there a benefit to splitting up indices? Will I be able to query with a list, similar to SQL's conversation_id IN (conversation_id1, conversation_id2, ..., conversation_id100)?
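Put differently, I'm hoping the single-index version of the query could look something like this terms filter, which I understand (possibly wrongly) to be ES's analogue of SQL's IN (again, all names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

conversation_ids = [f"conv-{i}" for i in range(100)]  # placeholder IDs

# My understanding: a terms filter matches docs whose field value is
# any of the given values, i.e. roughly SQL's IN (...).
resp = es.search(
    index="messages",  # placeholder single index
    query={
        "bool": {
            "must": [{"match": {"body": "search text"}}],
            "filter": [{"terms": {"conversation_id": conversation_ids}}],
        }
    },
)
```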
For that matter, can Elasticsearch perform any sort of relational queries? For example, if the user searches for messages "from:Randy", does the name "Randy" have to be stored in the message document itself, or is it okay if the message document just says "from: 1104030", where 1104030 is Randy's ID pointing to a separate user doc with his name?
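My guess from reading around is that ES doesn't really do joins, and the usual answer is to denormalize, i.e. copy the sender's display name into each message document at index time. Something like this, if that's right (the field names are mine, not our schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Denormalized message doc: the sender's display name is copied in at
# index time, so "from:Randy" becomes an ordinary field match, no join.
es.index(
    index="messages",          # placeholder index name
    id="msg-123",              # placeholder doc id
    document={
        "conversation_id": "conv-42",
        "from_id": 1104030,
        "from_name": "Randy",  # duplicated from Randy's user record
        "body": "See you at 7",
    },
)

# "from:Randy" then maps to a match on from_name, no second lookup.
resp = es.search(
    index="messages",
    query={"match": {"from_name": "Randy"}},
)
```

The obvious cost I can see is that renaming a user means updating their old messages, so I'd like to know if that trade-off is the accepted practice.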