I am creating a new user profiling system. I am using Elasticsearch to store user data for fast search.
User profiles have various data like registration-data, comment-added-data, reply-data, etc.
Whenever a user performs an action, we put the user data on a Kafka queue, and a Kafka consumer processes it and indexes it into Elasticsearch one packet at a time.
We create monthly indices and use the user's last activity time to decide which monthly index a user document goes into.
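
To illustrate the index choice, here is a minimal Python sketch. The `user-profiles` prefix, the `YYYY-MM` suffix and the function name are made up for this example and are not our real naming scheme; I am also assuming timestamps are normalized to UTC.

```python
from datetime import datetime, timezone

# Hypothetical prefix, used only for this illustration.
INDEX_PREFIX = "user-profiles"


def monthly_index_name(last_activity: datetime, prefix: str = INDEX_PREFIX) -> str:
    """Return the monthly index name for a user's last activity time."""
    if last_activity.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        last_activity = last_activity.replace(tzinfo=timezone.utc)
    utc_time = last_activity.astimezone(timezone.utc)
    # Format as <prefix>-YYYY-MM, e.g. one index per calendar month.
    return f"{prefix}-{utc_time:%Y-%m}"
```

So a user whose last activity moves from March to April would map to a different index name, which is what forces the move described next.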
Because the index is chosen by the user's last activity time, the user document has to be deleted from the previous monthly index and added to the new monthly index when the month changes.
So every time we update user data, we first fetch the user by querying all monthly indices; if the user's last activity month has changed, we delete the user doc from the previous monthly index and add it to the new monthly index.
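
The update logic is roughly the following, as a simplified Python sketch rather than our actual consumer code. `plan_update` and the action dictionaries are hypothetical names for this example; `current_index` stands for whatever the query across all monthly indices returned (`None` if the user was not found), and the real Elasticsearch calls are left out.

```python
from typing import Any, Dict, List, Optional


def plan_update(
    user_id: str,
    doc: Dict[str, Any],
    current_index: Optional[str],
    target_index: str,
) -> List[Dict[str, Any]]:
    """Decide which write operations one incoming packet needs.

    current_index: index where the user doc was found, or None if not found.
    target_index:  monthly index derived from the new last activity time.
    """
    actions: List[Dict[str, Any]] = []
    if current_index is not None and current_index != target_index:
        # The activity month changed: remove the doc from the old monthly index.
        actions.append({"op": "delete", "index": current_index, "id": user_id})
    # Always write the latest version of the doc to the target index.
    actions.append({"op": "index", "index": target_index, "id": user_id, "doc": doc})
    return actions
```

The whole decision depends on `current_index` being correct, which is exactly the lookup that goes wrong in the situation below.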
Now, Elasticsearch has a refresh interval (1 second by default), and we process one packet from Kafka at a time.
When two packets for the same user arrive back to back, we process the first packet and insert it into Elasticsearch.
While processing the second packet, we need to check whether the user doc already exists in Elasticsearch (to decide whether it has to be moved to the new monthly index), but because both packets are processed within a second, I won't find the data from the first packet in Elasticsearch yet.
So I need to sleep for 1 second every time I process new data.
I don't think this is a good approach.
What I am thinking is:
pick a chunk of data from Kafka, process it, then sleep for 1 second, then process the next chunk and sleep for 1 second again, and so on.
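
As a minimal Python sketch of that idea (not real consumer code): `process_in_chunks`, `handle` and the default chunk size are hypothetical, the packets are assumed to be an already-polled iterable of dicts, and the `sleep` parameter exists only so the pause can be swapped out when testing.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def chunked(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `size` items."""
    chunk: List[Dict] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk


def process_in_chunks(
    packets: Iterable[Dict],
    handle: Callable[[Dict], None],
    chunk_size: int = 100,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process packets chunk by chunk, pausing once after every chunk.

    Returns the number of chunks processed.
    """
    chunks_done = 0
    for chunk in chunked(packets, chunk_size):
        for packet in chunk:
            handle(packet)
        # One pause per chunk instead of one pause per packet.
        sleep(pause_seconds)
        chunks_done += 1
    return chunks_done
```

Note that this only reduces the number of pauses; two packets for the same user that land in the same chunk would still be processed within the same second.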
Is this a good approach or not?
Or is there any other solution for this?