I just joined this group today to learn more about ES. We have been using ES for many months and it has been great for our application, which makes users' comments (including tweets) searchable.
We are running 0.19.2, but for the next version of our application we plan to move to 0.19.9.
Here's my problem:
a) We decided to split our ES indices into 'fast' and 'archive' clusters. 'fast' holds the latest 'n' comments, which get moved to the 'archive' cluster based on an LRU policy.
b) An archived comment gets recreated in the fast cluster as a new comment when someone replies to it or modifies it. This introduces a duplicate hit when we search across the 2 clusters.
Options we have considered:
- Delete those comments from the archive when they are recreated in the fast cluster. This ensures each comment doc is unique across the 2 clusters. Cons: extra load on the archive cluster (search and delete).
- Post-process the hits and remove dups (this is our current implementation). Cons: we can 'lose' up to 50% of the total hits unless we replenish with another query (with a cursor), but then when do we stop? Also, this is client-side deduping.
- Get ES to do the deduping on the server side.
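
To make the second option concrete, here is a minimal sketch of client-side deduping by _id with one possible answer to the "when do we stop" question: stop once enough unique hits are collected, once both clusters return nothing, or once a page cap is reached. It is not our actual code. `fetch_page`, `wanted`, `page_size` and `max_pages` are hypothetical names, and `fetch_page(offset, size)` is assumed to wrap the two real searches and return plain dicts that carry an "_id" key. Relevance ordering across the two clusters is ignored here.

```python
from typing import Callable, Dict, List, Tuple

Hit = Dict[str, object]  # assumed shape: a dict with at least an "_id" key
Page = Tuple[List[Hit], List[Hit]]  # (fast_hits, archive_hits)


def collect_unique(
    fetch_page: Callable[[int, int], Page],  # hypothetical search wrapper
    wanted: int,
    page_size: int = 10,
    max_pages: int = 20,
) -> List[Hit]:
    """Collect up to `wanted` hits that are unique by _id across both clusters."""
    by_id: Dict[object, Hit] = {}  # dicts keep insertion order
    for page in range(max_pages):  # hard cap so the loop always terminates
        fast_hits, archive_hits = fetch_page(page * page_size, page_size)
        if not fast_hits and not archive_hits:
            break  # both clusters are exhausted
        for hit in fast_hits:
            # The fast copy is the recreated (newer) one, so it always wins,
            # even over an archived copy seen on an earlier page.
            by_id[hit["_id"]] = hit
        for hit in archive_hits:
            # Keep an archived copy only if no copy of this _id is held yet.
            by_id.setdefault(hit["_id"], hit)
        if len(by_id) >= wanted:
            break  # enough unique hits, no need to replenish further
    return list(by_id.values())[:wanted]
```

With this shape, the worst case is bounded by `max_pages` round trips per cluster, which is the price of deduping on the client.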
Questions:
a) Is there any way to get ES to do the deduping, e.g. based on the _id field?
b) Any other suggestions?