I just joined this group today to learn more about ES. We have been using ES for many months and it has been great for our application: making users' comments (including tweets) searchable.
We are running 0.19.2, but for the next version of our application we plan to move to 0.19.9.
Here's my problem:
a) We decided to split our ES indices into 'fast' and 'archive' clusters. 'fast' holds the latest 'n' comments, which get moved to the 'archive' cluster based on an LRU policy.
b) An archived comment gets recreated in the fast cluster as a new comment when someone replies to it or modifies it. This introduces a duplicate hit when we search across the 2 clusters.
Possible solutions:
Delete those comments from the archive when they are recreated in the fast cluster. This ensures each comment doc is unique across the 2 clusters. Cons: extra load on the archive cluster (search and delete).
Post-process the hits and remove dups (this is our current implementation). Cons: we can 'lose' 50% of the total hits unless we replenish with another query (with a cursor), but then when do we stop? Also, this is client-side deduping.
Get ES to do the deduping on the server side.
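To make the second option concrete, here is a minimal Python sketch of client-side deduping with replenishment. It is only an illustration, not our production code: `search_deduped`, `list_fetcher`, and the `(offset, size)` fetcher signature are hypothetical names, and each fetcher stands in for a paged search against one cluster. It assumes a comment keeps the same `_id` in both clusters. The fast cluster is read first so its copy wins, and the loop stops when there are enough unique hits, when both clusters run dry, or after `max_rounds` extra queries.

```python
from typing import Callable, Dict, List

Hit = Dict[str, object]
# Hypothetical fetcher: takes (offset, size) and returns up to `size` hits.
# A page shorter than `size` means that cluster has no more hits.
PageFetcher = Callable[[int, int], List[Hit]]


def search_deduped(fetch_fast: PageFetcher, fetch_archive: PageFetcher,
                   wanted: int, page_size: int = 10,
                   max_rounds: int = 5) -> List[Hit]:
    """Collect up to `wanted` hits with unique _id values from two clusters."""
    seen = set()
    unique: List[Hit] = []
    exhausted = {"fast": False, "archive": False}
    # Fast cluster first, so the recreated (newer) copy of a comment wins.
    sources = [("fast", fetch_fast), ("archive", fetch_archive)]

    for round_no in range(max_rounds):
        offset = round_no * page_size
        for name, fetch in sources:
            if exhausted[name]:
                continue
            page = fetch(offset, page_size)
            if len(page) < page_size:
                exhausted[name] = True
            for hit in page:
                if hit["_id"] not in seen:
                    seen.add(hit["_id"])
                    unique.append(hit)
        # Stop rule: enough unique hits, or nothing left to fetch.
        if len(unique) >= wanted or all(exhausted.values()):
            break
    return unique[:wanted]


def list_fetcher(hits: List[Hit]) -> PageFetcher:
    """Test helper: serve pages from an in-memory list instead of a cluster."""
    return lambda offset, size: hits[offset:offset + size]
```

Note that this sketch returns hits in arrival order (fast page, then archive page) and does not merge them by score; a real implementation would also need to re-sort the combined hits.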
Questions:
a) Is there any way to get ES to do the deduping, e.g. based on the _id field?
b) Any other suggestions?
Writing a river might solve your requirement of doing things on the server
side.
In an ES application I recently worked with, the requirement was to apply
some additional analysis to an index and replicate the docs (with
additional info) to a sister index. The primary was a growing index,
and a custom river was developed that checks the difference between the
primary and the secondary at a fixed interval and processes any data that
is in the primary but not in the secondary. It helped a lot in moving the
processing load from the client side onto the ES server.
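The core of that periodic check can be sketched as follows. This is only a rough Python illustration of the logic, not the river itself (a river runs inside the ES server as a plugin): plain dicts keyed by doc id stand in for the two indices, and `sync_missing` and `enrich` are hypothetical names.

```python
from typing import Callable, Dict

Doc = Dict[str, object]


def sync_missing(primary: Dict[str, Doc], secondary: Dict[str, Doc],
                 enrich: Callable[[Doc], Doc]) -> int:
    """Copy docs that are in primary but not in secondary, enriching each one.

    Returns the number of docs processed in this pass.
    """
    # The "difference" step: ids present in primary and absent in secondary.
    missing_ids = [doc_id for doc_id in primary if doc_id not in secondary]
    for doc_id in missing_ids:
        # `enrich` is a placeholder for the additional analysis.
        secondary[doc_id] = enrich(primary[doc_id])
    return len(missing_ids)
```

Calling this once per interval gives the behaviour described above: a pass that finds nothing missing does no work, so repeated runs are cheap once the secondary has caught up.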
Sujoy.
(In reply to es_learner's post of Friday, September 21, 2012 6:02:01 AM UTC+5:30.)