Is there any way to remove duplicate search results in ES?

yang_ming · January 20, 2014, 9:54am

Hi,

I am looking for a way to remove duplicate search results in ES, and I
would be grateful for anybody's help.
First, let me explain the requirement. I have indexed three documents; each
has a unique primary key, but all three share the same docid. Such
documents may be published by the same author at different times. If I
search for the related documents in ES, I get all three, but I only want
the newest one. I need to remove the other duplicates.

I tried to develop a custom plugin to implement this requirement, but I
failed, because there is no chance to hook in my plugin after ES has
collected all the search results. Has anyone encountered the same problem?
Some people have met the same problem, as the links below show.

Lucene has a filter called DuplicateFilter, which can remove duplicate
values from search results. Maybe I can use this filter to remove the
articles that have the same author.
Please see the following link:
http://lucene.apache.org/core/4_0_0/sandbox/org/apache/lucene/sandbox/queries/DuplicateFilter.html
However, a Lucene filter cannot be used in ES directly. Some people have
met the same problem, and kimchy has given a solution. Please take a look
at the following link:
http://elasticsearch-users.115913.n3.nabble.com/Possible-to-use-Lucene-filters-td3375477.html

Some people also want to use DuplicateFilter in ES and have asked kimchy
for help. The following link shows the details:

https://github.com/elasticsearch/elasticsearch/issues/1405

So we may have a way to solve our problem, but according to kimchy it is
not the best one.

In short, none of the approaches above is a perfect solution. Has anybody
met the same problem?

Thanks,
ming

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/f21b7fa0-1cd1-4aae-87fa-93fe463f39cc%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

jprante · January 20, 2014, 10:45am

It is not true that "there is no chance to install my plugin after ES have
collected all search result". You can implement a plugin with an
alternative search action. The issue you have cited is about overriding
default actions, and there is good reason for not allowing that.

The Lucene DuplicateFilter works at the segment level; it is suitable
neither for the index level nor for distributed search.

The basic idea is: if you want the "newest one" of the documents, you can
sort the docs by timestamp and pick the first one, ignoring the rest.
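
A minimal Python sketch of this sort-and-pick idea, run over hits that have already been fetched. The field names `docid` and `timestamp` and the sample hits are assumptions for illustration, not taken from a real index; timestamps are assumed to be ISO 8601 strings, which sort correctly as plain text.

```python
from typing import Dict, Iterable, List


def newest_per_docid(hits: Iterable[dict]) -> List[dict]:
    """Keep only the most recent hit for each docid."""
    newest: Dict[str, dict] = {}
    # Sort newest first, so the first hit seen for a docid is the one to keep.
    for hit in sorted(hits, key=lambda h: h["timestamp"], reverse=True):
        newest.setdefault(hit["docid"], hit)
    return list(newest.values())


# Hypothetical hits: three versions of the same document.
hits = [
    {"id": "1", "docid": "A", "timestamp": "2014-01-01T00:00:00"},
    {"id": "2", "docid": "A", "timestamp": "2014-01-15T00:00:00"},
    {"id": "3", "docid": "A", "timestamp": "2014-01-10T00:00:00"},
]

deduplicated = newest_per_docid(hits)
print(deduplicated)
```

Only the hit with id "2" survives, because it carries the latest timestamp for docid "A".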

You can use aggregations plus filtered queries to issue a series of queries
against an ES index and deduplicate the results on the client side, using
your own ordering rules (e.g. one bucket per author, and pick at most one
doc per author from the timestamp-sorted result set of a filtered query).
Note that this procedure is very expensive and does not scale.
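
The series of queries described above could look roughly like the following Python sketch. It only builds request bodies and walks the responses; the `search` callable, the field names `author` and `timestamp`, and the in-memory `fake_search` stand-in are all assumptions for illustration. The `filtered` query matches the ES versions of this thread's era; later versions use a `bool` query with a `filter` clause instead.

```python
from typing import Callable, List


def author_buckets_body() -> dict:
    # One bucket per author; size 0 asks for aggregations only, no hits.
    # Note: a terms aggregation returns only the top buckets by default.
    return {"size": 0, "aggs": {"by_author": {"terms": {"field": "author"}}}}


def newest_for_author_body(author: str) -> dict:
    # Filtered query for one author, newest document first, at most one hit.
    return {
        "query": {"filtered": {"filter": {"term": {"author": author}}}},
        "sort": [{"timestamp": {"order": "desc"}}],
        "size": 1,
    }


def dedupe_by_author(search: Callable[[dict], dict]) -> List[dict]:
    """Issue one query per author bucket and keep the newest doc of each."""
    response = search(author_buckets_body())
    newest = []
    for bucket in response["aggregations"]["by_author"]["buckets"]:
        hits = search(newest_for_author_body(bucket["key"]))["hits"]["hits"]
        if hits:
            newest.append(hits[0]["_source"])
    return newest


# Hypothetical in-memory stand-in for an ES search call, for demonstration.
DOCS = [
    {"author": "ming", "docid": "A", "timestamp": "2014-01-01"},
    {"author": "ming", "docid": "A", "timestamp": "2014-01-15"},
    {"author": "joerg", "docid": "B", "timestamp": "2014-01-10"},
]


def fake_search(body: dict) -> dict:
    if "aggs" in body:
        authors = sorted({d["author"] for d in DOCS})
        buckets = [{"key": a} for a in authors]
        return {"aggregations": {"by_author": {"buckets": buckets}}}
    author = body["query"]["filtered"]["filter"]["term"]["author"]
    matches = sorted(
        (d for d in DOCS if d["author"] == author),
        key=lambda d: d["timestamp"],
        reverse=True,
    )
    return {"hits": {"hits": [{"_source": d} for d in matches[: body["size"]]]}}


result = dedupe_by_author(fake_search)
print(result)
```

The cost is visible in the loop: one extra round trip per author bucket, which is why this does not scale.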

The best method, and the most preferred solution, is to index deduplicated
data, because it is cheap: fetch the list of docs per author from the
original source and index only the one you want to have in search results.

Jörg

yang_ming · January 20, 2014, 2:01pm

Thank you for your quick reply.

It is true that I can write my own custom search action, but I cannot
override the default search action, so it is not what I want.
At indexing time, there are several listeners that plugins can hook into,
but at search time there is hardly any listener for extending the search
operation, except the search action.
Why not provide an opportunity for plugins to extend the search phase?
Judging from the source code, it seems simple to do.

Following your answer, I will give up on the solution using the Lucene
DuplicateFilter.

Your proposal is very useful for solving my problem. I will try it.
Thank you very much!

yang_ming · January 20, 2014, 2:02pm