Deduplication filter?

Kshitij_Gupta · January 2, 2015, 4:50am

Hi,

I am working on a system where I index multiple versions of objects by
timestamp. I include the timestamp in the document id so that each version
gets a different document id and is searchable by itself. This works fine.
When I search over a time range, I get the matching documents and then
deduplicate the result set, keeping only the latest version of each object.

Now I would like to use aggregations in Elasticsearch for building facets.
But the facet calculation needs to happen after deduplication, otherwise the
counts will be inaccurate (objects for which multiple versions matched will
be counted multiple times). Is there a deduplication filter available in
Elasticsearch? If not, how do I write one myself?

The only way I could get it to work was in multiple steps. First, query and
get the matching document ids. Then, deduplicate based on timestamp. Last,
run another query with an ids filter listing all the ids from the second
step to compute the aggregations. But the last step takes even more time
than the first query.
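
For illustration, the three steps could be sketched roughly as follows. This is only a sketch under assumptions: the field names `object_id`, `timestamp` and `color`, the `a_100`-style ids, and the helper names are hypothetical, and the exact request syntax depends on the Elasticsearch version in use.

```python
from typing import Dict, Iterable, List


def latest_versions(hits: Iterable[dict]) -> List[dict]:
    """Step 2: keep only the newest hit for each object."""
    newest: Dict[str, dict] = {}
    for hit in hits:
        key = hit["object_id"]  # hypothetical field identifying the object
        if key not in newest or hit["timestamp"] > newest[key]["timestamp"]:
            newest[key] = hit
    return list(newest.values())


def aggregation_request(doc_ids: List[str], facet_field: str) -> dict:
    """Step 3: request body that aggregates over the surviving ids only."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"ids": {"values": doc_ids}},
        "aggs": {"facet": {"terms": {"field": facet_field}}},
    }


# Step 1 would be the time-range search; its hits are faked here.
hits = [
    {"_id": "a_100", "object_id": "a", "timestamp": 100, "color": "red"},
    {"_id": "a_200", "object_id": "a", "timestamp": 200, "color": "blue"},
    {"_id": "b_150", "object_id": "b", "timestamp": 150, "color": "red"},
]
kept = latest_versions(hits)
body = aggregation_request([h["_id"] for h in kept], "color")
```

The id list in the last request grows with the number of distinct objects that matched.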

Any suggestions? I was thinking that any system that tracks multiple
versions of an object will face this issue.

Thanks,
Kshitij

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/11f983e9-b661-4ae6-83d1-6875bedd8c7e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nik9000 · January 2, 2015, 4:58am

Simplest way might be to push an update to the old versions of the
documents to mark them as old and do aggregations filtering those out.
There isn't a great way to deduplicate, really.
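
A minimal sketch of that idea, assuming a hypothetical boolean field `is_latest` that is set to false on the previous version whenever a newer one is indexed. The helper names and the facet field are made up for the example, and the request bodies may need adjusting for the Elasticsearch version in use.

```python
def mark_old_update() -> dict:
    """Partial-update body to send for the previous version of an object."""
    return {"doc": {"is_latest": False}}  # `is_latest` is a hypothetical field


def latest_only_aggregation(facet_field: str) -> dict:
    """Aggregate over current versions only, so each object counts once."""
    return {
        "size": 0,  # skip the hits, return just the aggregation
        "query": {"term": {"is_latest": True}},
        "aggs": {"facet": {"terms": {"field": facet_field}}},
    }


update_body = mark_old_update()
agg_body = latest_only_aggregation("color")
```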

jprante · January 2, 2015, 8:13am

The most scalable method is to index all the documents in time-window-based
indexes and search only the latest time-window index, so no deduplication is
required.
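
As a rough sketch of what that could look like, assuming daily windows and a hypothetical `objects-YYYY.MM.DD` naming scheme (neither is specified in this thread), the index to write to and to search can be derived from the timestamp:

```python
from datetime import datetime, timezone


def index_for(ts: datetime, prefix: str = "objects") -> str:
    """Name of the daily index a document with timestamp `ts` belongs to."""
    # Normalise to UTC so every client derives the same index name.
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


ts = datetime(2015, 1, 2, 8, 13, tzinfo=timezone.utc)
current_index = index_for(ts)
```

Searches and aggregations would then target only `current_index`.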

Jörg

Kshitij_Gupta · January 2, 2015, 12:20pm

Good suggestion, and we do that already. But we also allow querying over a
time range, which could be in the past. In that case multiple "old" versions
might match, and we need to pick the latest "old" version.
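
A small made-up example of the problem, using a hypothetical `is_latest` flag and integer timestamps: for a range that lies entirely in the past, filtering on the flag returns nothing, while the version we actually want is the newest one inside the range.

```python
versions = [
    {"object_id": "a", "timestamp": 100, "is_latest": False},
    {"object_id": "a", "timestamp": 200, "is_latest": False},
    {"object_id": "a", "timestamp": 300, "is_latest": True},
]


def in_range(docs, start, end):
    """Versions whose timestamp falls inside the queried range."""
    return [d for d in docs if start <= d["timestamp"] <= end]


def latest_per_object(docs):
    """Newest version of each object among the given documents."""
    newest = {}
    for d in docs:
        cur = newest.get(d["object_id"])
        if cur is None or d["timestamp"] > cur["timestamp"]:
            newest[d["object_id"]] = d
    return list(newest.values())


matched = in_range(versions, 50, 250)                   # two "old" versions
flag_filtered = [d for d in matched if d["is_latest"]]  # nothing survives
wanted = latest_per_object(matched)                     # version at 200
```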

On Friday, January 2, 2015 10:28:58 AM UTC+5:30, Nikolas Everett wrote:

Simplest way might be to push an update to the old versions of the
documents to mark them as old and do aggregations filtering those out.
There isn't a great way to deduplicate, really.