Avoiding duplicate documents with versioning

Thomas_Bolis · February 15, 2014, 2:53pm

Hi,

First of all, congrats on the 1.0 release! Thumbs up for the aggregation
framework.

I'm trying to build a system that runs analytics-style queries. I have a
document called event, and I have events of specific types (e.g. click,
open, etc.) per page. So for a given page I might have, for example, an
open event. The thing is that I might receive the same open event more
than once, but I want to count it only once. So I use the versioning API
and provide the same document id, with the result that the version
increases.

In my queries I use the _timestamp field to determine the last document
that I counted. My problem is that since ES reindexes the document, it
updates _timestamp, so the document looks recent and my queries count it
again.

Is there a way to simply discard a document if one with the same id
already exists, without stopping the bulk operation that uploads the
documents?

Thanks
Thomas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/15a8062b-a60c-4c2e-ae41-6dd31b4b360b%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Thomas_Bolis · February 15, 2014, 5:17pm

Just an update,

If we use op_type=create in the index request, it will probably discard
the duplicate document. But in the case of a bulk operation, will it stop
the bulk upload, or will it generate an error for that document and move
on to the next one?

thanks
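
For readers following along, here is a minimal sketch of what a bulk request body using the `create` action could look like. It only builds the newline-delimited JSON payload and does not talk to a cluster. The index name `analytics`, the type name `event`, the id scheme (page plus event type) and the helper `build_create_bulk_body` are illustrative assumptions, not something taken from the posts above.

```python
import json


def build_create_bulk_body(index, doc_type, events):
    """Build an NDJSON bulk body that uses the `create` action for every event.

    `events` is an iterable of (doc_id, source_dict) pairs. All names here
    are hypothetical and only serve the illustration.
    """
    lines = []
    for doc_id, source in events:
        # `create` asks Elasticsearch to reject this item if a document
        # with the same id already exists, instead of overwriting it.
        action = {"create": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(source, sort_keys=True))
    # The bulk endpoint expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Assumed id scheme: one document per (page, event type) pair.
events = [
    ("page1-open", {"page": "page1", "type": "open"}),
    ("page1-click", {"page": "page1", "type": "click"}),
]
body = build_create_bulk_body("analytics", "event", events)
print(body)
```

Because the id is derived from the page and the event type in this sketch, a second "open" event for the same page maps to the same id and would be rejected by `create` rather than reindexed, which leaves the stored document and its _timestamp untouched.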


Thomas_Bolis · February 18, 2014, 8:41am

For anyone else who might find this post useful: we finally managed to
get the expected functionality as described here.

Thanks
Thomas


Boaz_Leskes · February 18, 2014, 12:19pm

Hi Thomas,

Just to confirm: the bulk API will only report that specific doc as failed
and will continue to process all the rest.

Cheers,
Boaz
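
As a rough illustration of the per-item reporting described above, the sketch below sorts the items of a bulk response into created documents, rejected duplicates and other failures. The sample response is hand-written to resemble the general shape of a bulk response (an `items` list keyed by action name, with a `status` per item); the exact fields and error text depend on the Elasticsearch version, and the helper `split_bulk_results` is a hypothetical name.

```python
def split_bulk_results(response):
    """Split bulk response items into created ids, duplicate ids and failures.

    Assumes each item looks like {"create": {"_id": ..., "status": ...}},
    with HTTP-style status codes (409 meaning the id already existed).
    """
    created, duplicates, failures = [], [], []
    for item in response.get("items", []):
        # Each item is a one-entry dict keyed by the action name.
        for _action, result in item.items():
            status = result.get("status")
            if status == 409:
                # Conflict: the document already exists, so it was skipped.
                duplicates.append(result.get("_id"))
            elif status is not None and 200 <= status < 300:
                created.append(result.get("_id"))
            else:
                failures.append(result)
    return created, duplicates, failures


# Hand-written sample, not captured from a real cluster.
sample_response = {
    "errors": True,
    "items": [
        {"create": {"_index": "analytics", "_type": "event",
                    "_id": "page1-open", "status": 409,
                    "error": "document already exists"}},
        {"create": {"_index": "analytics", "_type": "event",
                    "_id": "page1-click", "status": 201}},
    ],
}

created, duplicates, failures = split_bulk_results(sample_response)
print("created:", created)
print("duplicates skipped:", duplicates)
print("other failures:", failures)
```

Treating a 409 as "already counted" rather than as an error is a design choice for this use case; any other non-2xx status is kept in `failures` so that real problems are not silently dropped.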


joningle · May 22, 2014, 4:07pm

Just to clarify, did you get the behavior you wanted by using the create op_type? In looking at the ES code, I would expect the default op_type (INDEX) to not create duplicate documents if the id, type and version are the same. Is this not true?
