Avoiding duplicate documents with versioning


(Thomas) #1

Hi,

First of all, congrats on the 1.0 release! Thumbs up for the aggregation
framework :)

I'm trying to build a system that runs analytics-style queries. I have a
document type called event, and each page has events of specific types (e.g.
click, open). So a page might have, for example, an open event. The thing is
that I might receive the same open event more than once, but I want to count
it only once. So I use the versioning API and provide the same document id,
which results in the version being incremented.

In my queries I use the _timestamp field to determine the last document
that I counted. My problem is that when ES reindexes the document it
updates _timestamp, so it looks like a recent document and my queries
count it again.

Is there a way to simply discard a document if a document with the
same id already exists, without stopping the bulk upload of documents?

Thanks
Thomas

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/15a8062b-a60c-4c2e-ae41-6dd31b4b360b%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Thomas) #2

Just an update,

If we use op_type=create in the index request, it will probably discard
the duplicate document. But in the case of a bulk operation, will
it stop the bulk upload, or will it generate an error and move on to the next
document?
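
For readers following along, here is a minimal sketch of what a bulk body using the create action could look like. It only builds the newline-delimited JSON payload and does not talk to a cluster. The index name `analytics`, the type name `event`, the id scheme, and the helper `build_bulk_create_body` are all hypothetical, chosen for illustration rather than taken from the setup described in this thread.

```python
import json


def build_bulk_create_body(index, doc_type, events):
    """Build a newline-delimited bulk body that uses the `create` action.

    `events` is an iterable of (document_id, source_dict) pairs.
    With `create`, a document whose id already exists is rejected
    instead of being reindexed.
    """
    lines = []
    for event_id, source in events:
        # Action/metadata line, followed by the document source line.
        action = {"create": {"_index": index, "_type": doc_type, "_id": event_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical events: the "open" event for page-1 arrives twice.
events = [
    ("page-1-open", {"page": "page-1", "type": "open"}),
    ("page-1-click", {"page": "page-1", "type": "click"}),
    ("page-1-open", {"page": "page-1", "type": "open"}),
]
body = build_bulk_create_body("analytics", "event", events)
```

The resulting string would be sent as the body of a `_bulk` request. Deriving the id from the page and event type is just one possible way to make repeated events collide on the same id.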

thanks

(In reply to post #1, sent Saturday, 15 February 2014 16:53:20 UTC+2.)


(Thomas) #3

For anyone else who might find this post useful: we finally
managed to get the expected functionality as described here

Thanks
Thomas



(Boaz Leskes) #4

Hi Thomas,

Just to confirm - the bulk API will only report that specific doc as failed
and will continue to process all the rest.
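
As a rough illustration of that per-item reporting, the sketch below walks over a bulk response and separates documents that were created from those rejected as duplicates. The response dictionary is hand-written for the example, not captured from a real cluster; the assumption is that each item is keyed by its action name and carries a `status` code, with 409 signalling a conflict on an existing id. The helper name `split_bulk_results` and the error text are hypothetical.

```python
def split_bulk_results(response):
    """Split bulk response items into created ids, duplicate ids, and other failures."""
    created, duplicates, failures = [], [], []
    for item in response.get("items", []):
        # Each item has a single key naming the action, e.g. {"create": {...}}.
        result = next(iter(item.values()))
        status = result.get("status")
        if status == 409:
            # Assumed conflict status: a document with this id already exists.
            duplicates.append(result["_id"])
        elif status is not None and 200 <= status < 300:
            created.append(result["_id"])
        else:
            failures.append(result)
    return created, duplicates, failures


# Hand-written sample response (illustrative only).
sample_response = {
    "errors": True,
    "items": [
        {"create": {"_index": "analytics", "_type": "event",
                    "_id": "page-1-open", "status": 201}},
        {"create": {"_index": "analytics", "_type": "event",
                    "_id": "page-1-click", "status": 201}},
        {"create": {"_index": "analytics", "_type": "event",
                    "_id": "page-1-open", "status": 409,
                    "error": "document already exists"}},
    ],
}
created, duplicates, failures = split_bulk_results(sample_response)
```

Treating the duplicates list as expected noise, and only alerting on the failures list, keeps a repeated event from interrupting the rest of the upload.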

Cheers,
Boaz

(In reply to post #3, sent Tuesday, February 18, 2014 9:41:01 AM UTC+1.)


(joningle) #5

Just to clarify, did you get the behavior you wanted by using the create op_type? In looking at the ES code, I would expect the default op_type (INDEX) to not create duplicate documents if the id, type and version are the same. Is this not true?

