"Index IF NO DUPLICATE" Operation

vineeth_mohan · March 18, 2013, 9:09am

Hi,

I want to index a feed, but I need to make sure there are no duplicates
of it.
Is this available as an atomic operation on the ES side?

My application pumps feeds into ES in parallel, so adding a check like
this on the application side would badly hurt its performance.

Thanks
Vineeth

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

spinscale · March 18, 2013, 9:12am

Hey

Maybe the op_type parameter can help in your case (it depends on whether you
need the most up-to-date record in ES)...


vineeth_mohan · March 18, 2013, 9:23am

I have no way of knowing whether the document that needs this "update or
index" treatment already exists.
To use the op_type parameter, I would need the ID of the potentially
identical document, right?
I don't have that in any form.

Thanks
Vineeth


Clinton_Gormley · March 18, 2013, 9:29am

Given that the only unique key constraint is on the _id, if you want to
detect clashes, then you need to put that info into the _id.

In a Perl module I wrote, I used the _ids from a separate index to apply
a unique constraint to a field other than _id in my main index:

https://metacpan.org/module/ElasticSearchX::UniqueKey
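
The idea of a separate index whose _ids carry the unique values can be sketched in Python. This is only an illustration under assumptions: `UniqueKeyRegistry`, `index_if_unique`, and the `email` field are hypothetical names, an in-memory set stands in for the separate index, and it is not a description of how the Perl module is implemented.

```python
class UniqueKeyRegistry:
    """In-memory stand-in for a separate index whose _ids are the unique values."""

    def __init__(self):
        self._claimed = set()

    def claim(self, field, value):
        """Reserve (field, value); return False if it was already taken."""
        key = (field, value)
        if key in self._claimed:
            return False
        self._claimed.add(key)
        return True


def index_if_unique(registry, main_index, doc_id, doc, field):
    """Store doc in main_index only if its value for `field` is still unclaimed."""
    if not registry.claim(field, doc[field]):
        return False  # clash on the unique field: reject the document
    main_index[doc_id] = doc
    return True


registry = UniqueKeyRegistry()
main_index = {}  # hypothetical stand-in for the main index
first = index_if_unique(registry, main_index, "1", {"email": "a@example.com"}, "email")
second = index_if_unique(registry, main_index, "2", {"email": "a@example.com"}, "email")
```

Here the second document is rejected because its `email` value was already claimed by the first, even though the two documents have different ids.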

clint


vineeth_mohan · March 18, 2013, 9:29am

By duplicate I don't mean a duplicate doc ID, but duplicate content in
certain fields.

Thanks
Vineeth


vineeth_mohan · March 18, 2013, 9:38am

So the solution you propose is to build the ID from the data fields that I
want to be unique, and then create the feed using it?

Thanks
Vineeth


jprante · March 18, 2013, 9:39am

You have to know the relevant fields beforehand, collect their values, and
generate a cryptographic hash of them (e.g. SHA-256). The hash can be
formatted as base64 / hex characters and used as the _id. In conjunction
with the "create" op_type and/or external versioning, you can implement
a strategy where the response tells you if the document already exists.

I use this in the JDBC river.
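
A minimal Python sketch of the ID derivation described above. The function name `content_id` and the sample fields (`title`, `url`, `fetched`) are hypothetical, and JSON is used here only as one convenient canonical serialization; any stable encoding of the chosen fields would do.

```python
import hashlib
import json


def content_id(doc, key_fields):
    """Derive a deterministic _id from the fields that define a duplicate."""
    # Serialize the selected fields in a fixed order so that equal content
    # always produces the same bytes, and therefore the same hash.
    pairs = [[field, doc.get(field)] for field in sorted(key_fields)]
    payload = json.dumps(pairs, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two feeds that differ only in a field that is not part of the key.
a = {"title": "Release notes", "url": "http://example.com/a", "fetched": "2013-03-18"}
b = {"title": "Release notes", "url": "http://example.com/a", "fetched": "2013-03-19"}
c = {"title": "Other post", "url": "http://example.com/c", "fetched": "2013-03-18"}

id_a = content_id(a, ["title", "url"])
id_b = content_id(b, ["title", "url"])
id_c = content_id(c, ["title", "url"])
```

`id_a` and `id_b` are identical, so sending both documents with that _id and the "create" op_type would let the second one be detected as a duplicate, while `id_c` differs. Note that changing the list of key fields changes every ID, so existing documents would have to be reindexed.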

Jörg


vineeth_mohan · March 18, 2013, 9:45am

Looks good, but if tomorrow I have to change the unique key function
(e.g. if some new field has to be considered), it would be tedious.

Thanks
Vineeth


vineeth_mohan · March 18, 2013, 12:09pm

One more question.
If it is a duplicate, I want to update the original document
(increment the occurrence field). Is that possible with this logic?
Thanks
Vineeth


brian_yoder · March 18, 2013, 2:59pm

The "create" action will create the record if the index+type+id does not
exist, but will reject it if the index+type+id already exists.

For example, a _bulk action-and-meta-data line specifies create as follows:

{ "create" : { "_index" : "fizzbuzz", "_type" : "person", "_id" : "7214560012" } }
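
As a rough illustration, a _bulk request body made of such "create" lines could be assembled like this in Python. The helper name `bulk_create_body` and the `name` field in the sample documents are hypothetical; the index, type, and first id are taken from the example above.

```python
import json


def bulk_create_body(docs, index, doc_type):
    """Build a newline-delimited _bulk body that uses the "create" action."""
    lines = []
    for doc_id, source in docs:
        # Action-and-meta-data line, followed by the document source line.
        action = {"create": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format expects every line, including the last, to end in a newline.
    return "\n".join(lines) + "\n"


body = bulk_create_body(
    [("7214560012", {"name": "Alice"}), ("7214560013", {"name": "Bob"})],
    index="fizzbuzz",
    doc_type="person",
)
```

Each item in the bulk response then reports separately whether its create succeeded or was rejected because that id already existed.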


vineeth_mohan · March 18, 2013, 3:07pm

What if I want to modify the original feed based on the new feed instead of
rejecting the second feed?

Thanks
Vineeth


brian_yoder · March 18, 2013, 7:33pm

Well, the version number from the original feed can be passed into the
update. That way, when you compute the updated document and go to write
it back, the version number is checked to ensure that no other update
was made to that record in the meantime.

If the write fails because of a version number mismatch, you can re-read
the record, re-compute the changes, and try again.

I believe that would ensure an atomic update. There is no record
lock/read/update/unlock facility, of course. Which is a good thing!
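
The read, re-compute, and retry loop can be sketched in Python roughly as follows. Everything here is an assumption made for illustration: `InMemoryStore` is a stand-in for a versioned document store and is not the Elasticsearch client API, and `index_or_increment` and the `occurrence` field are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when a write does not match the stored document state."""


class InMemoryStore:
    """Stand-in for a versioned document store (not the Elasticsearch API)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)

    def create(self, doc_id, source):
        if doc_id in self._docs:
            raise VersionConflict(doc_id)  # behaves like a rejected "create"
        self._docs[doc_id] = (1, dict(source))

    def index(self, doc_id, source, expected_version):
        current_version, _ = self._docs[doc_id]
        if current_version != expected_version:
            raise VersionConflict(doc_id)  # someone else wrote in between
        self._docs[doc_id] = (current_version + 1, dict(source))


def index_or_increment(store, doc_id, source, max_retries=5):
    """Create the document, or bump its occurrence count if it already exists."""
    try:
        store.create(doc_id, dict(source, occurrence=1))
        return 1
    except VersionConflict:
        pass  # duplicate: fall through to the versioned update
    for _ in range(max_retries):
        version, current = store.get(doc_id)
        current["occurrence"] = current.get("occurrence", 1) + 1
        try:
            store.index(doc_id, current, expected_version=version)
            return current["occurrence"]
        except VersionConflict:
            continue  # re-read and re-compute, then try again
    raise RuntimeError("gave up after repeated version conflicts")


store = InMemoryStore()
first = index_or_increment(store, "abc", {"title": "x"})
second = index_or_increment(store, "abc", {"title": "x"})
```

With a single writer the first call creates the document and the second increments its counter; with concurrent writers, a losing write hits the version check and is retried against the fresh copy.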

