Reliable term search

Eugene_Strokin · July 18, 2012, 9:11pm

Hello,
I have a problem which I cannot reproduce locally or in my DEV
environment, but it is happening on the production server.
So this is something of a mystery, but if anyone has an idea or a
suggestion, I'd greatly appreciate it.

One field of my document is a non-analyzed string, except that it is
lower-cased. I need to keep this field unique, but it is not an ID or
anything like that, just a field.
So, when I need to index a new document, I check whether any document with
that value in this field exists by running a term search with the new value.
It works fine everywhere. But in production (which is very busy at times),
I've found documents with the same value. I've checked when they were
indexed, and there are hours or days between those documents. I've added a
check that all nodes were searched and that there were no errors during the
term search; if not, I don't index the document. But I still see from time
to time that the term search doesn't return a document with the value, and
the system inserts a new document with the same value.

I guess my question is: is the term search reliable? That is, if no errors
are indicated in the result, will I always get the document with the term
I'm searching for, or is it possible that the search misses the document
for some reason?

Or, how can I make sure that my field is always unique?

Thanks in advance,
Eugene S.

Clinton_Gormley · July 19, 2012, 8:26am

Hi Eugene

Note: search is near real-time. By default, search's "view" on the
indexed data is refreshed once every second. So it is quite possible to
have a document which has been indexed (and which you can GET) but is
not yet visible to search.

I don't know what refresh interval you have set, but it seems unlikely
that these docs were indexed hours or days before. A term search IS
reliable (although it is possible that you have some other bug in your
code which is interfering with that).

Either way, your approach for managing a unique field is incorrect. You
will always be subject to race conditions.
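
To make the timing issue concrete, here is a minimal sketch in Java of a toy
in-memory index. It is not the Elasticsearch client API: ToyNrtIndex and all
of its methods are made-up names, and refresh() merely stands in for the
periodic refresh described above. It shows how a check-then-index sequence can
store the same value twice when both checks run against the same un-refreshed
view.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of a near real-time index. This is NOT the Elasticsearch
// client API; every name here is hypothetical and exists only for illustration.
class ToyNrtIndex {
    // Everything indexed so far (what a GET by id could already see).
    private final List<String> indexed = new ArrayList<>();
    // The snapshot that search sees; it only changes when refresh() is called.
    private List<String> searchable = new ArrayList<>();

    void index(String value) {
        indexed.add(value);
    }

    // Stands in for the periodic refresh (once per second by default).
    void refresh() {
        searchable = new ArrayList<>(indexed);
    }

    // Stands in for a term search: counts matches in the last refreshed view.
    int searchCount(String value) {
        int hits = 0;
        for (String v : searchable) {
            if (v.equals(value)) {
                hits++;
            }
        }
        return hits;
    }

    // The check-then-index pattern from the question.
    boolean indexIfAbsent(String value) {
        if (searchCount(value) > 0) {
            return false; // search saw an existing document, so skip
        }
        index(value);
        return true;
    }

    // Two writers submit the same value between two refreshes.
    static int duplicatesAfterRace() {
        ToyNrtIndex idx = new ToyNrtIndex();
        idx.indexIfAbsent("foo"); // writer 1: search finds nothing, indexes
        idx.indexIfAbsent("foo"); // writer 2: view not refreshed yet, also indexes
        idx.refresh();
        return idx.searchCount("foo");
    }
}
```

This only models the window between two refreshes. On its own it would not
explain duplicates that were indexed hours or days apart.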

An approach you can use is similar to what I used in
http://blogs.perl.org/users/clinton_gormley/2011/10/elasticsearchsequence---a-blazing-fast-ticket-server.html

The only unique field in ES is the _id.

So you can have an index whose job it is to maintain a list of unique
values, stored in the _id field.

Eg, let's say you want to make sure that, for field 'my_val', the value
'foo' is not used elsewhere. You can have an index 'unique', with a type
'my_val'.

Try to create a document as:

index: unique
type: my_val
id: foo

If the create fails, then the value already exists.

(Of course, if the doc where you use that value is later deleted, then
you need to delete the unique doc as well.)
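
The claim-and-release logic above can be sketched in Java with an in-memory
stand-in for the 'unique' index. UniqueValueRegistry, tryClaim and release are
hypothetical names, not part of any Elasticsearch client. Against a real
cluster the claim would presumably be a create-only index request (for example
op_type=create in the REST API), since a plain index request overwrites an
existing _id instead of failing.

```java
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the 'unique' index. The class and method names are
// hypothetical; a real version would issue a create-only index request.
class UniqueValueRegistry {
    // Plays the role of the set of _id values in the 'unique' index.
    private final Set<String> ids = ConcurrentHashMap.newKeySet();

    // The field in the question is lower-cased, so normalise before claiming.
    private static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    // Atomic "create if absent": true if this caller claimed the value,
    // false if it was already taken (the create would have failed).
    boolean tryClaim(String value) {
        return ids.add(normalize(value));
    }

    // Call when the document that used the value is deleted.
    boolean release(String value) {
        return ids.remove(normalize(value));
    }
}
```

The point of the design is that the existence check and the write are a single
atomic step, so two concurrent writers cannot both succeed for the same value.
Only the writer whose tryClaim returns true should go on to index the real
document.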

clint

Eugene_Strokin · July 20, 2012, 3:03am

Thank you Clint, very good advice about a separate index for unique values.
I'll try to use it, but first I'm trying to find the bug in my code, as you
suggested, and the only thing I suspect is the following.
To make sure I'm receiving a complete search result, I'm checking

response.getFailedShards() == 0

But I also found response.getSuccessfulShards().

I was thinking that
response.getFailedShards() + response.getSuccessfulShards() == response.getTotalShards(),
but now I'm not sure.

Maybe instead of checking response.getFailedShards() == 0 I should be
checking response.getSuccessfulShards() == response.getTotalShards().

Could someone please clarify?

Thank you,

Eugene S.

Clinton_Gormley · July 20, 2012, 8:12am

Hiya

I was thinking that
response.getFailedShards() + response.getSuccessfulShards() == response.getTotalShards(),
but now I'm not sure.

It should. But really you should only get failed shards in exceptional
circumstances, eg a node is down.

clint
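
For the shard counts discussed in this thread, a defensive check can require
both conditions at once, so that it does not rely on the failed and successful
counts always adding up to the total. The sketch below is a hypothetical
helper, not part of the Elasticsearch client. It takes plain integers, which
are assumed to come from response.getTotalShards(),
response.getSuccessfulShards() and response.getFailedShards().

```java
// Hypothetical helper, not part of the Elasticsearch client. The three
// arguments are the values reported by the search response for total,
// successful and failed shards.
class ShardCheck {
    static boolean isComplete(int total, int successful, int failed) {
        // Require both conditions so the check does not depend on
        // failed + successful always adding up to total.
        return failed == 0 && successful == total;
    }
}
```

A caller would treat the term search as trustworthy only when
ShardCheck.isComplete(...) returns true, and skip indexing otherwise.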