Distributed unique constraints


(Steff) #1

Hi

Is there some way of enforcing unique constraints on documents in ES?
E.g. telling ES that at most one document is allowed in an index
where fields "key1" and "key2" have the same values. E.g. with the structure
of data in the example on
http://www.elasticsearch.org/guide/reference/api/index_.html, can I
somehow tell ES that if a document with "user=kimchy" and
"post_date=2009-11-15T14:12:12" has already been indexed into the
index, then no other documents with the exact same values for user and
post_date are allowed to be indexed?

It would be nice to have such a unique constraint feature across an index,
but to avoid communication overhead among nodes the feature might only
work within a single shard of the specific index (then it is up to the
application using ES to make sure that documents that might collide with
respect to the unique constraint are routed to the same shard). Is
there any support for unique constraints, on index level or at least on
shard level?

Is the "id" of documents at least subject to a unique constraint?
Or are you allowed to have several documents in the same index (or shard)
with the same id value?

Regards, Per Steffensen


(Shay Banon) #2

There is no support for unique constraints (and there probably won't be, because
of both the distributed nature of ES and non-real-time search). On the other
hand, you can't have several docs with the same _id, and you can
actually index a document with a "create" op_type, which will cause the
indexing to fail if there is already a document indexed under the same _id.
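
A minimal sketch of what such a request could look like, using the twitter/tweet/1 document from the index API example mentioned in the question. The host and port are assumptions, the URL layout follows the type-based API discussed in this thread, and the helper name build_create_request is made up for illustration. The request is only built here, not sent.

```python
import json
import urllib.request


def build_create_request(base_url, index, doc_type, doc_id, doc):
    """Build (but do not send) an index request that uses op_type=create."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}?op_type=create"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_create_request(
    "http://localhost:9200",  # assumed local node, adjust as needed
    "twitter", "tweet", "1",
    {"user": "kimchy", "post_date": "2009-11-15T14:12:12"},
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. If a document with that _id already exists, the node is expected to reject the request, which urllib surfaces as an HTTPError.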

On Mon, Sep 12, 2011 at 10:39 AM, Per Steffensen <steff@designware.dk> wrote:

    Is there some way of enforcing unique constraints on documents in ES? [...]


(Steff) #3

Thanks, Shay. Then in my world there IS unique constraint support, but
only on the _id field (no user-defined unique constraints). Can you say
something about the "scope" of that unique constraint on _id: is it per
index or only per shard in the index? If it is per index, I guess the
feature will actually be a scalability limit, not allowing ES to scale
"to infinity" (but probably very, very far) with respect to "number of
nodes involved in serving a specific index". But maybe not. Can you say
a little more about how it is implemented, with respect to the
communication needed among nodes running shards in the index, in order
to maintain the unique constraint on _ids in the entire index?

You say that I need to add a "create" op_type in order to make the index
operation fail if it violates the unique constraint on _id. I would
expect the index operation to fail anyway. What other possible outcome
is there when an index operation violates the _id unique constraint? What
happens if I try to index a new document with an _id that is already
used by an existing document in the index, and I do not add the "create"
op_type that you mention?


(Benjamin Devèze) #4

The _id is unique on a per-type basis. So if you have an index twitter with
two types in it, tweet1 and tweet2, you can use the same _id for a document of
type tweet1 and a document of type tweet2.

If you send a request to index a document with an existing _id, this will
update the doc.
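
To make the two behaviours concrete, here is a toy in-memory model (not Elasticsearch code) that keys documents on the (type, _id) tuple. The class ToyIndex, the exception DocumentAlreadyExists and the version counter are assumptions made up for this sketch.

```python
class DocumentAlreadyExists(Exception):
    """Raised by the toy model when op_type="create" hits an existing key."""


class ToyIndex:
    """In-memory stand-in for one index; not Elasticsearch code."""

    def __init__(self):
        self.docs = {}  # (doc_type, doc_id) -> (version, source)

    def index(self, doc_type, doc_id, source, op_type="index"):
        key = (doc_type, doc_id)
        if key in self.docs and op_type == "create":
            raise DocumentAlreadyExists(f"{doc_type}/{doc_id}")
        # A plain index operation overwrites whatever is stored under the key.
        version = self.docs[key][0] + 1 if key in self.docs else 1
        self.docs[key] = (version, source)
        return version


twitter = ToyIndex()
v1 = twitter.index("tweet1", "1234", {"user": "kimchy"})
v2 = twitter.index("tweet2", "1234", {"user": "kimchy"})  # other type: separate doc
v3 = twitter.index("tweet1", "1234", {"user": "steff"})   # same type and id: update
try:
    twitter.index("tweet1", "1234", {"user": "other"}, op_type="create")
    create_rejected = False
except DocumentAlreadyExists:
    create_rejected = True
print(v1, v2, v3, create_rejected, len(twitter.docs))
```

In this model the same _id under two types yields two documents, a repeated index call replaces the stored document, and only op_type="create" turns the collision into an error.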


(Shay Banon) #5

Adding to Benjamin's response inline:

On Mon, Sep 12, 2011 at 11:52 AM, Per Steffensen <steff@designware.dk> wrote:

    Can you say something about the "scope" of that unique constraint on
    _id: is it per index or only per shard in the index? [...]

A document's unique id is the tuple of its type and id. Since a document can't
exist in two shards at the same time, the scope is index-wide but the
check is shard-wise.

    What happens if I try to index a new document with an _id that is
    already used by an existing document in the index, and I do not add the
    "create" op_type that you mention?

The existing document is updated.
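
The reason a shard-wise check can give an index-wide guarantee is that the target shard is derived from the routing value, which by default is the document id. A rough sketch of that mapping, assuming a shard count of 5 and using zlib.crc32 as a stand-in for whatever hash Elasticsearch uses internally:

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed value; fixed when the index is created


def shard_for(routing, num_shards=NUM_PRIMARY_SHARDS):
    """Map a routing value to a shard number.

    zlib.crc32 is only a stand-in for the real routing hash.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def default_routing(doc_id):
    # Without a custom routing value, the document id is the routing value.
    return doc_id


first = shard_for(default_routing("1234"))
second = shard_for(default_routing("1234"))
print(first == second)  # True: the same id always maps to the same shard
```

Because the mapping is deterministic, every operation on a given id reaches the same shard, so no other shard has to be asked whether the id is already taken.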



(Steff) #6

Shay Banon wrote:

    A document's unique id is the tuple of its type and id. Since a document
    can't exist in two shards at the same time, the scope is index-wide
    but the check is shard-wise.

But for that to work, will it not require me, as a user of ES
(writing applications using ES), to make sure that documents with the
same type/_id are routed to the same shard? Imagine the situation where
a document with type/_id equal to tweet1/1234 has already been indexed
with a routing value making it go to shard1. Now my app tries to index a
new document with the same type/_id values tweet1/1234, but with a
different routing value making it go to shard2. In order to make sure
that the unique constraint on type/_id is not violated, ES needs to ask
all shards (especially shard1) whether they already contain a document with
type/_id equal to tweet1/1234. It is not enough to just ask the target
(shard2) of the new document whether it already contains a document with
type/_id equal to tweet1/1234. Or didn't I understand routing correctly?
So basically, because type/_id does not uniquely define the shard that
gets to index the document, all shards need to be contacted when a new
document is indexed, in order to make sure it does not violate the
unique constraint on type/_id.

    Updating the document.

Ok, thanks!


(Shay Banon) #7

If you use a custom routing value, then you have to make sure you use that
routing value when you want to update the document, yes.
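
The tweet1/1234 scenario from the question can be played through with a toy sharded index (again not Elasticsearch code). The class ToyShardedIndex and the routing values "user-N" are made up for this sketch, and crc32 stands in for the real routing hash.

```python
import zlib


class ToyShardedIndex:
    """Toy model in which each shard only knows about its own documents."""

    def __init__(self, num_shards=2):
        self.shards = [{} for _ in range(num_shards)]

    def shard_for(self, routing):
        # crc32 stands in for the real routing hash.
        return zlib.crc32(routing.encode("utf-8")) % len(self.shards)

    def index(self, doc_type, doc_id, source, routing=None):
        # The routing value defaults to the document id.
        shard_no = self.shard_for(routing if routing is not None else doc_id)
        self.shards[shard_no][(doc_type, doc_id)] = source
        return shard_no

    def copies(self, doc_type, doc_id):
        return sum((doc_type, doc_id) in shard for shard in self.shards)


idx = ToyShardedIndex(num_shards=2)
first_shard = idx.index("tweet1", "1234", {"v": 1}, routing="user-0")

# Pick some other routing value that happens to map to a different shard.
other_routing = next(
    r for r in (f"user-{i}" for i in range(1, 100))
    if idx.shard_for(r) != first_shard
)

idx.index("tweet1", "1234", {"v": 2}, routing="user-0")       # same routing: update
same_routing_copies = idx.copies("tweet1", "1234")
idx.index("tweet1", "1234", {"v": 3}, routing=other_routing)  # different routing
different_routing_copies = idx.copies("tweet1", "1234")
print(same_routing_copies, different_routing_copies)
```

With the same routing value the second call replaces the document. With a different routing value the toy index ends up holding two documents under the same type/_id, one per shard, which is why the application has to keep the routing value stable.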

On Mon, Sep 12, 2011 at 12:45 PM, Per Steffensen <steff@designware.dk> wrote:

    But for that to work, will it not require me, as a user of ES [...], to
    make sure that documents with the same type/_id are routed to the same
    shard? [...]


(Steff) #8

Shay Banon wrote:

    If you use a custom routing value, then you have to make sure you use
    that routing value when you want to update the document, yes.

Thanks. I would state that very clearly in the documentation about routing.


(onejigtwo) #9

Would it thus make sense to generate an id for the document composed of the field values that uniquely identify it among other documents? For example, the id of a document that defines a location could be a string concatenation or encoded concatenation of (1) the country, (2) the city, (3) the longitude and latitude, and perhaps (4) a unique name. Would this be considered bad practice?
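
A rough sketch of that idea, assuming the identifying fields are normalized and then hashed rather than concatenated directly. The function name composite_id, the normalization rules and the rounding of coordinates to 5 decimals are illustrative choices, not something established in this thread.

```python
import hashlib
import json


def composite_id(country, city, lat, lon, name):
    """Derive a deterministic document id from the identifying fields."""
    # Normalize text and round coordinates so that trivially different
    # spellings of the same location do not produce different ids.
    key = [
        country.strip().lower(),
        city.strip().lower(),
        round(lat, 5),
        round(lon, 5),
        name.strip().lower(),
    ]
    canonical = json.dumps(key, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


a = composite_id("Denmark", "Copenhagen", 55.676100, 12.568300, "Town Hall")
b = composite_id("denmark ", "Copenhagen", 55.6761, 12.5683, "town hall")
c = composite_id("Denmark", "Aarhus", 56.1629, 10.2039, "Town Hall")
print(a == b, a == c, len(a))
```

Used as the _id together with the "create" op_type mentioned earlier in the thread, a second document with the same identifying fields would be rejected instead of silently replacing the first.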

