Join Possibilities for Nested / Parent-Child

Hi,

I'm trying to clear up in my mind, what possibilities I have to create
joins in ElasticSearch Indexes.

I have two Entities:

Contact[id, name, events]  and
Event[id, date, contacts],

that are related, in database terms, as a many-to-many.

I would like to do the following searches:

  1. Query contacts only

  2. Query activities only

  3. Query: contacts where contact.events.date >= today.

  4. Query: activities where activity.contacts.id == 123

  5. Query:

contact.id, contact.name, activity.date
where activity.date > today
and activity.date < today + 1week

(returning all of many rows)

  1. Query:

contact.id, contact.name, activity.date
where activity.date > today
and activity.date < today + 1week

(returning only the top n matches ordered by date)

How much of the above is possible?

Best Regards,

David.

All of those cases are possible and very straightforward with very simple
queries.

When comparing SQL statements to ES queries, the picture is not always
clear.

Facets are used when you get to need to aggregate your results.

There isn't support for a JOIN between two different types, but when you use
a rich document store like ES you wouldn't model your one-to-many as
multiple tables; it would be a single document. In the case of many-to-many
like you describe, if you have a join query, you would most likely be
executing more than one ES query to get a similar result.

Ok thanks,

So looking at a vanilla one-to-many within the rich document model:

Contact ["david", 1]
|-- Event ["25.08.2011", 300]
|-- Event ["28.08.2011", 301]
.. up to 5000 events per contact

Contact ["john", 2]
|-- Event ["25.08.2011", 300]

... say for 20,000 contacts attending the same event (id=300).

A few questions

i) Is the same event Id saved once only, or for each contact it is a
child of?
ii) If the answer to the above is "once only", does that mean that
all 20,000 contacts associated with that event need to be saved on the
same shard as the associated event for the sake of efficiency of
search.
iii) If contacts can have, say 5000, events each one shared with a
different number of 20,000 contacts - then you can get the situation
where all the contacts and all the events have to be saved on the same
shard for the sake of performance

On Aug 24, 7:24 pm, James Cook jc...@tracermedia.com wrote:

All of those cases are possible and very straightforward with very simple
queries.

When comparing SQL statements to ES queries, the picture is not always
clear.

Facets are used when you get to need to aggregate your results.

There isn't support for a JOIN between two different types, but when you use
a rich document store like ES you wouldn't model your one-to-many as
multiple tables; it would be a single document. In the case of many-to-many
like you describe, if you have a join query, you would most likely be
executing more than one ES query to get a similar result.

Your Contact and Event mappings are actually a many-to-many like you stated
originally. I was not suggesting you turn them into a many-to-one
relationship. You would also want to be able to query these collections
bi-directionally.

In your case, I would model it like you would a database.

Event {
_id : ,
name:
location: {
address:
zip:
},
date:
}

Contact {
_id: ,
name: ,
ccard: {
number: ,
ccv: ,
expires:
}
}

EventContactRel {
_id:
contact: { _id: , name: },
event: { _id: , name: , date: }
}

So, you can see the EventContactRel document is a collection of
relationships between Contacts and Events. This would be the same thing you
would do in a relational database, however this table would contain only two
ids columns. We obviously don't have a JOIN syntax, so our data as a result
is not normalized in a similar manner as a relational database.

In the relationship table, I store an "abbreviated" version of my full Event
and Contact documents. I try to put as little information as possible into
my abbreviated objects, but enough to be able to support the queries I need
to perform and the search results I need to display on the screen.

You can also note that this approach can make updating of data more of a
hassle. If the name of my contact changes, or the date of an event moves, I
have to update the abbreviated objects in my relationship tables as well.
Life isn't always a bed of roses outside the relational database. :slight_smile:

-- jim

Hi Jim,

thanks for you detailled reply.

Currently we tend to need all of the fields in the two entities for
all queries.

Is there a possibility to do two/three queries:

i) First on the "join index" to retreive the ids that match.
ii) Then with a list of ids that match (this could be in the tens of
1000s) to get the fields on the "parent" indexes, that can then be
sorted.

Is there a possibility of optimizing the way that ii) is done?

Or would it be a matter of massive denormalisation (using all fields)
with all the required fields for those queries that require some kind
of join?

David.

On Aug 25, 3:04 pm, James Cook jc...@tracermedia.com wrote:

Your Contact and Event mappings are actually a many-to-many like you stated
originally. I was not suggesting you turn them into a many-to-one
relationship. You would also want to be able to query these collections
bi-directionally.

In your case, I would model it like you would a database.

Event {
_id : ,
name:
location: {
address:
zip:
},
date:

}

Contact {
_id: ,
name: ,
ccard: {
number: ,
ccv: ,
expires:
}

}

EventContactRel {
_id:
contact: { _id: , name: },
event: { _id: , name: , date: }

}

So, you can see the EventContactRel document is a collection of
relationships between Contacts and Events. This would be the same thing you
would do in a relational database, however this table would contain only two
ids columns. We obviously don't have a JOIN syntax, so our data as a result
is not normalized in a similar manner as a relational database.

In the relationship table, I store an "abbreviated" version of my full Event
and Contact documents. I try to put as little information as possible into
my abbreviated objects, but enough to be able to support the queries I need
to perform and the search results I need to display on the screen.

You can also note that this approach can make updating of data more of a
hassle. If the name of my contact changes, or the date of an event moves, I
have to update the abbreviated objects in my relationship tables as well.
Life isn't always a bed of roses outside the relational database. :slight_smile:

-- jim

I think Shay, Clinton or someone else would be better to weigh in on the
best way to structure a query that matches on an array containing
potentially thousands of id values.

-- jim

There is the MultiGetRequest APIhttp://www.elasticsearch.org/guide/reference/api/multi-get.htmlcall that can be made. I'm just not sure what its suitability is for
thousands of ids.

-- jim

The many to many one is tricky. You could decide to break it into a one to
many relationship (and use either nested or parent/child mapping), and see
if it fits, but, there is no magic in solving it for document based
databases... . You either pay on the query side doing joins yourself, or
index data "several times" to solve it.

On Fri, Aug 26, 2011 at 4:24 PM, James Cook jcook@tracermedia.com wrote:

There is the MultiGetRequest APIhttp://www.elasticsearch.org/guide/reference/api/multi-get.htmlcall that can be made. I'm just not sure what its suitability is for
thousands of ids.

-- jim

Hi Shay,

I feel ok about breaking the many-to-may into two or several one-to-
manys, but my NoSQL experience is not great and I just want to get an
idea of any restrictions I might need to bear in mind one I do that.
Also, advantages and disadvantages of parent-child and nested
approaches, other than nested having a better performance.

My plan would be to have a Contacts index with nested events, and an
events index with nested contacts, my questions are:

  1. How would nested contacts in the events index be updated and
    synchronized with the "main" contacts index.
  2. Are there any gotchas I should look out for in terms of
    performance, or updates, or queries when setting up two many-to-one
    style indexes in this way.

Also, regarding Jim's point is there a "reasonable" upper limit I
should bear in mind when doing a MultiGetRequest?

Best Regards,

David.

On Aug 26, 4:03 pm, Shay Banon kim...@gmail.com wrote:

The many to many one is tricky. You could decide to break it into a one to
many relationship (and use either nested or parent/child mapping), and see
if it fits, but, there is no magic in solving it for document based
databases... . You either pay on the query side doing joins yourself, or
index data "several times" to solve it.

On Fri, Aug 26, 2011 at 4:24 PM, James Cook jc...@tracermedia.com wrote:

There is the MultiGetRequest APIhttp://www.elasticsearch.org/guide/reference/api/multi-get.htmlcall that can be made. I'm just not sure what its suitability is for
thousands of ids.

-- jim

I would be looking at something similar to this...

My problem involves:

*) One to many with lots of documents (hundreds of thousands) on the one side, and a large size on the many side (a couple of thousand per document on the one side).
*) Many side is time-based so we want to TTL it, I believe best practices involve time-based indexing for the many side.
*) One to many would be suitable for the parent-child relationship, one side gets updated often, many side is write-once.

Any ideas on how to allow join queries here?

*) If we go time-based indexing on the many side we will no longer support the parent-child relationship correct?
*) If we do not go time-based indexing on the many side we will have to pay the price of huge indexes and expensive TTL purging? Any efficient way to do it?
*) If we do manual joins, we might bump into the maximum number of ids issue here again, what is that limit?

Thanks,

Francisco.

Please create your own thread. This one is 4 years old...

Will do!