Dealing with routing and multiple levels of relationships

andreasch · December 14, 2012, 9:45pm

Our project involves mirroring part of a database in ElasticSearch.
Relationships need to be mirrored as well and some of the relationships
span across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to work,
both the parent and child document need to reside on the same shard and
that aspect is controlled by using the same parent ID/routing. In the case
of multiple-level relationships, it means that the ID of the top-most
document must always be used as the parent ID/routing.

I understand why such requirement exists on the ElasticSearch side but the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that
translates to performing a multi-table join database query before every
indexing operation that involves a child document.

One of our main goals with using ElasticSearch is to reduce load on our
main database by off-loading search to ElasticSearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

Store/require routing for all type mappings
Retrieve parent in order to figure out the parent's routing
Store child with the parent's routing

In theory this seems that it would work but it has the following problems:

It assumes the parent already exists which may not be the case if the
index mirroring is done asynchronously
Retrieving the parent
Retrieving the parent using a GET operation requires routing but
routing is what we are trying to determine in the first place!
2. Retrieving the parent using a search requires that not only the
parent already exists but it is available through search. It means that a
refresh has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be greatly
appreciated.

--

andreasch · December 17, 2012, 7:34am

I am curious to know how others have addressed multiple-levels of
relationships:

Used only a single shard?
Placed all related document types into the same shard?
Decided not to support multi-level relationships?
Looked up the top-most ID by performing SQL join(s) against the database
before each index operation?
Adjusted their model in ES to avoid multiple level relationships?

Thanks,
Andreas

On Friday, December 14, 2012 2:45:41 PM UTC-7, Andreas Christoforides wrote:

Our project involves mirroring part of a database in Elasticsearch.
Relationships need to be mirrored as well and some of the relationships
span across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to work,
both the parent and child document need to reside on the same shard and
that aspect is controlled by using the same parent ID/routing. In the case
of multiple-level relationships, it means that the ID of the top-most
document must always be used as the parent ID/routing.

I understand why such requirement exists on the Elasticsearch side but the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that
translates to performing a multi-table join database query before every
indexing operation that involves a child document.

One of our main goals with using Elasticsearch is to reduce load on our
main database by off-loading search to Elasticsearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

Store/require routing for all type mappings

Retrieve parent in order to figure out the parent's routing

Store child with the parent's routing

In theory this seems that it would work but it has the following problems:

It assumes the parent already exists which may not be the case if
the index mirroring is done asynchronously

Retrieving the parent

Retrieving the parent using a GET operation requires routing but
routing is what we are trying to determine in the first place!
2. Retrieving the parent using a search requires that not only the
parent already exists but it is available through search. It means that a
refresh has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be greatly
appreciated.

--

mvg · December 17, 2012, 9:18am

Hi Andreas,

Supporting relational database like features is a challenge in ES (and
in any other flat storage system).
Here are some notes I have about your initial questions:

I'm not sure how to tackle this one, other then just make sure that
a consistent routing value is used for the parent and all its
children. How are you pushing data into ES from your database? How
often do you refresh your data?
The get operation requires the routing that was used during
indexing. What you could instead is doing a search (Using a filtered
query based on the id) instead of a get. This obviously is a more
expensive operation on the ES side compared to a simple get, but it
does reduce the database load.

Martijn

On 17 December 2012 08:34, Andreas Christoforides
andreas.christoforides@gmail.com wrote:

I am curious to know how others have addressed multiple-levels of
relationships:

Used only a single shard?

Placed all related document types into the same shard?

Decided not to support multi-level relationships?

Looked up the top-most ID by performing SQL join(s) against the database
before each index operation?

Adjusted their model in ES to avoid multiple level relationships?

Thanks,
Andreas

On Friday, December 14, 2012 2:45:41 PM UTC-7, Andreas Christoforides wrote:

Our project involves mirroring part of a database in Elasticsearch.
Relationships need to be mirrored as well and some of the relationships span
across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to work,
both the parent and child document need to reside on the same shard and that
aspect is controlled by using the same parent ID/routing. In the case of
multiple-level relationships, it means that the ID of the top-most document
must always be used as the parent ID/routing.

I understand why such requirement exists on the Elasticsearch side but the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that translates
to performing a multi-table join database query before every indexing
operation that involves a child document.

One of our main goals with using Elasticsearch is to reduce load on our
main database by off-loading search to Elasticsearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

Store/require routing for all type mappings
Retrieve parent in order to figure out the parent's routing
Store child with the parent's routing

In theory this seems that it would work but it has the following problems:

It assumes the parent already exists which may not be the case if the
index mirroring is done asynchronously
Retrieving the parent

Retrieving the parent using a GET operation requires routing but routing
is what we are trying to determine in the first place!
Retrieving the parent using a search requires that not only the parent
already exists but it is available through search. It means that a refresh
has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be greatly
appreciated.

--

--
Met vriendelijke groet,

Martijn van Groningen

--

andreasch · December 17, 2012, 4:43pm

Martijn,

Thanks for your reply. Yes, the biggest challenge so far is to choose
routing scheme that works for our relationships while still using multiple
shards.

In terms of how we plan to push data into ES, the current design is to
issue a message into an asynchronous queue after every database update on a
table that is being indexed.
The queue will be consumed by another application/river to place the
document in the index.

The reason we plan to use an asynchronous queue is to avoid blocking the
main application while performing an ES index. We also do not want to tie
our solution too close to the database schema which is why we are not
considering the JDBC river plugin.

Another approach I thought of was to identify relation graphs, give each
graph a unique ID and use that for the routing value. Unfortunately, as we
add more and more types into the index and as our model evolves those
graphs may start connecting which may require expensive and
complicate re-indexing to apply new routing values.

Andreas

On Monday, December 17, 2012 2:18:44 AM UTC-7, Martijn v Groningen wrote:

Hi Andreas,

Supporting relational database like features is a challenge in ES (and
in any other flat storage system).
Here are some notes I have about your initial questions:

I'm not sure how to tackle this one, other then just make sure that
a consistent routing value is used for the parent and all its
children. How are you pushing data into ES from your database? How
often do you refresh your data?

The get operation requires the routing that was used during
indexing. What you could instead is doing a search (Using a filtered
query based on the id) instead of a get. This obviously is a more
expensive operation on the ES side compared to a simple get, but it
does reduce the database load.

Martijn

On 17 December 2012 08:34, Andreas Christoforides
<andreas.chr...@gmail.com <javascript:>> wrote:

I am curious to know how others have addressed multiple-levels of
relationships:

Used only a single shard?

Placed all related document types into the same shard?

Decided not to support multi-level relationships?

Looked up the top-most ID by performing SQL join(s) against the
database
before each index operation?

Adjusted their model in ES to avoid multiple level relationships?

Thanks,
Andreas

On Friday, December 14, 2012 2:45:41 PM UTC-7, Andreas Christoforides
wrote:

Our project involves mirroring part of a database in Elasticsearch.
Relationships need to be mirrored as well and some of the relationships
span
across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to
work,
both the parent and child document need to reside on the same shard and
that
aspect is controlled by using the same parent ID/routing. In the case
of
multiple-level relationships, it means that the ID of the top-most
document
must always be used as the parent ID/routing.

I understand why such requirement exists on the Elasticsearch side but
the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that
translates
to performing a multi-table join database query before every indexing
operation that involves a child document.

One of our main goals with using Elasticsearch is to reduce load on our
main database by off-loading search to Elasticsearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

Store/require routing for all type mappings
Retrieve parent in order to figure out the parent's routing
Store child with the parent's routing

In theory this seems that it would work but it has the following
problems:

It assumes the parent already exists which may not be the case if the
index mirroring is done asynchronously
Retrieving the parent

Retrieving the parent using a GET operation requires routing but
routing
is what we are trying to determine in the first place!
Retrieving the parent using a search requires that not only the parent
already exists but it is available through search. It means that a
refresh
has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be
greatly
appreciated.

--

--
Met vriendelijke groet,

Martijn van Groningen

--

phill · January 3, 2013, 2:00am

On 12/14/2012 1:45 PM, Andreas Christoforides wrote:

In theory this seems that it would work but it has the following problems:

It assumes the parent already exists which may not be the case if
the index mirroring is done asynchronously

I do not believe a parent has to exist in order to index a child. When
you index a "nested" object everything exists at indez time, but for
parent/child the rules are relaxed
(thus nested queries are faster than has_child queries).

All you need is the ability to calculate or find the right parent ID,
given a child. Because this is the case, I certainly would consider
storing the ES _parent ID in the DB entry for the child, so that when
you receive a child for indexing it knows what the parent ought to be.

-Paul

--

Topic		Replies	Views
Parent/child relations in elasticsearch Elasticsearch	5	515	July 6, 2017
Parent/Child Relationship & Routing a Document HEAD request Elasticsearch	2	690	July 5, 2017
Having grandchildren in index with multiple shards Elasticsearch	4	369	January 11, 2019
Parent/Child functionality Elasticsearch	7	1560	July 6, 2017
Routing in Elasticsearch Elasticsearch	1	311	July 6, 2017

Dealing with routing and multiple levels of relationships

Related topics