Dealing with routing and multiple levels of relationships

Our project involves mirroring part of a database in ElasticSearch.
Relationships need to be mirrored as well and some of the relationships
span across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to work,
both the parent and child document need to reside on the same shard and
that aspect is controlled by using the same parent ID/routing. In the case
of multiple-level relationships, it means that the ID of the top-most
document must always be used as the parent ID/routing.

I understand why such requirement exists on the ElasticSearch side but the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that
translates to performing a multi-table join database query before every
indexing operation that involves a child document.

One of our main goals with using ElasticSearch is to reduce load on our
main database by off-loading search to ElasticSearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

  1. Store/require routing for all type mappings
  2. Retrieve parent in order to figure out the parent's routing
  3. Store child with the parent's routing

In theory this seems that it would work but it has the following problems:

  1. It assumes the parent already exists which may not be the case if the
    index mirroring is done asynchronously
  2. Retrieving the parent
  3. Retrieving the parent using a GET operation requires routing but
    routing is what we are trying to determine in the first place!
    2. Retrieving the parent using a search requires that not only the
    parent already exists but it is available through search. It means that a
    refresh has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be greatly
appreciated.

--

I am curious to know how others have addressed multiple-levels of
relationships:

  1. Used only a single shard?
  2. Placed all related document types into the same shard?
  3. Decided not to support multi-level relationships?
  4. Looked up the top-most ID by performing SQL join(s) against the database
    before each index operation?
  5. Adjusted their model in ES to avoid multiple level relationships?

Thanks,
Andreas

On Friday, December 14, 2012 2:45:41 PM UTC-7, Andreas Christoforides wrote:

Our project involves mirroring part of a database in Elasticsearch.
Relationships need to be mirrored as well and some of the relationships
span across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to work,
both the parent and child document need to reside on the same shard and
that aspect is controlled by using the same parent ID/routing. In the case
of multiple-level relationships, it means that the ID of the top-most
document must always be used as the parent ID/routing.

I understand why such requirement exists on the Elasticsearch side but the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that
translates to performing a multi-table join database query before every
indexing operation that involves a child document.

One of our main goals with using Elasticsearch is to reduce load on our
main database by off-loading search to Elasticsearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

  1. Store/require routing for all type mappings
  2. Retrieve parent in order to figure out the parent's routing
  3. Store child with the parent's routing

In theory this seems that it would work but it has the following problems:

  1. It assumes the parent already exists which may not be the case if
    the index mirroring is done asynchronously
  2. Retrieving the parent
  3. Retrieving the parent using a GET operation requires routing but
    routing is what we are trying to determine in the first place!
    2. Retrieving the parent using a search requires that not only the
    parent already exists but it is available through search. It means that a
    refresh has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be greatly
appreciated.

--

Hi Andreas,

Supporting relational database like features is a challenge in ES (and
in any other flat storage system).
Here are some notes I have about your initial questions:

  1. I'm not sure how to tackle this one, other then just make sure that
    a consistent routing value is used for the parent and all its
    children. How are you pushing data into ES from your database? How
    often do you refresh your data?
  2. The get operation requires the routing that was used during
    indexing. What you could instead is doing a search (Using a filtered
    query based on the id) instead of a get. This obviously is a more
    expensive operation on the ES side compared to a simple get, but it
    does reduce the database load.

Martijn

On 17 December 2012 08:34, Andreas Christoforides
andreas.christoforides@gmail.com wrote:

I am curious to know how others have addressed multiple-levels of
relationships:

  1. Used only a single shard?
  2. Placed all related document types into the same shard?
  3. Decided not to support multi-level relationships?
  4. Looked up the top-most ID by performing SQL join(s) against the database
    before each index operation?
  5. Adjusted their model in ES to avoid multiple level relationships?

Thanks,
Andreas

On Friday, December 14, 2012 2:45:41 PM UTC-7, Andreas Christoforides wrote:

Our project involves mirroring part of a database in Elasticsearch.
Relationships need to be mirrored as well and some of the relationships span
across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to work,
both the parent and child document need to reside on the same shard and that
aspect is controlled by using the same parent ID/routing. In the case of
multiple-level relationships, it means that the ID of the top-most document
must always be used as the parent ID/routing.

I understand why such requirement exists on the Elasticsearch side but the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that translates
to performing a multi-table join database query before every indexing
operation that involves a child document.

One of our main goals with using Elasticsearch is to reduce load on our
main database by off-loading search to Elasticsearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

Store/require routing for all type mappings
Retrieve parent in order to figure out the parent's routing
Store child with the parent's routing

In theory this seems that it would work but it has the following problems:

It assumes the parent already exists which may not be the case if the
index mirroring is done asynchronously
Retrieving the parent

Retrieving the parent using a GET operation requires routing but routing
is what we are trying to determine in the first place!
Retrieving the parent using a search requires that not only the parent
already exists but it is available through search. It means that a refresh
has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be greatly
appreciated.

--

--
Met vriendelijke groet,

Martijn van Groningen

--

Martijn,

Thanks for your reply. Yes, the biggest challenge so far is to choose
routing scheme that works for our relationships while still using multiple
shards.

In terms of how we plan to push data into ES, the current design is to
issue a message into an asynchronous queue after every database update on a
table that is being indexed.
The queue will be consumed by another application/river to place the
document in the index.

The reason we plan to use an asynchronous queue is to avoid blocking the
main application while performing an ES index. We also do not want to tie
our solution too close to the database schema which is why we are not
considering the JDBC river plugin.

Another approach I thought of was to identify relation graphs, give each
graph a unique ID and use that for the routing value. Unfortunately, as we
add more and more types into the index and as our model evolves those
graphs may start connecting which may require expensive and
complicate re-indexing to apply new routing values.

Andreas

On Monday, December 17, 2012 2:18:44 AM UTC-7, Martijn v Groningen wrote:

Hi Andreas,

Supporting relational database like features is a challenge in ES (and
in any other flat storage system).
Here are some notes I have about your initial questions:

  1. I'm not sure how to tackle this one, other then just make sure that
    a consistent routing value is used for the parent and all its
    children. How are you pushing data into ES from your database? How
    often do you refresh your data?
  2. The get operation requires the routing that was used during
    indexing. What you could instead is doing a search (Using a filtered
    query based on the id) instead of a get. This obviously is a more
    expensive operation on the ES side compared to a simple get, but it
    does reduce the database load.

Martijn

On 17 December 2012 08:34, Andreas Christoforides
<andreas.chr...@gmail.com <javascript:>> wrote:

I am curious to know how others have addressed multiple-levels of
relationships:

  1. Used only a single shard?
  2. Placed all related document types into the same shard?
  3. Decided not to support multi-level relationships?
  4. Looked up the top-most ID by performing SQL join(s) against the
    database
    before each index operation?
  5. Adjusted their model in ES to avoid multiple level relationships?

Thanks,
Andreas

On Friday, December 14, 2012 2:45:41 PM UTC-7, Andreas Christoforides
wrote:

Our project involves mirroring part of a database in Elasticsearch.
Relationships need to be mirrored as well and some of the relationships
span
across multiple entities. The current approach is to issue an index
operation after every database operation of an entity that needs to be
indexed.

I am already aware that in order for parent/child relationships to
work,
both the parent and child document need to reside on the same shard and
that
aspect is controlled by using the same parent ID/routing. In the case
of
multiple-level relationships, it means that the ID of the top-most
document
must always be used as the parent ID/routing.

I understand why such requirement exists on the Elasticsearch side but
the
requirement seems impractical for an application that is attempting to
mirror a database. Usually the ID of the top-most entity will not be
available when you update a child entity. So in other words, that
translates
to performing a multi-table join database query before every indexing
operation that involves a child document.

One of our main goals with using Elasticsearch is to reduce load on our
main database by off-loading search to Elasticsearch. The additional
querying will add even more load to the main database.

This is what I have tried so far when indexing a child document:

Store/require routing for all type mappings
Retrieve parent in order to figure out the parent's routing
Store child with the parent's routing

In theory this seems that it would work but it has the following
problems:

It assumes the parent already exists which may not be the case if the
index mirroring is done asynchronously
Retrieving the parent

Retrieving the parent using a GET operation requires routing but
routing
is what we are trying to determine in the first place!
Retrieving the parent using a search requires that not only the parent
already exists but it is available through search. It means that a
refresh
has occurred since the parent was indexed.

Any suggestions on my current approach or different ideas would be
greatly
appreciated.

--

--
Met vriendelijke groet,

Martijn van Groningen

--

On 12/14/2012 1:45 PM, Andreas Christoforides wrote:

In theory this seems that it would work but it has the following problems:

  1. It assumes the parent already exists which may not be the case if
    the index mirroring is done asynchronously

I do not believe a parent has to exist in order to index a child. When
you index a "nested" object everything exists at indez time, but for
parent/child the rules are relaxed
(thus nested queries are faster than has_child queries).

All you need is the ability to calculate or find the right parent ID,
given a child. Because this is the case, I certainly would consider
storing the ES _parent ID in the DB entry for the child, so that when
you receive a child for indexing it knows what the parent ought to be.

-Paul

--