My use case : Joining indices?

Hi there,

I have an issue, and I'm considering using ES to solve it.

I will try to simplify the use case at maximum (context = ecommerce) :

I have currently two tables in my mysql database.

  • One is a product table.
  • The other one is a relation table which contains : product_id, customer_id, scoring

That other table is becoming huge (500 millions records ~), and it's beginning to give some performance issues. I've been advised to try moving this to ES, since "it's made to handle huge quantity of data".

So I can move product table to ES, that's not a problem.
I can also move the relation table to ES, that's not a problem either.

However, I need to be able to query the product table with filtering on both table (for instance : get the list of product where product.color = blue and relation.scoring = 2 and relation.customer_id = X). Which requires a joining mechanism. I read that's not an option on ES, and we should use nested fields instead. But with my amount of data, that does not seems wise, each product document would be too heavy, right ?

I also read that using aliases could solve my issue, but after reading the documentation, it seems aliases are meant to be used for indices that have the same schema, but I'm not sure I get it right...

The other option I see is to perform 2 queries, and then merge the results, but it doesn't seem a good solution..

What do you think, can this be done somehow with ES ?

Note : calculating the scoring on the fly (without storing it anywhere) is not an option :confused:

I suppose it depends on the individual situation.

Create two indices and perform 2 query is a good start point.
If the number of the result of the query for product table (ex. product.color = blue) is much beyond 65,536, which is the default maximum number of terms for terms query, you have to try the second option.

Create denormalized single index is another good choice. You can donormalize on the client side, or if the product table is almost fixed, using enrich pipline could be a possible option. But if the product table has huge information, the first 2 index strategy could be better.

Of course there is a solution in between. Just add some frequently queried product information to relation index and query on product table only when some rarely queried information is queried.

If product_id is unique for one row in the relation table, you don't need nested fields. Only additional normal fields are necessary.

2 Likes

Also think and test out combine both table and create one index in ES.

I have a monthly index from last two year (different type of data) and each month I have 100million record.
I can still search on them without any problem.

From that I have created another aggregated index by using groupby on some field.

I bet you might not be interested in very old data either hence you can move around your older data in index where it is not accessed directly when you run search. and only hit them if you don't find that in primary index.

2 Likes

Thank you, both of you !

I didn't know about enrich pipline, thanks for that. It could be usefull for some future use cases ! However, in this particular case that will not be an option for me.

I think I'll give a try with "two indices + merging query response" and "combining both table in one ES indice" and compare performances (write & read).

Thanks again!

The general recommendation is to denormalize data as per the options you have listed as this typically gives the best query performance. Another option could however be to use a parent-child mapping within a single index using the join field type. This provides the kind of relationship you are looking for but naturally comes with some tradeoffs. There will be additional overhead as well as lower query performance. The query syntax will also be more complex as you will need to use has_child and/or has_parent queries.

2 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.