MySQL JDBC river: indexing large arrays

I'm trying to index my movie DB into ES using the MySQL JDBC river.

The problem is:
there are 3 tables:
movies - has many columns
persons - names of the people who participated in a movie
tags - movie tags

I'm indexing it using a query like this (not the exact query, just pseudo-code to explain the problem):

SELECT m.*, persons.name, tags.value
FROM movies m
JOIN persons
JOIN tags

There are quite a lot of movies, each with many columns, and each movie has something like 10-30 persons and about 1,000 tags.
Because of the joins, every movie's data gets duplicated 10,000-30,000 times in the result set (e.g. 20 persons × 1,000 tags = 20,000 rows per movie).
That creates a huge overhead: one indexing run takes more than an hour, but I need to re-index the data every day.

Is there a way to index arrays without duplicating all the data?
I tried splitting the query into three:

SELECT id AS _id, movies.* FROM movies
SELECT movie_id AS _id, name FROM persons
SELECT movie_id AS _id, value FROM tags
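
Roughly, I wired them into the river definition as a list in the "sql" parameter, something like this (the connection settings and river name are just placeholders, and I might be misremembering the exact config keys):

curl -XPUT 'localhost:9200/_river/movies_river/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/moviedb",
    "user" : "...",
    "password" : "...",
    "sql" : [
      "SELECT id AS _id, movies.* FROM movies",
      "SELECT movie_id AS _id, name FROM persons",
      "SELECT movie_id AS _id, value FROM tags"
    ]
  }
}'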

But these queries overwrite each other instead of updating.

Can anybody help me? It looks like the plugin wasn't designed for such cases and I'd need to write my own strategy (but I don't write Java :()

You should rethink your SQL statement, because JOIN does not work like this: without join conditions you get a cross product, and the number of result rows explodes.

You have to define conditions so that the rows of the tables match.

For example

SELECT m.*, persons.name, tags.value
FROM movies m
JOIN persons ON persons.id = m....
JOIN tags ON tags.id = m....

Jörg


Nope, that's not the case.
My SQL was pseudo-code to demonstrate the logic.

In reality it looks more like:

SELECT m.*, p.name, t.value
FROM movies m
JOIN movies_persons mp ON mp.movie_id = m.id
JOIN persons p ON mp.person_id = p.id
JOIN movies_tags mt ON mt.movie_id = m.id
JOIN tags t ON mt.tag_id = t.id

And yes, there is a big number of rows anyway, because each movie has many persons and tags. There can be 10,000 very simple tags, and each tag duplicates the whole movie record in the result, so everything slows down... That's the problem.

You can index each table separately, and after a query, you can set up a second multi-get operation to look up the related documents by their IDs.
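
For example, a rough sketch with the multi-get API (the index/type names and IDs are only placeholders):

curl -XGET 'localhost:9200/_mget' -d '{
  "docs" : [
    { "_index" : "persons", "_type" : "person", "_id" : "12" },
    { "_index" : "tags", "_type" : "tag", "_id" : "34" }
  ]
}'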

Jörg


Do you mean creating a separate index for each table and then joining them somehow when searching?

If you don't want to denormalize the data, yes.

Jörg


Yeah (

The best solution for me here would be loading the data in 3 steps:

  1. SELECT id AS _id, movies.* FROM movies

  2. SELECT mp.movie_id AS _id, p.name
    FROM persons p
    JOIN movies_persons mp ON mp.person_id = p.id

  3. SELECT mt.movie_id AS _id, t.value
    FROM tags t
    JOIN movies_tags mt ON mt.tag_id = t.id

but with updating/merging instead of overwriting. It could be achieved using the unique "_id" key.
Isn't that possible with ES? I think it would be an interesting option for future versions of the plugin.
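
The merge semantics I mean is roughly what the update API already does for a single document (the index/type names here are just placeholders):

curl -XPOST 'localhost:9200/movies/movie/1/_update' -d '{
  "doc" : {
    "persons" : ["Person A", "Person B"]
  }
}'

Here the "persons" field is merged into the existing movie document instead of the whole document being replaced (though the array itself is still sent as a whole, since "doc" replaces the field value rather than appending to it).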

I will take a look, but this kind of merging is exactly why I wrote the row-processing routine in the JDBC plugin.

The _id of the ES doc can not be used for this; the uniqueness must be provided by the SQL result.

Jörg


Thank you Jörg.

Another option could be to parse concatenated strings into arrays. For example:

SELECT m.*, GROUP_CONCAT(persons.name), GROUP_CONCAT(tags.value)
FROM movies m
JOIN persons ON persons.id = m....
JOIN tags ON tags.id = m....
GROUP BY m.id

where the GROUP_CONCAT fields would be split into arrays. That approach looks much faster than full denormalization.
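
If I go this way I'd also have to raise group_concat_max_len (MySQL truncates GROUP_CONCAT output at 1024 bytes by default) and use an explicit separator for splitting, something like:

SET SESSION group_concat_max_len = 1000000;

SELECT m.*,
  GROUP_CONCAT(DISTINCT p.name SEPARATOR '|') AS persons,
  GROUP_CONCAT(DISTINCT t.value SEPARATOR '|') AS tags
FROM movies m
JOIN persons p ON p.id = m....
JOIN tags t ON t.id = m....
GROUP BY m.id

The splitting of the concatenated strings into real arrays would still have to happen somewhere on the ES side, though - I'm not sure the plugin can do that out of the box.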