MySQL JDBC river: indexing large arrays

I'm trying to index my movie DB into ES using the MySQL JDBC river.

The problem is:
there are 3 tables:
movies - has many columns
persons - names of the people who participated in a movie
tags - movie tags

I'm indexing it using a query like this (not the exact query, just pseudo-code to explain the problem):

SELECT m.*, persons.name, tags.value
FROM movies m
JOIN persons
JOIN tags

There are quite a lot of movies, each with many columns, and each movie has something like 10-30 persons and about 1,000 tags.
Because of the joins, every movie's data gets duplicated 10,000-30,000 times in the result set (e.g. 20 persons × 1,000 tags = 20,000 rows per movie).
That creates a huge overhead: one indexing run takes more than an hour, but I need to re-index the data every day.

Is there a way to index arrays without duplicating all the data?
I tried splitting the query into three:

SELECT id AS _id, movies.* FROM movies
SELECT movie_id AS _id, name FROM persons
SELECT movie_id AS _id, value FROM tags
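
Roughly, I wired them into the river definition as a list in the "sql" parameter, something like this (the connection settings and river name are just placeholders, and I might be misremembering the exact config keys):

curl -XPUT 'localhost:9200/_river/movies_river/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "url" : "jdbc:mysql://localhost:3306/moviedb",
    "user" : "...",
    "password" : "...",
    "sql" : [
      "SELECT id AS _id, movies.* FROM movies",
      "SELECT movie_id AS _id, name FROM persons",
      "SELECT movie_id AS _id, value FROM tags"
    ]
  }
}'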

But these queries overwrite each other instead of updating.

Can anybody help me? It looks like the plugin wasn't designed for such cases and I'd need to write my own strategy (but I don't write Java :()

You should rethink your SQL statement, because JOIN does not work like this: without join conditions you get a cross product, and the number of result rows explodes.

You have to define conditions so that the rows of the tables match.

For example

SELECT m.*, persons.name, tags.value
FROM movies m
JOIN persons ON persons.id = m....
JOIN tags ON tags.id = m....

Jörg


Nope, that's not the case.
My SQL was pseudo-code to demonstrate the logic.

In reality it looks more like:

SELECT m.*, p.name, t.value
FROM movies m
JOIN movies_persons mp ON mp.movie_id = m.id
JOIN persons p ON mp.person_id = p.id
JOIN movies_tags mt ON mt.movie_id = m.id
JOIN tags t ON mt.tag_id = t.id

And yes, there is a big number of rows anyway, because each movie has many persons and tags. There can be 10,000 very simple tags, and each tag duplicates the whole movie record in the result, so everything slows down... That's the problem.

You can index each table separately, and after a query, you can set up a second multi-get operation to look up the related documents by their IDs.
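
For example, a rough sketch with the multi-get API (the index/type names and IDs are only placeholders):

curl -XGET 'localhost:9200/_mget' -d '{
  "docs" : [
    { "_index" : "persons", "_type" : "person", "_id" : "12" },
    { "_index" : "tags", "_type" : "tag", "_id" : "34" }
  ]
}'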

Jörg


Do you mean creating a separate index for each table and then joining them somehow when searching?

If you don't want to denormalize the data, yes.

Jörg


Yeah (

The best solution for me here would be loading the data in 3 steps:

  1. SELECT id AS _id, movies.* FROM movies

  2. SELECT mp.movie_id AS _id, p.name
    FROM persons p
    JOIN movies_persons mp ON mp.person_id = p.id

  3. SELECT mt.movie_id AS _id, t.value
    FROM tags t
    JOIN movies_tags mt ON mt.tag_id = t.id

but with updating/merging instead of overwriting. It could be achieved using the unique "_id" key.
Isn't that possible with ES? I think it would be an interesting option for future versions of the plugin.
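
The merge semantics I mean is roughly what the update API already does for a single document (the index/type names here are just placeholders):

curl -XPOST 'localhost:9200/movies/movie/1/_update' -d '{
  "doc" : {
    "persons" : ["Person A", "Person B"]
  }
}'

Here the "persons" field is merged into the existing movie document instead of the whole document being replaced (though the array itself is still sent as a whole, since "doc" replaces the field value rather than appending to it).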

I will take a look, but this kind of merging is exactly why I wrote the row-processing routine in the JDBC plugin.

The _id of the ES doc can not be used for this; the uniqueness must be provided by the SQL result.

Jörg


Thank you Jörg.

Another option could be to parse concatenated strings into arrays. For example:

SELECT m.*, GROUP_CONCAT(persons.name), GROUP_CONCAT(tags.value)
FROM movies m
JOIN persons ON persons.id = m....
JOIN tags ON tags.id = m....
GROUP BY m.id

where the GROUP_CONCAT fields would be split into arrays. That approach looks much faster than full denormalization.
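
If I go this way I'd also have to raise group_concat_max_len (MySQL truncates GROUP_CONCAT output at 1024 bytes by default) and use an explicit separator for splitting, something like:

SET SESSION group_concat_max_len = 1000000;

SELECT m.*,
  GROUP_CONCAT(DISTINCT p.name SEPARATOR '|') AS persons,
  GROUP_CONCAT(DISTINCT t.value SEPARATOR '|') AS tags
FROM movies m
JOIN persons p ON p.id = m....
JOIN tags t ON t.id = m....
GROUP BY m.id

The splitting of the concatenated strings into real arrays would still have to happen somewhere on the ES side, though - I'm not sure the plugin can do that out of the box.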