I'm trying to index my movie DB into ES using MySQL JDBC river.
The problem is:
there are 3 tables:
movies - has many columns
persons - names of the people who participated in some movie
tags - movie tags
I'm indexing it using such query (it's not exact query, just pseudo-code to explain the problem):
SELECT movies.*, persons.name, tags.value
FROM movies m
JOIN persons
JOIN tags
There are quite many movies, each has many columns and each of movies has something like 10-30 persons and 1000 tags as well.
Thus because of the joins all the movies data is duplicated 10.000-30.000 times in the resulting set.
That leads to a great overload, one indexation takes more than hour, but I need to re-index the data each day.
Is there a way to index arrays without duplicating all the data?
I tried it in that way - split the query into 3 ones:
SELECT id as _id, * FROM movies
SELECT movie_id as _id, name FROM persons
SELECT movie_id as _id, value FROM tags
But these queries overwrite each other instead of updating.
Can anybody help me? Looks like the plugin wasn't designed for such cases and I need to write my own strategy (however I don't write in Java :()
I'm trying to index my movie DB into ES using MySQL JDBC river.
The problem is:
there are 3 tables:
movies - has many columns
persons - names of the people who participated in some movie
tags - movie tags
I'm indexing it using such query (it's not exact query, just pseudo-code to
explain the problem):
SELECT movies.*, persons.name, tags.value
FROM movies m
JOIN persons
JOIN tags
There are quite many movies, each has many columns and each of movies has
something like 10-30 persons and 1000 tags as well.
Thus because of the joins all the movies data is duplicated 10.000-30.000
times in the resulting set.
That leads to a great overload, one indexation takes more than hour, but I
need to re-index the data each day.
Is there a way to index arrays without duplicating all the data?
I tried it in that way - split the query into 3 ones:
SELECT id as _id, * FROM movies
SELECT movie_id as _id, name FROM persons
SELECT movie_id as _id, value FROM tags
But these queries overwrite each other instead of updating.
Can anybody help me? Looks like the plugin wasn't designed for such cases
and I need to write my own strategy (however I don't write in Java :()
Nope, that's not the case.
My SQL was a psedo-code to demonstrate the logic.
In reality it looks more like:
SELECT movies.*, persons.name, tags.value
FROM movies m
JOIN movies_persons mp ON mp.movie_id = m.id
JOIN persons p ON mp.person_id = p.id
JOIN movies_tags mt ON mt.movie_id = m.id
JOIN tags t ON mt.tag_id = t.id
And yes, there is big number of rows anyway because each movie has many persons and tags. There can be 10000 very simple tags, but each tag cause duplicating all the movie record in the result and everything slows down... That's the problem.
Nope, that's not the case.
My SQL was a psedo-code to demonstrate the logic.
In reality it looks more like:
SELECT movies.*, persons.name, tags.value
FROM movies m
JOIN movies_persons mp ON mp.movie_id = m.id
JOIN persons p ON mp.person_id = p.id
JOIN movies_tags mt ON mt.movie_id = m.id
JOIN tags t ON mt.tag_id = t.id
And yes, there is big number of rows anyway because each movie has many
persons and tags. There can be 10000 very simple tags, but each tag cause
duplicating all the movie record in the result and everything slows down...
That's the problem.
The best solution for me here would be loading the data in 3 steps:
SELECT id as _id, * FROM movies
SELECT mp.movie_id as _id, p.name
FROM persons p
JOIN movies_persons mp ON mp.person_id = p.id
SELECT mt.movie_id as _id, t.value
FROM tags t
JOIN movies_tags mt ON mt.tag_id = t.id
but with updating/merging instead of rewriting. It could be achieved using unique "_id" key.
Isn't that possible with ES? I think it would be an interesting option for the next versions of the plugin.
The best solution for me here would be loading the data in 3 steps:
SELECT id as _id, * FROM movies
SELECT mp.movie_id as _id, p.name
FROM persons p
JOIN movies_persons mp ON mp.person_id = p.id
SELECT mt.movie_id as _id, t.value
FROM tags t
JOIN movies_tags mt ON mt.tag_id = t.id
but with updating/merging instead of rewriting. It could be achieved using
unique "_id" key.
Isn't that possible with ES? I think it would be an interesting option for
the next versions of the plugin.
Another option could be parsing some strings into arrays. For example:
SELECT movies.*, GROUP_CONCAT(persons.name), GROUP_CONCAT(tags.value)
FROM movies m
JOIN persons on persons.id = movies....
JOIN tags on tags.id = movies....
GROUP BY m.id
where group-concat fields would be converted into arrays. That way looks to be much more faster than denormalization.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.