I have two tables (each with millions of records) in a MySQL database that I want to be able to index and search. I love the features of Elasticsearch, but I'm not sure how to approach this. The tables are pretty static, and change only once per day (based on a cron job I run). Can I get some feedback on how I would go about implementing Elasticsearch with this kind of setup?

On Thu, Apr 19, 2012 at 2:02 AM, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:

I have a similar setup and I just have a cronjob that slurps all that data up daily and spews it into Elasticsearch. You can either build a new index daily (simpler) or update the data in-place.
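
In case it helps to see the "new index daily" variant spelled out: the usual trick is to load into a date-stamped index and then flip an alias, so searches never see a half-built index. A minimal sketch against the plain HTTP API, with the alias name, index naming scheme, and URL all made up for illustration (Python here purely for brevity; the same calls work from any client):

    # Sketch only: rebuild a fresh index every day, then flip an alias to it.
    # The alias/index names and the ES URL are made up for illustration.
    import datetime
    import json
    import requests

    ES = "http://localhost:9200"
    ALIAS = "products"                                   # what the application queries
    today = datetime.date.today()
    new_index = "%s-%s" % (ALIAS, today.strftime("%Y%m%d"))
    old_index = "%s-%s" % (ALIAS, (today - datetime.timedelta(days=1)).strftime("%Y%m%d"))

    # 1. Create today's index (mappings/settings would go in the request body).
    requests.put("%s/%s" % (ES, new_index))

    # 2. ...bulk-load everything from MySQL into new_index here...

    # 3. Point the alias at the new index, then drop yesterday's index;
    #    deleting the old index also removes its alias entry.
    actions = {"actions": [{"add": {"alias": ALIAS, "index": new_index}}]}
    requests.post("%s/_aliases" % ES, data=json.dumps(actions))
    requests.delete("%s/%s" % (ES, old_index))
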
You may want to take a look at Scrutineer for inspiration
https://github.com/Aconex/scrutineer
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
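
For a sense of what that kind of check involves without pulling in a whole tool, here is a rough sketch of the underlying idea (compare what MySQL holds against what the index holds). This is not Scrutineer's implementation, and the table, index, and connection details are invented:

    # Rough consistency check between MySQL and Elasticsearch: the idea behind
    # tools like Scrutineer, not its actual implementation. All names are invented.
    import pymysql
    import requests

    ES = "http://localhost:9200/myindex/mytype"
    conn = pymysql.connect(host="localhost", user="app", password="secret", db="mydb")

    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM mytable")
        db_count = cur.fetchone()[0]
    es_count = requests.get(ES + "/_count").json()["count"]
    print("rows in mysql: %d, docs in elasticsearch: %d" % (db_count, es_count))

    # Spot-check a handful of primary keys: each should exist as a document ID.
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM mytable ORDER BY RAND() LIMIT 10")
        for (row_id,) in cur.fetchall():
            if requests.head("%s/%s" % (ES, row_id)).status_code != 200:
                print("missing from index: %s" % row_id)
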

On Fri, Apr 20, 2012 at 7:14 PM, banderon1 <banderon1@gmail.com> wrote:

Could you point to some documentation about how to "slurp all that data up and spew it into Elasticsearch"? I'm a n00b, and have yet to find clear instructions in the ES documentation. I'm planning on using a PHP client (Elastica), but if that's not needed, let me know how you are doing it. Thanks!

On Saturday, April 21, 2012 8:09:25 AM UTC-7, kimchy wrote:

Read the data from MySQL and use the bulk API to index it into Elasticsearch.
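
For reference, the bulk API is just newline-delimited JSON POSTed to the _bulk endpoint: one action line followed by one source line per document. A tiny hand-rolled sketch (index, type, and field names are made up; the Elastica addDocuments call mentioned later in the thread wraps the same kind of request):

    # A bulk request built by hand: pairs of action + document lines, newline-terminated.
    # Index, type, and field names are made up.
    import json
    import requests

    rows = [
        {"id": 1, "title": "first record"},
        {"id": 2, "title": "second record"},
    ]

    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": "myindex", "_type": "mytype", "_id": row["id"]}}))
        lines.append(json.dumps({"title": row["title"]}))
    body = "\n".join(lines) + "\n"                     # the trailing newline is required

    resp = requests.post("http://localhost:9200/_bulk", data=body)
    print(resp.json()["items"])                        # one result entry per document
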

On Mon, Apr 23, 2012 at 2:21 PM, banderon1 <banderon1@gmail.com> wrote:

Thank you, I am finally starting to understand how this works! I am able to create an index and add documents; however, when loading the records in from MySQL, PHP is crashing due to excessive memory usage. I am able to dump the data into a JSON text file (using SELECT CONCAT()). Can I index this "static" data to avoid the extra work on the server?

On Mon, Apr 23, 2012 at 21:15, Michael Sick <michael.sick@serenesoftware.com> wrote:

It would help to know some details on how you're building the request. Which API are you using (Bulk/Index)? How many requests are you building up before submitting? ...

--Mike

On Monday, April 23, 2012 12:58:34 PM UTC-7, Ævar Arnfjörð Bjarmason wrote:

In my case the whole thing is:

- SELECT some stuff FROM table

- Read that from the database with an iterator and buffer up however many things make sense; in my case I buffer up 1-10k documents depending on what I'm inserting, but it doesn't really matter that much. The bulk API is mainly "bulk" in the sense that you can send a lot of stuff over at once; it doesn't allow Elasticsearch to do any special optimizations when inserting the data.

- If anything fails I try again 10 times, sleeping 1..10 seconds between retries, respectively. Due to optimistic concurrency control you don't even have to parse the response to see which things were successfully indexed and only retry the ones that weren't, but you can do it if you're so inclined.

All in all it's like any other program you'd write to move data between systems. Does that answer your question?
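
For anyone who wants the skeleton of that loop in code, here is a sketch of the same flow (stream rows, buffer a few thousand documents, send a bulk request, retry with a growing sleep). It is written in Python with pymysql and requests rather than the PHP/Elastica the original poster is using, and every name, credential, and number in it is a placeholder:

    # Skeleton of the cron job described above: stream rows out of MySQL, buffer a few
    # thousand documents, send each buffer with the bulk API, retry with backoff on
    # failure. Every name, credential, and number here is a placeholder.
    import json
    import time
    import pymysql
    import requests

    ES_BULK = "http://localhost:9200/myindex/mytype/_bulk"
    BATCH = 5000                                       # anywhere in the 1-10k range is fine

    def send_batch(lines, retries=10):
        body = "\n".join(lines) + "\n"
        for attempt in range(1, retries + 1):
            try:
                if requests.post(ES_BULK, data=body).status_code == 200:
                    return
            except requests.RequestException:
                pass
            time.sleep(attempt)                        # sleep 1..10 seconds between retries
        raise RuntimeError("bulk request failed after %d attempts" % retries)

    conn = pymysql.connect(host="localhost", user="app", password="secret", db="mydb",
                           cursorclass=pymysql.cursors.SSCursor)  # rows stream instead of loading at once

    buf = []
    with conn.cursor() as cur:
        cur.execute("SELECT id, title, body FROM mytable")
        for row_id, title, body_text in cur:
            buf.append(json.dumps({"index": {"_id": row_id}}))
            buf.append(json.dumps({"title": title, "body": body_text}))
            if len(buf) >= BATCH * 2:                  # two lines per document
                send_batch(buf)
                buf = []
    if buf:
        send_batch(buf)
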
Right now I am using the Elastica PHP client. I am calling addDocuments, which uses the bulk API. I have over a million 'documents' that need to be indexed.

On Mon, Apr 23, 2012 at 4:24 PM, banderon1 <banderon1@gmail.com> wrote:

That works, but I'm just trying to optimize the process. Since I already have the data in a JSON object, I'm wondering if I can skip MySQL in this case.
If MySQL is your system of record, it seems best to fetch the data from there, since ES doesn't participate in distributed transactions AFAIK and you may have a consistency issue.

I always follow Ævar's advice above and build in control over the size of the batch submitted. It's not surprising that grabbing a batch of 1M anything can challenge memory, as it's likely you have a few copies hanging around as you're serializing/deserializing. Using the bulk API seems to help with routing and performance on the server side, but I suspect it's memory hungry on the client side, as you're building a very large JSON document.

Then I'd try varying the batch size to see what's optimal (10k, 100k, ...). If it's likely that the big load scenario is rare and that you're only indexing some small percent of your data on a daily basis, then I'd recommend that you only optimize it enough to know that you could do a full recovery this way if you had to.

--Mike
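
One low-effort way to act on that last suggestion is to time the same sample of documents at a few different chunk sizes against a throwaway index and keep whichever batch size wins. A crude sketch, with the endpoint, sample data, and sizes all made up:

    # Crude timing of a few bulk batch sizes against the same sample, aimed at a
    # throwaway index. Endpoint, document shape, and sizes are all made up.
    import json
    import time
    import requests

    ES_BULK = "http://localhost:9200/throwaway/doc/_bulk"
    sample = [{"title": "doc %d" % i} for i in range(100000)]

    for batch_size in (1000, 10000, 100000):
        start = time.time()
        for i in range(0, len(sample), batch_size):
            lines = []
            for doc in sample[i:i + batch_size]:
                lines.append(json.dumps({"index": {}}))    # let Elasticsearch assign IDs
                lines.append(json.dumps(doc))
            requests.post(ES_BULK, data="\n".join(lines) + "\n")
        print("batch_size=%-6d %.1f seconds" % (batch_size, time.time() - start))
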