I have two tables (each with millions of records) in a MySQL database that I want to be able to index and search. I love the features of Elasticsearch, but I'm not sure how to approach this. The tables are pretty static, and change only once per day (based on a cron job I run). Can I get some feedback on how I would go about implementing Elasticsearch with this kind of setup?

On Thu, Apr 19, 2012 at 2:02 AM, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:

I have a similar setup and I just have a cronjob that slurps all that data up daily and spews it into Elasticsearch. You can either build a new index daily (simpler) or update the data in-place.
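
In case it helps to see the "new index daily" variant spelled out: the usual trick is to load into a date-stamped index and then flip an alias, so searches never see a half-built index. A minimal sketch against the plain HTTP API, with the alias name, index naming scheme, and URL all made up for illustration (Python here purely for brevity; the same calls work from any client):

    # Sketch only: rebuild a fresh index every day, then flip an alias to it.
    # The alias/index names and the ES URL are made up for illustration.
    import datetime
    import json
    import requests

    ES = "http://localhost:9200"
    ALIAS = "products"                                   # what the application queries
    today = datetime.date.today()
    new_index = "%s-%s" % (ALIAS, today.strftime("%Y%m%d"))
    old_index = "%s-%s" % (ALIAS, (today - datetime.timedelta(days=1)).strftime("%Y%m%d"))

    # 1. Create today's index (mappings/settings would go in the request body).
    requests.put("%s/%s" % (ES, new_index))

    # 2. ...bulk-load everything from MySQL into new_index here...

    # 3. Point the alias at the new index, then drop yesterday's index;
    #    deleting the old index also removes its alias entry.
    actions = {"actions": [{"add": {"alias": ALIAS, "index": new_index}}]}
    requests.post("%s/_aliases" % ES, data=json.dumps(actions))
    requests.delete("%s/%s" % (ES, old_index))
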
You may want to take a look at Scrutineer for inspiration
https://github.com/Aconex/scrutineer
Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype
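
For a sense of what that kind of check involves without pulling in a whole tool, here is a rough sketch of the underlying idea (compare what MySQL holds against what the index holds). This is not Scrutineer's implementation, and the table, index, and connection details are invented:

    # Rough consistency check between MySQL and Elasticsearch: the idea behind
    # tools like Scrutineer, not its actual implementation. All names are invented.
    import pymysql
    import requests

    ES = "http://localhost:9200/myindex/mytype"
    conn = pymysql.connect(host="localhost", user="app", password="secret", db="mydb")

    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM mytable")
        db_count = cur.fetchone()[0]
    es_count = requests.get(ES + "/_count").json()["count"]
    print("rows in mysql: %d, docs in elasticsearch: %d" % (db_count, es_count))

    # Spot-check a handful of primary keys: each should exist as a document ID.
    with conn.cursor() as cur:
        cur.execute("SELECT id FROM mytable ORDER BY RAND() LIMIT 10")
        for (row_id,) in cur.fetchall():
            if requests.head("%s/%s" % (ES, row_id)).status_code != 200:
                print("missing from index: %s" % row_id)
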

On Fri, Apr 20, 2012 at 7:14 PM, banderon1 <banderon1@gmail.com> wrote:

Could you point to some documentation about how to "slurp all that data up and spew it into Elasticsearch"? I'm a n00b, and have yet to find clear instructions in the ES documentation. I'm planning on using a PHP client (Elastica), but if that's not needed, let me know how you are doing it. Thanks!

On Saturday, April 21, 2012 8:09:25 AM UTC-7, kimchy wrote:

Read the data from MySQL and use the bulk API to index it into Elasticsearch.
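
For reference, the bulk API is just newline-delimited JSON POSTed to the _bulk endpoint: one action line followed by one source line per document. A tiny hand-rolled sketch (index, type, and field names are made up; the Elastica addDocuments call mentioned later in the thread wraps the same kind of request):

    # A bulk request built by hand: pairs of action + document lines, newline-terminated.
    # Index, type, and field names are made up.
    import json
    import requests

    rows = [
        {"id": 1, "title": "first record"},
        {"id": 2, "title": "second record"},
    ]

    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": "myindex", "_type": "mytype", "_id": row["id"]}}))
        lines.append(json.dumps({"title": row["title"]}))
    body = "\n".join(lines) + "\n"                     # the trailing newline is required

    resp = requests.post("http://localhost:9200/_bulk", data=body)
    print(resp.json()["items"])                        # one result entry per document
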

On Mon, Apr 23, 2012 at 2:21 PM, banderon1 <banderon1@gmail.com> wrote:

Thank you, I am finally starting to understand how this works! I am able to create an index and add documents; however, when loading the records in from MySQL, PHP is crashing due to excessive memory usage. I am able to dump the data into a JSON text file (using SELECT CONCAT()). Can I index this "static" data to avoid the extra work on the server?

On Mon, Apr 23, 2012 at 21:15, Michael Sick <michael.sick@serenesoftware.com> wrote:

It would help to know some details on how you're building the request. Which API are you using (Bulk/Index)? How many requests are you building up before submitting? ...

--Mike

On Monday, April 23, 2012 12:58:34 PM UTC-7, Ævar Arnfjörð Bjarmason wrote:

In my case the whole thing is:

- SELECT some stuff FROM table

- Read that from the database with an iterator and buffer up however many things make sense; in my case I buffer up 1-10k documents depending on what I'm inserting, but it doesn't really matter that much. The bulk API is mainly "bulk" in the sense that you can send a lot of stuff over at once; it doesn't allow Elasticsearch to do any special optimizations when inserting the data.

- If anything fails I try again 10 times, sleeping 1..10 seconds between retries, respectively. Due to optimistic concurrency control you don't even have to parse the response to see which things were successfully indexed and only retry the ones that weren't, but you can do it if you're so inclined.

All in all it's like any other program you'd write to move data between systems. Does that answer your question?
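
For anyone who wants the skeleton of that loop in code, here is a sketch of the same flow (stream rows, buffer a few thousand documents, send a bulk request, retry with a growing sleep). It is written in Python with pymysql and requests rather than the PHP/Elastica the original poster is using, and every name, credential, and number in it is a placeholder:

    # Skeleton of the cron job described above: stream rows out of MySQL, buffer a few
    # thousand documents, send each buffer with the bulk API, retry with backoff on
    # failure. Every name, credential, and number here is a placeholder.
    import json
    import time
    import pymysql
    import requests

    ES_BULK = "http://localhost:9200/myindex/mytype/_bulk"
    BATCH = 5000                                       # anywhere in the 1-10k range is fine

    def send_batch(lines, retries=10):
        body = "\n".join(lines) + "\n"
        for attempt in range(1, retries + 1):
            try:
                if requests.post(ES_BULK, data=body).status_code == 200:
                    return
            except requests.RequestException:
                pass
            time.sleep(attempt)                        # sleep 1..10 seconds between retries
        raise RuntimeError("bulk request failed after %d attempts" % retries)

    conn = pymysql.connect(host="localhost", user="app", password="secret", db="mydb",
                           cursorclass=pymysql.cursors.SSCursor)  # rows stream instead of loading at once

    buf = []
    with conn.cursor() as cur:
        cur.execute("SELECT id, title, body FROM mytable")
        for row_id, title, body_text in cur:
            buf.append(json.dumps({"index": {"_id": row_id}}))
            buf.append(json.dumps({"title": title, "body": body_text}))
            if len(buf) >= BATCH * 2:                  # two lines per document
                send_batch(buf)
                buf = []
    if buf:
        send_batch(buf)
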
Right now I am using the Elastica PHP client. I am calling addDocuments, which uses the bulk API. I have over a million 'documents' that need to be indexed.

On Mon, Apr 23, 2012 at 4:24 PM, banderon1 <banderon1@gmail.com> wrote:

That works, but I'm just trying to optimize the process. Since I already have the data in a JSON object, I'm wondering if I can skip MySQL in this case.
If MySQL is your system of record, it seems best to fetch the data from there, since ES doesn't participate in distributed transactions AFAIK and you may have a consistency issue.

I always follow Ævar's advice above and build in control over the size of the batch submitted. It's not surprising that grabbing a batch of 1M anything can challenge memory, as it's likely you have a few copies hanging around as you're serializing/deserializing. Using the bulk API seems to help with routing and performance on the server side, but I suspect it's memory hungry on the client side, as you're building a very large JSON document.

Then I'd try varying the batch size to see what's optimal (10k, 100k, ...). If it's likely that the big load scenario is rare and that you're only indexing some small percent of your data on a daily basis, then I'd recommend that you only optimize it enough to know that you could do a full recovery this way if you had to.

--Mike
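
One low-effort way to act on that last suggestion is to time the same sample of documents at a few different chunk sizes against a throwaway index and keep whichever batch size wins. A crude sketch, with the endpoint, sample data, and sizes all made up:

    # Crude timing of a few bulk batch sizes against the same sample, aimed at a
    # throwaway index. Endpoint, document shape, and sizes are all made up.
    import json
    import time
    import requests

    ES_BULK = "http://localhost:9200/throwaway/doc/_bulk"
    sample = [{"title": "doc %d" % i} for i in range(100000)]

    for batch_size in (1000, 10000, 100000):
        start = time.time()
        for i in range(0, len(sample), batch_size):
            lines = []
            for doc in sample[i:i + batch_size]:
                lines.append(json.dumps({"index": {}}))    # let Elasticsearch assign IDs
                lines.append(json.dumps(doc))
            requests.post(ES_BULK, data="\n".join(lines) + "\n")
        print("batch_size=%-6d %.1f seconds" % (batch_size, time.time() - start))
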