Hi all
I am new to Elasticsearch and currently trying to plan the best setup and avoid driving into a corner.
I have read most of the documentation, and this very informative blogpost: https://www.elastic.co/blog/found-keeping-elasticsearch-in-sync
We have a database of user profiles and what we want is the ability to search based on stuff like gender, height and geo location. The geo location was the demand that triggered me to go look for something like Elasticsearch.
I plan to update the location of active users to MySQL every 5 minutes. Apart from that, users can of cause edit there profile whenever they feel like it.
If I understand the blogpost right, I should register the unique id of a profile every time data, which is also in Elasticsearch, is changed. Lets say I have a separat table for that called es_edits
.
I then write a script to put in a cronjob. In this script I take out 1000 user ids from es_edits
, fetch the data for these, and bulk send this update to ES. Upon success, the 1000 ids is deleted from the es_edits
table.
The script continues to pull 1000 ids until the es_edits
table is empty.
Does this sound reasonable so far?
Another concern is ... how often can/should I run this cronjob? Especially with the geo locations in mind. I would like this information to be up-to-date in ES as soon as possible. But I also don't want to slow ES down with too many updates. What are the pitfalls?
I should mention that we plan to setup ES as a cluster on 3 webservers.
EDIT: I have consulted an ES expert who advised me not to use bulk updates in this case. I was under the impression that one should always use bulk updates, but ES is quite good at handling "on-the-fly" changes to documents. Therefore I have chosen to update whenever a device reports a new location.