We are currently working on an application with a large database and we want to use Elasticsearch as our search engine.
We are rather new to Elasticsearch, and although we have been reading the documentation, some things are still unclear to us, so I apologize in advance in case some of my questions seem trivial.
Currently we have a lot of data, and we will be getting new data once the new version goes live.
We have been researching and there doesn't seem to be consensus (that we could find) on the correct or best way to index the data from a SQL Server.
Our findings so far:
- Rivers: deprecated.
- JDBC Feeder: although not deprecated, it hasn't been updated to work with the latest Elasticsearch version. We understand the JDBC Feeder is a spare-time project of Jörg Prante, so most likely he won't be able to keep it up to date as new releases of Elasticsearch arrive. So it is not the best option.
- Logstash with JDBC Plugin: We have been looking into it and we are unsure of how this works. Logstash seems to be way more than we actually need, which is simply indexing some data into our cluster. If Item 1 had value A and now has value B, we only need to know that it has value B; we don't need versioning of the values of Item 1, as this example seems to suggest Logstash does. Also, if we understand correctly, we would need to cron a process to update the index, because Logstash doesn't do that automatically. Correct?
- Custom: Send data to Elasticsearch with custom code, using bulk operations in a process that runs periodically and keeps track of changes in the database via a timestamp or some other custom mechanism. This gives us the most freedom, but it also doesn't leverage the knowledge and work of everyone who has been using Elasticsearch for this same purpose, making our solution less efficient and maintainable than it could be.
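For reference, this is the kind of Logstash pipeline we have been experimenting with for the option above (the table, column, and index names are made up for illustration); if we read the plugin docs correctly, `schedule` does the periodic polling for you and `:sql_last_value` tracks what was already fetched, while setting `document_id` to the row id should make updates overwrite the old document rather than keep versions:

```
input {
  jdbc {
    jdbc_driver_library => "sqljdbc42.jar"   # path to the SQL Server JDBC driver
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=ourdb"
    jdbc_user => "logstash"
    schedule => "* * * * *"                  # cron-style polling, here every minute
    # Only fetch rows changed since the last run; the plugin persists :sql_last_value
    statement => "SELECT id, name, modified_at FROM items WHERE modified_at > :sql_last_value"
    use_column_value => true
    tracking_column => "modified_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "items"
    document_id => "%{id}"   # reuse the row id so re-indexing replaces the document
  }
}
```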
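To make the custom option concrete, here is a minimal sketch in Python of the timestamp-watermark plus bulk-batch idea we have in mind. The row source is faked with an in-memory list, and `fetch_changed_rows` / `build_bulk_actions` are names we made up; a real version would query SQL Server (e.g. via pyodbc) and hand the actions to the official client's `helpers.bulk`:

```python
from datetime import datetime, timezone

# Fake change log standing in for a SQL Server table with a modified_at column.
ROWS = [
    {"id": 1, "name": "Item 1", "modified_at": datetime(2016, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "name": "Item 2", "modified_at": datetime(2016, 1, 2, tzinfo=timezone.utc)},
]

def fetch_changed_rows(since):
    """Stand-in for: SELECT id, name, modified_at FROM items WHERE modified_at > ?"""
    return [row for row in ROWS if row["modified_at"] > since]

def build_bulk_actions(rows, index="items"):
    """Turn rows into bulk 'index' actions; indexing with the row id as _id
    overwrites the previous document, so only the latest value is kept."""
    return [
        {"_op_type": "index", "_index": index, "_id": row["id"],
         "_source": {"name": row["name"]}}
        for row in rows
    ]

# One polling cycle: everything after the last watermark becomes one bulk batch.
watermark = datetime(2016, 1, 1, tzinfo=timezone.utc)
actions = build_bulk_actions(fetch_changed_rows(watermark))
# A real run would now call elasticsearch.helpers.bulk(client, actions)
# and persist max(modified_at) of the batch as the new watermark.
```

The cron'd process would just repeat that cycle, so the only state to keep is the watermark.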
What is the general consensus on this? How do people index data from their database into Elasticsearch? Is there a "correct" way to do it?