We are currently working on an application with a large database and we want to use Elasticsearch as our search engine.
We are rather new to working with Elasticsearch, and although we have been reading the documentation, some things are still unclear to us, so I apologize in advance in case some of my questions seem trivial.
Currently we have a lot of data, and we will be getting new data once the new version goes live.
We have been researching, and there doesn't seem to be a consensus (that we could find) on the correct or best way to index data from SQL Server.
Our findings so far:
Rivers: deprecated.
JDBC Feeder: although not deprecated, it hasn't been updated to work with the latest Elasticsearch version. We understand JDBC Feeder is a spare-time project of Jörg Prante, so most likely he won't be able to keep it up to date as new releases of Elasticsearch arrive. So it is not the best option.
Logstash with JDBC Plugin: We have been looking into it and we are unsure of how it works. Logstash seems to be way more than we actually need, which is simply indexing some data into our server. If Item 1 had value A and now has value B, we only need to know it has value B; we don't need versioning of the values of Item 1, as this example seems to suggest Logstash does. Also, if I understand correctly, we would need to cron a process to update the index; Logstash doesn't do that automatically. Correct?
Custom: Send data to Elasticsearch with custom code, using bulk operations, with a process running periodically and keeping track of changes in the database using a timestamp or some other custom mechanism (a rough sketch of what we have in mind is below). This gives us the most freedom, but it also doesn't leverage the knowledge and work of everyone who has been using Elasticsearch for this same purpose, so our solution may not be as efficient and maintainable as it could be.
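To illustrate, this is roughly the kind of process we have in mind for the custom option. It is only a sketch: the table, column names, index name, and connection details are placeholders, and it assumes pyodbc plus the official elasticsearch Python client.

```python
# Sketch of the "custom" option: periodically bulk-index rows changed since the last run.
# Table/column/index names and connection strings below are placeholders, not our schema.
import pyodbc
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
db = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)

def sync_changes(last_run):
    """Index every row modified since the previous run using the bulk API."""
    cursor = db.cursor()
    cursor.execute(
        "SELECT Id, Name, Description FROM Items WHERE ModifiedAt > ?",
        last_run,
    )
    actions = (
        {
            "_index": "items",
            "_id": row.Id,
            "_source": {"name": row.Name, "description": row.Description},
        }
        for row in cursor
    )
    helpers.bulk(es, actions)

# A scheduler (cron, Windows Task Scheduler, etc.) would call sync_changes periodically,
# persisting the timestamp of the last successful run somewhere durable.
```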
What is the general consensus for this, how do people index data from their database in Elasticsearch? Is there a "correct" way to do it?
I worked on a project that did something similar. The thing we realized after a lot of back and forth is that if your data does not contain any 1:N relationships that would require the use of the nested datatype inside the ES index, then Logstash with the JDBC plugin is one of the more maintainable solutions. It's quick and easy to set up. Things get complicated if your data in SQL Server has a lot of 1:N or even M:N relationships. In such a complex setup, you might want to consider an approach involving custom code using a queuing mechanism like Kafka...
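If you do go the Logstash route, a minimal pipeline looks something like this. Treat it as a sketch: the driver path, connection string, query, and index name are placeholders to adapt. The `schedule` option also answers your cron question, since the JDBC input can run the query on a cron-like schedule by itself.

```
input {
  jdbc {
    jdbc_driver_library => "/path/to/sqljdbc42.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://localhost:1433;databaseName=MyDb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    # Run every minute; only pick up rows changed since the last successful run.
    schedule => "* * * * *"
    statement => "SELECT Id, Name, Description FROM Items WHERE ModifiedAt > :sql_last_value"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "items"
    # Column names are lowercased by the jdbc input by default, hence %{id}.
    document_id => "%{id}"
  }
}
```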
Hi Dadoonet, I saw your article previously. We couldn't go for that implementation because we wouldn't want to prevent the user from adding data just because Elasticsearch is down; that could potentially get Elasticsearch out of sync with the database, and then we would have to use some alternative process to sync the records, which is basically the general approach we want to take anyway.
Hi Rory, thanks for your answer. We would have to do a couple of joins in our queries, but we don't need nested types. Would you consider Logstash the option best suited to our scenario then?
Instead of writing to Elasticsearch directly, write to a transactional message queue system. I have been using ActiveMQ, IIRC.
I also added a database checker, which basically counted the number of entities in both systems. If needed, I would reindex a given period of time.
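Conceptually, the check is just a comparison of counts on both sides. A sketch of the idea (with made-up table/index names, assuming pyodbc and the elasticsearch Python client) could look like this:

```python
# Conceptual sketch of a consistency checker: compare row count vs. document count.
import pyodbc
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
db = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;UID=user;PWD=secret"
)

def check_counts():
    sql_count = db.cursor().execute("SELECT COUNT(*) FROM Items").fetchone()[0]
    es_count = es.count(index="items")["count"]
    if sql_count != es_count:
        # Counts diverged: trigger a reindex of the suspect time window.
        print(f"Out of sync: SQL Server has {sql_count} rows, Elasticsearch has {es_count} docs")

check_counts()
```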
If the tables are correctly indexed in SQL Server, then the number of joins should not be a problem. I'd suggest batching the updates via Logstash to make sure the process does not spend too much time querying SQL Server for the result.
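The JDBC input plugin can page through the result set for you; something along these lines (the page size here is just an example value to tune):

```
input {
  jdbc {
    # ... connection settings as before ...
    jdbc_paging_enabled => true
    jdbc_page_size => 10000   # fetch the result set in chunks of 10k rows
    statement => "SELECT Id, Name, Description FROM Items WHERE ModifiedAt > :sql_last_value"
  }
}
```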
Hi Rory, we are analyzing whether we will run a query directly or first denormalize the data with another process. But we have one main concern: how do you go about deleting data from Elasticsearch? How do people usually approach that issue using Logstash?
I have been reading posts, and one suggestion is to add a status field so you don't really delete items. Is that the only way? Isn't there a way to make an output in Logstash that will tell Elasticsearch to delete items?
Hi Dadoonet, thanks for your answer.
Right now we have another application that we will probably update to use Elasticsearch. The way the data is indexed in that application is similar to what Rory suggested with Logstash.
My question is: if we don't care so much about the data not being immediately available in the index, why would you refactor a large legacy application, with all the testing that such a change would require, to completely change its search approach?
Are there any other major drawbacks to using Logstash with JDBC for sending the data to Elasticsearch?
We want to do what is best in terms of performance and maintainability. In this scenario, would you still switch to the approach you describe in your article?
You can use Logstash to delete data from ES. All you need to do is specify the action as delete in the output section, as follows: action => "delete", together with the document_id of the document you want to delete.
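Something along these lines, assuming your SELECT exposes a deleted flag as a string field (the field and index names here are just examples):

```
output {
  # Route rows flagged as deleted to a delete action, everything else to a normal upsert.
  if [deleted] == "true" {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "items"
      action => "delete"
      document_id => "%{id}"
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "items"
      document_id => "%{id}"
    }
  }
}
```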
Thanks a lot! That was one thing that was troubling us; we don't like having many tools/processes in place for the same thing. Being able to push the deletes to Elasticsearch that way is rather convenient.
Adding your comments to dadoonet's and to our own research, we think that in our case the best-fitting solution is to use Logstash.
Thanks a lot again!