I'm going to paste from "Advanced Search for Your Legacy Application".
You can use an ETL to read your database again and inject documents into Elasticsearch. But you have to think about keeping everything in sync. For example, when you remove an object from the database, you need to remove it from Elasticsearch as well. You can use a technical table for that, which will contain something like a date, the id of the person which has been modified, and the type of action (update, delete, ...).
This way the ETL can read the same table again every x minutes, do what is needed, and then remove from this technical table all the objects that have been processed so far.
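A minimal sketch of that polling pass, with in-memory stand-ins (the `change_log` list plays the technical table, the `es_index` dict plays Elasticsearch; all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    date: float        # when the row changed
    person_id: int     # id of the person that was modified
    action: str        # "index" (add/update) or "delete"

change_log: list[ChangeEvent] = []   # stands in for the technical table
es_index: dict[int, dict] = {}       # stands in for the Elasticsearch index

def load_person(person_id: int) -> dict:
    # In a real ETL this would read the full object graph from the database.
    return {"id": person_id, "name": f"person-{person_id}"}

def run_etl_pass() -> None:
    """One pass of the ETL: apply pending changes, then clear what was processed."""
    processed = list(change_log)
    for event in processed:
        if event.action == "delete":
            es_index.pop(event.person_id, None)
        else:
            es_index[event.person_id] = load_person(event.person_id)
    # Remove from the technical table everything processed so far.
    del change_log[:len(processed)]
```

A scheduler would simply call `run_etl_pass()` every x minutes.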
Remember that reading a database again has a cost on the database side, especially if you have a complicated model with collections of collections of collections of attributes.
Also, if you need to offer your users real-time or near real-time search, that won't be possible with an ETL which runs every x minutes.
You can also send your data to a message queue system like Kafka, RabbitMQ... and then read from it using Logstash (or Filebeat), or your own application layer...
1-Modification of the application layer
That's again what I recommend. That way you have full control over what is happening. On the cons side:
- Dealing with transactions. Elasticsearch is not transactional, so what happens if something fails in Elasticsearch but succeeds in the database (or the other way around)?
- Related to this: how do you make sure things are always in sync?
- You have to write the code: the transformation to JSON and the writes to Elasticsearch.
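A sketch of this option: a repository that writes to the database and indexes into Elasticsearch in the same code path. The `PersonRepository` name and the two backing dicts are hypothetical stand-ins; real code would call your DB layer and an Elasticsearch client.

```python
class PersonRepository:
    def __init__(self) -> None:
        self.db: dict[int, dict] = {}   # stands in for the relational database
        self.es: dict[int, dict] = {}   # stands in for the Elasticsearch index

    def save(self, person: dict) -> None:
        self.db[person["id"]] = person
        try:
            # The entity is already in memory: just serialize and index it.
            self.es[person["id"]] = dict(person)
        except Exception:
            # Elasticsearch is not transactional: decide here whether to
            # roll back the DB write, retry, or log for later reconciliation.
            raise

    def delete(self, person_id: int) -> None:
        self.db.pop(person_id, None)
        self.es.pop(person_id, None)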
2-Read the database again with an ETL
Less code to write. And because you are reading the database, Elasticsearch will be consistent with what was in the database at the time you read it.
On the cons side:
- What has been updated in the DB? You need to add a timestamp to your objects, like a last modification date, so you only fetch the latest changes and not everything again and again.
- Dealing with deletes. You will need to maintain a deletion table with all the ids that need to be deleted in Elasticsearch.
- The cost of reading the database again. That could put a huge load on the database. Compare that with the application-layer option, where you already have the entity in memory and just have to serialize it as JSON while it's there...
- Real time. Having a batch running every five minutes might kill the use case. If real-time search is important for your users, you have to think about it.
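The first two cons above can be sketched as timestamp-based incremental fetching plus a deletion table. Column names (`last_modified`, `deleted_at`) are hypothetical, and `rows`/`deletions` stand in for SQL query results:

```python
def fetch_changes(rows: list[dict], deletions: list[dict], since: float):
    """Return (rows updated after `since`, ids deleted after `since`)."""
    updated = [r for r in rows if r["last_modified"] > since]
    deleted_ids = [d["id"] for d in deletions if d["deleted_at"] > since]
    return updated, deleted_ids

def apply_changes(es_index: dict[int, dict],
                  updated: list[dict],
                  deleted_ids: list[int]) -> None:
    """Apply one batch to the (simulated) Elasticsearch index."""
    for row in updated:
        es_index[row["id"]] = row
    for person_id in deleted_ids:
        es_index.pop(person_id, None)
```

Each batch would then record the highest timestamp it saw and use it as `since` on the next run, so only the delta is fetched.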
Just some thoughts...