I have a beginner question about a setup where we have to index a PostgreSQL database with Elasticsearch.
Is it necessary to use an intermediary solution between PostgreSQL and Elasticsearch? If so, what are the best practices for that intermediary solution (Logstash? Something else?)?
You need to send your existing data to Elasticsearch.
That means:
Read from the database (SELECT * FROM table)
Convert each record to a JSON document
Send the JSON documents to Elasticsearch, preferably using the _bulk API.
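For illustration, here is what those three steps could look like in Python with psycopg2 and the elasticsearch-py bulk helper. It's only a sketch: the person table, the index name and the connection settings are assumptions made for the example, not anything specific to your setup.

```python
# A minimal sketch of the three steps above. The "person" table, index name
# and connection settings are assumptions made for the example.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")

def generate_actions():
    # 1. Read from the database.
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute("SELECT * FROM person")
        for row in cur:
            # 2. Each record becomes a JSON document.
            yield {"_index": "person", "_id": row["id"], "_source": dict(row)}

# 3. Send the documents to Elasticsearch through the _bulk API.
helpers.bulk(es, generate_actions())
```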
Logstash can help with that. But I'd recommend modifying the application layer if possible and sending the data to Elasticsearch in the same "transaction" as you are sending it to the database.
If your data comes in small chunks, you can use a pandas DataFrame, do the whole ETL process there, convert it to JSON with one command, and use the bulk helper to load it into Elasticsearch.
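Something like this rough sketch (again, the table name, index and connection strings are made-up placeholders):

```python
# Rough sketch of the pandas route described above; table, index and
# connection strings are assumptions for the example.
import pandas as pd
from sqlalchemy import create_engine
from elasticsearch import Elasticsearch, helpers

engine = create_engine("postgresql://me:secret@localhost/mydb")
es = Elasticsearch("http://localhost:9200")

df = pd.read_sql("SELECT * FROM person", engine)       # the whole ETL in a DataFrame
records = df.to_dict(orient="records")                 # one command to get JSON-like dicts
helpers.bulk(es, ({"_index": "person", "_id": r["id"], "_source": r} for r in records))
```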
When you write that Logstash can help, does it mean that it is also possible to send data to Elasticsearch through Logstash? To fully understand, what are the pros and cons of both methods (1-Modification of the application layer / 2-Through Logstash)?
You can use an ETL to read your database again and inject documents into Elasticsearch. But you have to think about keeping everything in sync. For example, when you remove an object from the database, you need to make sure it is also removed from Elasticsearch. You can potentially use a technical table for that, containing something like a date, the id of the object that has been modified, and the type of action, like upsert or delete.
That way the ETL can read that table every x minutes, do what is needed, and then remove from the technical table all the entries that have been processed.
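To make that idea concrete, here is a hedged sketch of what such a technical table and the polling job could look like. The changelog table, its columns and the index name are assumptions, not a prescribed schema.

```python
# Sketch of an ETL pass over a hypothetical "changelog" technical table
# with columns (id, entity_id, action, modified_at). All names are assumptions.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")

def process_changelog():
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute("SELECT * FROM changelog ORDER BY modified_at")
        entries = cur.fetchall()
    if not entries:
        return

    actions = []
    for entry in entries:
        if entry["action"] == "delete":
            actions.append({"_op_type": "delete", "_index": "person", "_id": entry["entity_id"]})
        else:  # upsert: re-read the current row and index it
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute("SELECT * FROM person WHERE id = %s", (entry["entity_id"],))
                row = cur.fetchone()
            if row:
                actions.append({"_index": "person", "_id": row["id"], "_source": dict(row)})

    # raise_on_error=False so a delete of an already-missing document is tolerated.
    helpers.bulk(es, actions, raise_on_error=False)

    # Remove the entries that have been processed from the technical table.
    with conn.cursor() as cur:
        cur.execute("DELETE FROM changelog WHERE id = ANY(%s)", ([e["id"] for e in entries],))
    conn.commit()

# Run process_changelog() every x minutes from a scheduler or cron job.
```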
Remember that re-reading a database has a cost on the database, especially if you have a complicated model with collections of collections of collections of attributes.
Also, if you need to offer your users real-time or near real-time search, that won't be possible with an ETL that runs every x minutes.
You can also send your data to a message queue system like Kafka or RabbitMQ, and then read from it using Logstash (or Filebeat), or your own application layer...
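As a sketch of that pattern (topic, broker and index names are assumptions made for the example), the application could publish change events to Kafka and a small consumer could index them; that consumer part could just as well be Logstash with a kafka input.

```python
# Sketch: the application publishes change events to Kafka after writing to
# PostgreSQL, and a consumer indexes them into Elasticsearch. Topic, broker
# and index names are assumptions made for the example.
import json
from kafka import KafkaProducer, KafkaConsumer
from elasticsearch import Elasticsearch, NotFoundError

# Producer side (in the application layer, right after the DB write):
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
producer.send("person-changes", {"action": "upsert", "id": 42, "name": "Alice"})
producer.flush()

# Consumer side (could be replaced by Logstash reading the same topic):
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "person-changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for event in consumer:
    doc = event.value
    if doc["action"] == "delete":
        try:
            es.delete(index="person", id=doc["id"])
        except NotFoundError:
            pass  # already gone
    else:
        es.index(index="person", id=doc["id"], document=doc)
```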
1-Modification of the application layer
That's again what I recommend. That way you have full control over what is happening. On the cons side:
You have to deal with transactions. Elasticsearch is not transactional, so what happens if something fails in ES or in the DB?
Related to this: how do you make sure things are always in sync?
You have to write the code: the transformation code and the code that writes to ES.
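To illustrate what that code could look like (only a sketch; the entity shape, index name and error handling are assumptions), the application writes to PostgreSQL first and then indexes the same entity into Elasticsearch, deciding what to do when the second write fails:

```python
# Sketch of a dual write from the application layer: save to PostgreSQL,
# then index the same entity into Elasticsearch. Entity shape, index name
# and error handling strategy are assumptions for the example.
import logging
import psycopg2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")
log = logging.getLogger(__name__)

def save_person(person: dict):
    # 1. Write to the database (transactional).
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO person (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
            (person["id"], person["name"]),
        )
    conn.commit()

    # 2. Index into Elasticsearch (not transactional: if this fails you must
    #    decide how to reconcile, e.g. retry, or record it for a later sync).
    try:
        es.index(index="person", id=person["id"], document=person)
    except Exception:
        log.exception("Wrote to the DB but failed to index in ES; needs reconciliation")
```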
2-Through Logstash
Fewer things to write. And because you are reading the DB, ES will be consistent with what was in the DB at the time you read it.
On the cons side:
What has been updated in the DB? You need to add a timestamp to your objects, like a last modification date, to make sure you only fetch the latest data and not everything again and again (see the sketch after this list).
Deal with deletes. You will need to maintain a deletion table with all the ids that need to be deleted in ES.
The cost of reading the database again. That could put a huge load on the database. Compare that with the fact that you already have the entity in memory in your app and you just have to serialize it as JSON while it's there...
Real time. Having a batch run every five minutes might kill the use case. If real time is important for the user, then you have to think about it.
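Here is the sketch mentioned in the first con above: an incremental fetch driven by a last modification date. The last_modified column, the state file and the index name are assumptions for the example.

```python
# Sketch of incremental indexing based on a "last_modified" timestamp column.
# Column, index and state-file names are assumptions made for the example.
import json
import pathlib
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch, helpers

STATE_FILE = pathlib.Path("last_sync.json")

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")

def load_last_sync():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_modified"]
    return "1970-01-01T00:00:00"

def incremental_sync():
    since = load_last_sync()
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(
            "SELECT * FROM person WHERE last_modified > %s ORDER BY last_modified",
            (since,),
        )
        rows = cur.fetchall()
    if not rows:
        return

    helpers.bulk(
        es,
        ({"_index": "person", "_id": r["id"], "_source": dict(r)} for r in rows),
    )
    # Remember the newest timestamp we have seen, so the next run only
    # fetches what changed after it instead of everything again.
    STATE_FILE.write_text(
        json.dumps({"last_modified": rows[-1]["last_modified"].isoformat()})
    )

# Run incremental_sync() on a schedule, e.g. every five minutes.
```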