Best practices to handle JDBC (PostgreSQL) content?

Hello Everyone,

I have a beginner question about a setup in which we have to index the contents of a PostgreSQL database into Elasticsearch.

Is it required to use an intermediary solution between PostgreSQL and Elasticsearch? If so, what are the best practices to adopt for that intermediary solution (Logstash? Something else?)?

Thanks for any feedback,

Regards

Welcome Lucien! :slight_smile:

You need to send your existing data to Elasticsearch.
That means:

  • Read from the database (SELECT * FROM table)
  • Convert each record to a JSON document
  • Send the JSON document to Elasticsearch, preferably using the _bulk API (sketched below)
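Here is a minimal sketch of those three steps in Python, assuming a hypothetical person table and person index, psycopg2 for the database, and the official elasticsearch client; adjust names, columns, and connection settings to your setup:

```python
# Minimal sketch: read rows, convert them to JSON documents, and bulk-index them.
# The "person" table/index names and the connection settings are assumptions.
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
es = Elasticsearch("http://localhost:9200")

def generate_actions():
    with conn.cursor() as cur:
        cur.execute("SELECT id, name, email FROM person")
        for id_, name, email in cur:
            # Each row becomes one JSON document for the _bulk API.
            yield {
                "_index": "person",
                "_id": id_,
                "_source": {"name": name, "email": email},
            }

# The bulk helper batches the documents and sends them through the _bulk API.
bulk(es, generate_actions())
```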

Logstash can help with that. But I'd recommend modifying the application layer if possible and sending data to Elasticsearch in the same "transaction" as you are sending your data to the database.
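For illustration, a rough Python sketch of that application-layer idea: a save_person function (a hypothetical name, as are the table and index) writes to PostgreSQL and then indexes the same entity into Elasticsearch while it is still in memory; retries and error handling are left out.

```python
# Rough sketch of indexing from the application layer, in the same "transaction"
# as the database write. Names (save_person, person table/index) are hypothetical.
import psycopg2
from elasticsearch import Elasticsearch

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
es = Elasticsearch("http://localhost:9200")

def save_person(person: dict) -> None:
    # 1. Write to the database.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO person (id, name, email) VALUES (%(id)s, %(name)s, %(email)s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name, email = EXCLUDED.email",
            person,
        )
    conn.commit()

    # 2. Index the same entity into Elasticsearch while it is still in memory.
    #    Elasticsearch is not transactional, so a failure here must be handled
    #    (retry, dead-letter queue, ...), which is the main drawback discussed below.
    es.index(index="person", id=person["id"], document=person)
```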

I shared most of my thoughts there: Advanced Search for Your Legacy Application - -Xmx128gb -Xms128gb

Also have a look at this "live coding" recording.

There's also a live demo in French here:

If you know Python, you can use it for this.

If your data comes in small chunks, you can use a pandas DataFrame to do all of the ETL processing, convert it to JSON with one command, and use the bulk helper to load it into Elasticsearch.
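For example, a small sketch of that pandas route, assuming a hypothetical person table and index, SQLAlchemy for the connection, and the Python client's bulk helper:

```python
# Sketch of the pandas route for a small dataset: read the whole table into a
# DataFrame, turn it into JSON-like dicts in one call, and bulk-load it.
# Table/index names and the connection string are assumptions.
import pandas as pd
from sqlalchemy import create_engine
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

engine = create_engine("postgresql://me:secret@localhost:5432/mydb")
df = pd.read_sql("SELECT * FROM person", engine)

# One command to convert the DataFrame into a list of dicts (one per row).
records = df.to_dict(orient="records")

es = Elasticsearch("http://localhost:9200")
bulk(
    es,
    (
        {"_index": "person", "_id": row["id"], "_source": row}
        for row in records
    ),
)
```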

In a similar way, you can use something like Debezium or Hazelcast (which uses Debezium as well, AFAIK). We will have an online meetup in 2 weeks in French: Elastic Paris Meetup #67 : hazelcast + kibana = ❤️ pour observer vos données at Elastic Community Events France

And also this one from the same speaker in English: Low-code exploration and visualization of your data at Elastic Community Events EMEA Virtual

Hello David :slight_smile:

Thanks a lot for these useful insights!

When you write that Logstash can help, does it mean that it is also possible to send data to Elasticsearch through Logstash? To fully understand, what are the pros and cons of both methods (1 - modification of the application layer / 2 - through Logstash)?

Thank you in advance!

I'm going to paste from https://david.pilato.fr/blog/2015-05-09-advanced-search-for-your-legacy-application/

You can use an ETL to read your database again and inject documents into Elasticsearch. But you have to think about keeping everything in sync. For example, when you want to remove an object from the database, you need to deal with removing it from Elasticsearch as well. You can potentially use a technical table to do that, which would contain something like a date, the id of the person that has been modified, and the type of action, like upsert or delete.

This way the ETL can read the same table every x minutes, do what is needed, and also remove from this technical table all objects that have been processed so far.
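A hedged sketch of what such a polling pass could look like, assuming a hypothetical person_changes technical table with (change_date, person_id, action) columns; the real column names and the scheduling are up to you:

```python
# Sketch of an ETL pass over a hypothetical "person_changes" technical table:
# apply each pending upsert/delete to Elasticsearch, then clear the processed rows.
import psycopg2
from elasticsearch import Elasticsearch, NotFoundError

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
es = Elasticsearch("http://localhost:9200")

def process_pending_changes() -> None:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT change_date, person_id, action FROM person_changes ORDER BY change_date"
        )
        changes = cur.fetchall()

        for change_date, person_id, action in changes:
            if action == "delete":
                try:
                    es.delete(index="person", id=person_id)
                except NotFoundError:
                    pass  # already gone from the index
            else:  # upsert: re-read the current row and index it
                cur.execute("SELECT id, name, email FROM person WHERE id = %s", (person_id,))
                row = cur.fetchone()
                if row:
                    es.index(index="person", id=row[0], document={"name": row[1], "email": row[2]})

        if changes:
            # Remove everything that has been processed in this pass.
            cur.execute(
                "DELETE FROM person_changes WHERE change_date <= %s",
                (changes[-1][0],),
            )
    conn.commit()
```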

Remember that reading a database again might put a load on the database, especially if you have a complicated model with collections of collections of collections of attributes.

Also, if you need to offer your users real-time or near real-time search, that won't be possible with an ETL which runs every x minutes.

You can send your data to a message queue system like Kafka, RabbitMQ... and then read from it using Logstash (or Filebeat), or your own application layer...
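As an illustration of that message-queue variant, a small sketch using the kafka-python client to publish a change event that Logstash (or another consumer) could then pick up; the topic name and event shape are assumptions:

```python
# Sketch: publish entity change events to a Kafka topic instead of writing to
# Elasticsearch directly. The "person-changes" topic and event shape are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def publish_change(person: dict, action: str = "upsert") -> None:
    # A consumer (Logstash, your own app, ...) reads these events and updates Elasticsearch.
    producer.send("person-changes", {"action": action, "person": person})
    producer.flush()
```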

1-Modification of the application layer

That's, again, what I recommend. That way you have full control over what is happening. On the cons side:

  • You have to deal with transactions. Elasticsearch is not transactional, so what happens if something does not work in ES or in the DB?
  • Related to this: how do you make sure things are always in sync?
  • You have to write the code: the transformation code and the code that writes to ES.

2-Through Logstash

Fewer things to write. And because you are reading the DB, ES will be consistent with what was in the DB when you read it.

On the cons side:

  • What has been updated in the DB? You need to add a timestamp to your objects, like the last modification date, to make sure you only fetch the latest data and not everything again and again (see the sketch after this list).
  • Dealing with deletes. You will need to maintain a deletion table with all the ids that need to be deleted in ES.
  • The cost of reading the database again. That could put a huge load on the database, especially compared with the application-layer approach where you already have the entity in memory and just have to serialize it as JSON while it's there...
  • Real time. Having a batch running every five minutes might kill the use case. If real time is important for the user, then you have to think about it.
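To illustrate the first point, a small sketch of timestamp-based incremental fetching, whatever tool runs it: only rows whose (hypothetical) last_modified column changed since the previous run are pulled and re-indexed.

```python
# Sketch of incremental fetching based on a "last_modified" timestamp column
# (a hypothetical name): only rows changed since the previous run are re-indexed.
import datetime
import psycopg2
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
es = Elasticsearch("http://localhost:9200")

def sync_since(last_run: datetime.datetime) -> datetime.datetime:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name, email, last_modified FROM person WHERE last_modified > %s",
            (last_run,),
        )
        rows = cur.fetchall()

    bulk(
        es,
        (
            {"_index": "person", "_id": r[0], "_source": {"name": r[1], "email": r[2]}}
            for r in rows
        ),
    )
    # Remember the newest timestamp we saw so the next run starts from there.
    return max((r[3] for r in rows), default=last_run)
```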

Just some thoughts...

Thank you so much for this detailed feedback :slight_smile:
