I have a beginner question about a setup where we have to index a PostgreSQL database with Elasticsearch.
Is it necessary to use an intermediary solution between PostgreSQL and Elasticsearch? If so, what are the best practices for that intermediary solution (Logstash? Something else?)?
You need to send your existing data to Elasticsearch.
That means:
Read from the database (SELECT * FROM table)
Convert each record to a JSON document
Send the JSON documents to Elasticsearch, preferably using the _bulk API.
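For illustration, here is what those three steps could look like in Python with psycopg2 and the elasticsearch-py bulk helper. It's only a sketch: the person table, the index name and the connection settings are assumptions made for the example, not anything specific to your setup.

```python
# A minimal sketch of the three steps above. The "person" table, index name
# and connection settings are assumptions made for the example.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")

def generate_actions():
    # 1. Read from the database.
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute("SELECT * FROM person")
        for row in cur:
            # 2. Each record becomes a JSON document.
            yield {"_index": "person", "_id": row["id"], "_source": dict(row)}

# 3. Send the documents to Elasticsearch through the _bulk API.
helpers.bulk(es, generate_actions())
```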
Logstash can help with that. But I'd recommend modifying the application layer if possible and sending the data to Elasticsearch in the same "transaction" as you are sending it to the database.
If your data comes in small chunks, you can use a pandas DataFrame, do the whole ETL process there, convert it to JSON with one command, and use the bulk helper to load it into Elasticsearch.
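Something like this rough sketch (again, the table name, index and connection strings are made-up placeholders):

```python
# Rough sketch of the pandas route described above; table, index and
# connection strings are assumptions for the example.
import pandas as pd
from sqlalchemy import create_engine
from elasticsearch import Elasticsearch, helpers

engine = create_engine("postgresql://me:secret@localhost/mydb")
es = Elasticsearch("http://localhost:9200")

df = pd.read_sql("SELECT * FROM person", engine)       # the whole ETL in a DataFrame
records = df.to_dict(orient="records")                 # one command to get JSON-like dicts
helpers.bulk(es, ({"_index": "person", "_id": r["id"], "_source": r} for r in records))
```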
When you write that Logstash can help, does it mean that it is also possible to send data to Elasticsearch through Logstash? To fully understand, what are the pros and cons of both methods (1-Modification of the application layer / 2-Through Logstash)?
You can use an ETL to read your database again and inject documents into Elasticsearch. But you have to think about keeping everything in sync. For example, when you remove an object from the database, you need to make sure it is also removed from Elasticsearch. You can potentially use a technical table for that, containing something like a date, the id of the object that has been modified, and the type of action, like upsert or delete.
That way the ETL can read that table every x minutes, do what is needed, and then remove from the technical table all the entries that have been processed.
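To make that idea concrete, here is a hedged sketch of what such a technical table and the polling job could look like. The changelog table, its columns and the index name are assumptions, not a prescribed schema.

```python
# Sketch of an ETL pass over a hypothetical "changelog" technical table
# with columns (id, entity_id, action, modified_at). All names are assumptions.
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")

def process_changelog():
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute("SELECT * FROM changelog ORDER BY modified_at")
        entries = cur.fetchall()
    if not entries:
        return

    actions = []
    for entry in entries:
        if entry["action"] == "delete":
            actions.append({"_op_type": "delete", "_index": "person", "_id": entry["entity_id"]})
        else:  # upsert: re-read the current row and index it
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute("SELECT * FROM person WHERE id = %s", (entry["entity_id"],))
                row = cur.fetchone()
            if row:
                actions.append({"_index": "person", "_id": row["id"], "_source": dict(row)})

    # raise_on_error=False so a delete of an already-missing document is tolerated.
    helpers.bulk(es, actions, raise_on_error=False)

    # Remove the entries that have been processed from the technical table.
    with conn.cursor() as cur:
        cur.execute("DELETE FROM changelog WHERE id = ANY(%s)", ([e["id"] for e in entries],))
    conn.commit()

# Run process_changelog() every x minutes from a scheduler or cron job.
```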
Remember that re-reading a database has a cost on the database, especially if you have a complicated model with collections of collections of collections of attributes.
Also, if you need to offer your users real-time or near real-time search, that won't be possible with an ETL that runs every x minutes.
You can also send your data to a message queue system like Kafka or RabbitMQ, and then read from it using Logstash (or Filebeat), or your own application layer...
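As a sketch of that pattern (topic, broker and index names are assumptions made for the example), the application could publish change events to Kafka and a small consumer could index them; that consumer part could just as well be Logstash with a kafka input.

```python
# Sketch: the application publishes change events to Kafka after writing to
# PostgreSQL, and a consumer indexes them into Elasticsearch. Topic, broker
# and index names are assumptions made for the example.
import json
from kafka import KafkaProducer, KafkaConsumer
from elasticsearch import Elasticsearch, NotFoundError

# Producer side (in the application layer, right after the DB write):
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
producer.send("person-changes", {"action": "upsert", "id": 42, "name": "Alice"})
producer.flush()

# Consumer side (could be replaced by Logstash reading the same topic):
es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "person-changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for event in consumer:
    doc = event.value
    if doc["action"] == "delete":
        try:
            es.delete(index="person", id=doc["id"])
        except NotFoundError:
            pass  # already gone
    else:
        es.index(index="person", id=doc["id"], document=doc)
```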
1-Modification of the application layer
That's again what I recommend. That way you have full control over what is happening. On the cons side:
You have to deal with transactions. Elasticsearch is not transactional, so what happens if something fails in ES or in the DB?
Related to this: how do you make sure things are always in sync?
You have to write the code: the transformation code and the code that writes to ES.
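To illustrate what that code could look like (only a sketch; the entity shape, index name and error handling are assumptions), the application writes to PostgreSQL first and then indexes the same entity into Elasticsearch, deciding what to do when the second write fails:

```python
# Sketch of a dual write from the application layer: save to PostgreSQL,
# then index the same entity into Elasticsearch. Entity shape, index name
# and error handling strategy are assumptions for the example.
import logging
import psycopg2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")
log = logging.getLogger(__name__)

def save_person(person: dict):
    # 1. Write to the database (transactional).
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO person (id, name) VALUES (%s, %s) "
            "ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name",
            (person["id"], person["name"]),
        )
    conn.commit()

    # 2. Index into Elasticsearch (not transactional: if this fails you must
    #    decide how to reconcile, e.g. retry, or record it for a later sync).
    try:
        es.index(index="person", id=person["id"], document=person)
    except Exception:
        log.exception("Wrote to the DB but failed to index in ES; needs reconciliation")
```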
2-Through Logstash
Fewer things to write. And because you are reading the DB, ES will be consistent with what was in the DB at the time you read it.
On the cons side:
What has been updated in the DB? You need to add a timestamp to your objects, like a last modification date, to make sure you only fetch the latest data and not everything again and again (see the sketch after this list).
Deal with deletes. You will need to maintain a deletion table with all the ids that need to be deleted in ES.
The cost of reading the database again. That could put a huge load on the database. Compare that with the fact that you already have the entity in memory in your app and you just have to serialize it as JSON while it's there...
Real time. Having a batch run every five minutes might kill the use case. If real time is important for the user, then you have to think about it.
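Here is the sketch mentioned in the first con above: an incremental fetch driven by a last modification date. The last_modified column, the state file and the index name are assumptions for the example.

```python
# Sketch of incremental indexing based on a "last_modified" timestamp column.
# Column, index and state-file names are assumptions made for the example.
import json
import pathlib
import psycopg2
import psycopg2.extras
from elasticsearch import Elasticsearch, helpers

STATE_FILE = pathlib.Path("last_sync.json")

es = Elasticsearch("http://localhost:9200")
conn = psycopg2.connect("dbname=mydb user=me password=secret")

def load_last_sync():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_modified"]
    return "1970-01-01T00:00:00"

def incremental_sync():
    since = load_last_sync()
    with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
        cur.execute(
            "SELECT * FROM person WHERE last_modified > %s ORDER BY last_modified",
            (since,),
        )
        rows = cur.fetchall()
    if not rows:
        return

    helpers.bulk(
        es,
        ({"_index": "person", "_id": r["id"], "_source": dict(r)} for r in rows),
    )
    # Remember the newest timestamp we have seen, so the next run only
    # fetches what changed after it instead of everything again.
    STATE_FILE.write_text(
        json.dumps({"last_modified": rows[-1]["last_modified"].isoformat()})
    )

# Run incremental_sync() on a schedule, e.g. every five minutes.
```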