Synchronise websites and Elasticsearch

Hello everyone,

Do you have any idea how I can synchronise my websites (built with WordPress, Drupal, Joomla...) with my Elasticsearch cluster?

Thank you in advance.

You could maybe try to find a web crawler, but IMO it would be too generic.
I'd use dedicated connectors.

For example, here is an article for Drupal which can help you: http://redcrackle.com/blog/configuring-drupal-elasticsearch-facet-search-functionality

I hope this helps.

Thank you @dadoonet for your answer, I will read this article. Do you think it is better to use a web crawler or dedicated connector, or to do something directly in the database (triggers/transactions/logs)?

I always prefer sending the data to Elasticsearch within the same "transaction" that saves your data to the database.
I wrote an article about it: http://david.pilato.fr/blog/2015/05/09/advanced-search-for-your-legacy-application/
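
To make the idea concrete, here is a minimal sketch of writing to the database and to the search index as one unit of work. It is only an illustration under assumptions: the articles table, the save_and_index function and the index_doc callable (something that pushes one document to Elasticsearch) are hypothetical names, and sqlite3 stands in for whatever database the application really uses.

```python
import sqlite3
from typing import Callable, Dict


def save_and_index(conn: sqlite3.Connection,
                   doc: Dict[str, str],
                   index_doc: Callable[[Dict[str, str]], None]) -> None:
    """Save doc to the database and send it to the search index together.

    index_doc is a hypothetical callable that pushes one document to
    Elasticsearch; it is expected to raise an exception on failure.
    """
    try:
        conn.execute(
            "INSERT OR REPLACE INTO articles (id, title) VALUES (?, ?)",
            (doc["id"], doc["title"]),
        )
        # Index before committing: if indexing fails, the row is rolled back.
        index_doc(doc)
        conn.commit()
    except Exception:
        conn.rollback()
        raise
```

This is not a true distributed transaction: if the commit itself fails after indexing succeeded, the index holds a document the database does not, so some retry or cleanup step would still be needed.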

Another approach could be to reindex your whole system every night into another index and then switch the alias, but that is far from real time. I mean that it works well only if you don't care about updates to your DB during the day.

The closer you are to the application which is generating the data, the better.
So if you are using Drupal and have a connector for that, you should use it.
Same for other systems.

If you can't do that because there is no way to extend the application, then yes, you can use Logstash or elasticsearch-jdbc for that. Note that dealing with updates and deletes could be hard.
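
One reason deletes are hard with a query-based import: a SELECT only returns rows that still exist, so a row deleted from the database simply stops appearing and its document stays in the index. A rough sketch of one way to spot such leftovers, assuming you can list the ids on both sides (stale_ids is a hypothetical helper, not part of Logstash or elasticsearch-jdbc):

```python
from typing import Iterable, Set


def stale_ids(db_ids: Iterable[str], indexed_ids: Iterable[str]) -> Set[str]:
    """Return ids that are still in the index but no longer in the database."""
    return set(indexed_ids) - set(db_ids)
```

The returned ids are the documents that would have to be deleted from the index. Rebuilding a fresh index on every run avoids this bookkeeping altogether.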

HTH

@dadoonet Your article is very interesting :slight_smile:

I would like to set up this process :

  1. Run Logstash every * minutes with the jdbc plugin in order to do "select * from ..." and store the data in a new index_date index.
  2. When indexing is finished, add an alias such as "index_current" to that index; my application will run its searches against the alias.
  3. The next time Logstash runs, repeat the process: create a new index_date and, when indexing is finished, move the alias "index_current" to the new index and delete the oldest one.
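
The alias step of this process could be sketched as follows. The helper names (dated_index_name, alias_swap_body) and the timestamp format are assumptions made for illustration; the dictionary has the shape of a request body for the Elasticsearch _aliases endpoint, where the remove and add actions sent in one request are applied together.

```python
from datetime import datetime
from typing import Dict, List


def dated_index_name(prefix: str, when: datetime) -> str:
    """Build an index name such as index_201601020304 (format is an assumption)."""
    return f"{prefix}_{when:%Y%m%d%H%M}"


def alias_swap_body(alias: str, new_index: str, old_indices: List[str]) -> Dict:
    """Build the body for POST /_aliases that moves alias onto new_index."""
    # Detach the alias from every index that currently carries it...
    actions = [{"remove": {"index": old, "alias": alias}} for old in old_indices]
    # ...and attach it to the freshly built index in the same request.
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}
```

For example, alias_swap_body("index_current", "index_201601020304", ["index_201601020259"]) gives the body to send once the new index is complete. The old index can then be deleted with a separate request.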

Do you think it is a good idea?

I have one question: how can I know when Logstash has finished collecting the data?

Thank you in advance.

Do you think it is a good idea?

Yes.

How can I know when Logstash has finished collecting the data?

I think that Logstash will exit after the end of the job.

Look at the documentation: Jdbc input plugin | Logstash Reference [8.11] | Elastic

You can periodically schedule ingestion using a cron syntax (see schedule setting) or run the query one time to load data into Logstash.

And Jdbc input plugin | Logstash Reference [8.11] | Elastic

So if you don't set the schedule setting, your Logstash job will end after having processed all the data.
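
Since the process exits, a small wrapper script can wait for it and then carry on with the next step. A minimal sketch, assuming Logstash returns a zero exit code after a clean run; run_then and the on_success callback are hypothetical names:

```python
import subprocess
from typing import Callable, Sequence


def run_then(command: Sequence[str], on_success: Callable[[], None]) -> int:
    """Run a one-shot job, wait for it to exit, then call on_success if it returned 0."""
    # subprocess.run blocks until the child process has terminated.
    result = subprocess.run(command)
    if result.returncode == 0:
        on_success()
    return result.returncode
```

Usage could look like run_then(["bin/logstash", "-f", "jdbc.conf"], swap_alias), where jdbc.conf is your pipeline file without a schedule and swap_alias is whatever function moves the alias to the new index.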

Thank you again @dadoonet for your response, I checked and it's true, Logstash exits at the end of the job :slight_smile:

I think that is one of the best solutions for zero downtime. The other solution, as you mentioned, would be to run the Elasticsearch commands directly from the application at the same time as the database transactions (MySQL, Oracle...).

Thank you again for your advice @dadoonet