What code should I place into logstash.conf to automatically delete records with certain unique IDs from a certain ES index?

Hi,

I have an ES index that is populated with data by Logstash using the JDBC input plugin. The data source is SQL table1.
I also have a SQL table2 that stores the list of IDs that were deleted from SQL table1.
Now I want to tell Logstash that within each "sync" session (I mean a sync between SQL table1 and the ES index) it should get the list of IDs from table2, then find the documents with these IDs in the ES index and delete them there.

What code should I place into logstash.conf to automatically delete records from a certain ES index as described above?

Logstash processes filters in the configured order, but apart from that it has no notion of sequenced actions.

@magnusbaeck
Thanks for the quick reply. In that case, is there any other way to delete documents from the index based on a source list of IDs?
Maybe this way:

  • Create an index2 that will be synced with the SQL table2 that stores the deleted IDs
  • Perform the deletion at the ES level. For example, somehow automatically get the list of IDs from index2 and automatically delete the corresponding documents from index1.
    Is this possible? How would I configure it?

I don't see how your idea would solve the sequencing problem. At this point I don't have any good suggestions.

@magnusbaeck
Sorry, I made a mistake in the table numbering, which may have led to an incorrect understanding of what I am trying to do. I have corrected my first post above.
So, I want Logstash to do the following:

  • get the list of IDs from SQL table2
  • find the documents with those IDs in index1
  • delete the found documents from index1

From my point of view, sequencing is not important here. I have two parallel transfers:

  1. Regular incremental (using sql_last_value) data transfer from SQL table1 to index1
  2. Regular deletion of data from index1 by list of IDs supplied from SQL table2

If Logstash tries to delete a document with an ID that does not exist in index1 yet, that is not a problem. It will be able to delete it successfully in one of the next sync sessions, because the list of deleted IDs is captured on a non-incremental basis, i.e. always "select * from table2".
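
To illustrate, transfer #1 is essentially the standard incremental JDBC setup. A rough sketch — the connection settings, driver, and column names below are placeholders, not my exact config:

    input {
      jdbc {
        # Placeholder connection settings; the real ones differ.
        jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
        jdbc_user => "user"
        jdbc_password => "password"
        jdbc_driver_library => "/path/to/jdbc-driver.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        # Incremental read: only rows with an id above the last seen value.
        statement => "SELECT * FROM table1 WHERE id > :sql_last_value"
        use_column_value => true
        tracking_column => "id"
        schedule => "*/5 * * * *"
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "index1"
        document_id => "%{id}"
      }
    }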

Given the clarifications above, do I have a chance to configure Logstash to reach my goal?

Oh. In that case it sounds like all you need is a jdbc input and an elasticsearch output with action => "delete". If the id obtained from the SQL table isn't identical to the id of the ES document you need an elasticsearch filter as well. What kind of volumes are we talking about? Looking up every single event in ES is of course not very efficient.
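
Untested, but roughly like this — the connection settings, driver, and schedule below are placeholders you'd adapt to your setup:

    input {
      jdbc {
        # Placeholder connection settings; adjust for your database.
        jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
        jdbc_user => "user"
        jdbc_password => "password"
        jdbc_driver_library => "/path/to/jdbc-driver.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        # Non-incremental read of the deleted-IDs table, as described above.
        statement => "SELECT id FROM table2"
        schedule => "*/5 * * * *"
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "index1"
        # Delete the document whose _id equals the id column from table2.
        action => "delete"
        document_id => "%{id}"
      }
    }

If the SQL id didn't map directly to the ES document id, you'd add an elasticsearch filter between the input and output to look the document up first; since the IDs turn out to be identical here (see below), that step is omitted.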

@magnusbaeck
Sorry for the late answer. The IDs in SQL table1, SQL table2, and ES index1 are identical. In the SQL tables they are of type bigint, and in index1 they are mapped as "type": "long".
As for the deletion volumes:

  • SQL table1 will store roughly 100 million rows
  • ES index1 will store the same amount as SQL table1, i.e. roughly 100 million documents
  • SQL table2 will usually store about 100k IDs, and once every half-year it can hold 10 million IDs

So, the delete operation will usually require Logstash to get about 100k IDs from SQL table2, find them among the 100 million documents in ES index1, and delete the found documents from ES index1. As for the half-yearly peak, I understand that the Logstash transfer will take much more time than usual, and I'm OK with that.
