We have two SQL tables: FileDetail for storing file details and FileUserActivity for file activities. Using Logstash, we're indexing data into Elasticsearch with a flat index approach, combining file details and activities in a single document.
However, when file details change in the FileDetail table (e.g., the file owner is modified), we need a way to update all previously indexed Elasticsearch documents related to that file.
Database Tables:
FileDetail Table:
Columns: fileid, ..., owner, classificationId
FileUserActivity Table:
Columns: fileId, activityId, activityType, ...
Elasticsearch Index Structure:
Each Elasticsearch document combines fields from both tables, creating a flat structure.
Logstash Approach:
Logstash is employed to fetch data from both tables and merge it into a single document before indexing into Elasticsearch.
Requirement:
How can we efficiently update all previous Elasticsearch indices when file details change in the FileDetail table? The dataset grows by around 10 lakh (1 million) records every month.
Query:
If the owner of a file (identified by fileid) changes in the FileDetail table, what approach should be taken to ensure all historical Elasticsearch indices related to that file are updated accordingly?
Conclusion:
Seeking guidance on an effective solution to handle updates to Elasticsearch indices via Logstash. How should I design the Logstash pipeline so that it can handle this?
OK, so you have a denormalised view of these two tables in Elasticsearch.
If you want to efficiently update all related documents when the parent or child record changes through Logstash, I believe each document to be updated will need a unique document ID in Elasticsearch. Let's assume each FileDetail record has a PK called FileId and that each FileUserActivity record is identified through a UUID field. Each denormalised document can then be uniquely identified by an ID that is the concatenation of these two fields.
In order to capture changes, we assume you have (or will add) a timestamp column (Timestamp) on both tables that is set when the record is created and refreshed on every update, e.g. through a trigger; a sketch follows.
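How you maintain that column depends on your database. Here is a minimal sketch for SQL Server (the trigger name and DDL are assumptions, only the table and column names come from the schema above; other databases differ, and some, like MySQL with ON UPDATE CURRENT_TIMESTAMP, can do this declaratively without a trigger):

```sql
-- Hypothetical sketch: maintain a Timestamp column on FileDetail.
-- Repeat the same pattern for FileUserActivity.
ALTER TABLE FileDetail ADD [Timestamp] DATETIME2 NOT NULL
    CONSTRAINT DF_FileDetail_Timestamp DEFAULT SYSUTCDATETIME();
GO

CREATE TRIGGER trg_FileDetail_Touch ON FileDetail
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Refresh the timestamp on every update so the JDBC input sees the change.
    UPDATE fd
    SET [Timestamp] = SYSUTCDATETIME()
    FROM FileDetail fd
    JOIN inserted i ON fd.fileid = i.fileid;
END;
```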
You can now use a JDBC input in Logstash to capture any additions and changes to these two tables. It runs a join query that merges the two tables and concatenates the two ID fields into the unique identifier for the denormalised record, which Logstash sets as the document ID. The query selects only records where the greater of the two Timestamp fields exceeds the sql_last_value parameter, which ensures only added or altered records are processed; see the input sketch below.
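A minimal input sketch, assuming SQL Server; the driver, connection details, and the last_change alias are placeholders you would adapt:

```conf
input {
  jdbc {
    # Driver and connection details are assumptions; substitute your own.
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://dbhost;databaseName=files"
    jdbc_user => "logstash"
    jdbc_password => "${JDBC_PASSWORD}"
    schedule => "* * * * *"              # poll once a minute
    # Track the newer of the two timestamps so a change on either side is caught.
    use_column_value => true
    tracking_column => "last_change"
    tracking_column_type => "timestamp"
    statement => "
      SELECT fd.fileid, fd.owner, fd.classificationId, fd.created,
             fua.uuid AS activity_uuid, fua.activityType,
             CASE WHEN fd.[Timestamp] > fua.[Timestamp]
                  THEN fd.[Timestamp] ELSE fua.[Timestamp] END AS last_change
      FROM FileDetail fd
      JOIN FileUserActivity fua ON fua.fileId = fd.fileid
      WHERE (CASE WHEN fd.[Timestamp] > fua.[Timestamp]
                  THEN fd.[Timestamp] ELSE fua.[Timestamp] END) > :sql_last_value
      ORDER BY last_change"
  }
}
```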
If you want to combine this with a traditional time-based index, you can set @timestamp in Logstash using a date filter based on the created timestamp of the FileDetail record: that value never changes, so updates are routed to the same index the document was originally written to. A sketch of the filter and output side follows.
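The matching filter/output sketch, under the same assumptions (the created field, hosts, and index pattern are illustrative):

```conf
filter {
  # Base @timestamp on FileDetail.created: it never changes, so an updated
  # record is re-routed to the same monthly index it was first written to.
  date {
    match => ["created", "ISO8601", "yyyy-MM-dd HH:mm:ss"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "file-activity-%{+YYYY.MM}"
    # Concatenated unique ID: indexing the same FileId + UUID pair again
    # overwrites the existing document instead of creating a duplicate.
    document_id => "%{fileid}-%{activity_uuid}"
  }
}
```

With roughly a million records a month, this design keeps each owner change a set of cheap single-document overwrites: as long as the document ID and target index can be recomputed from the source row, no update-by-query across historical indices is needed.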