I am trying to understand how Logstash handles SQL data ingestion.
For example, Logstash can compare the key of a record to avoid ingesting duplicates, and it can use a version to update an existing record.
Does this mean Logstash does the following?
Step 1. Pull all records from a table.
Step 2. For each record, query the sink to see whether a record with the same key already exists and, if so, what its version is.
Step 3. Based on the result of that query, decide whether and how to ingest the record.
Or, instead of querying the sink for existing records, does Logstash maintain a local database and look up the existing records there?
I have been looking through the Logstash source code but have not found the answer.
No, that's not how it works. Logstash never checks for document existence in the output phase. What you typically do is set the document id to a suitable fixed id (e.g. the primary key(s) from the database). That way each update to an existing row in the source database will be directed to the same document in ES. If the document Logstash is trying to index is identical to what's already there, I can only assume that ES does nothing and returns immediately.
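To illustrate the idea, here is a minimal Python sketch. It is not Logstash or Elasticsearch code: `ToySink`, `ship`, and the version counter are hypothetical names and behavior made up for this example. It only shows why a fixed document id lets the shipper write blindly, leaving it to the sink to decide whether a write is a create or an update.

```python
from typing import Any, Dict, List, Tuple


class ToySink:
    """Hypothetical in-memory stand-in for a store addressed by document id."""

    def __init__(self) -> None:
        # doc_id -> (version, body)
        self.docs: Dict[str, Tuple[int, Dict[str, Any]]] = {}

    def index(self, doc_id: str, body: Dict[str, Any]) -> str:
        # The sink, not the shipper, decides between create and update.
        if doc_id in self.docs:
            version, _ = self.docs[doc_id]
            self.docs[doc_id] = (version + 1, dict(body))
            return "updated"
        self.docs[doc_id] = (1, dict(body))
        return "created"


def ship(rows: List[Dict[str, Any]], sink: ToySink, key: str = "id") -> List[str]:
    # The shipper never asks the sink whether a document exists;
    # it derives the document id from the row's primary key and writes.
    return [sink.index(str(row[key]), row) for row in rows]


sink = ToySink()
first = ship([{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}], sink)
second = ship([{"id": 1, "name": "alice v2"}], sink)  # same key, changed row
```

After both runs the toy sink still holds two documents, and the second write to key 1 has replaced the first instead of creating a duplicate.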
This behavior assumes that the output (e.g. ES) complies with some contract. For example, Logstash assumes that the output checks whether a record exists by document id, or compares the new record with the existing record that has the same document id and decides whether to do nothing or to create a new version of the record (version + 1).
Since Logstash has many output plugins, is there a document describing the contract these output plugins must comply with, so that a source cannot end up working with one output but not with others?
If I need to implement an output plugin, is there any documentation on the development process? Thanks.
Since Logstash has many output plugins, is there a document describing the contract these output plugins must comply with, so that a source cannot end up working with one output but not with others?
Logstash has no idea whether a particular output supports duplication avoidance. That's entirely up to the plugin. Therefore there is no contract and no documentation.
If I need to implement an output plugin, is there any documentation on the development process? Thanks.
Yes, the Logstash documentation contains a section or two on plugin development.