Logstash Performance comparison _update_by_query v/s pulling data from DB via SQL

I have created 2 logstash pipelines to index docs in Index A and Index B. I am pulling data from 2 tables in Oracle DB and indexing it in Index A & B respectively.

The requirement is to update all docs in Index B for a given attribute value term/filter, if the attribute changes in Index A

I am trying to trigger one to many bulk updates to docs in Index B using _update_by_query & HTTP output plugin in logstash i.e. if a doc updates in Index A , I want to update multiple docs in Index B.

Approach# 1
Here is what i have done to achieve this i Index A:

          filter {
             ruby {
    		code => "		
    			boolArr = [
    				{
              			'term'=> {
    						'rootAttr.childAttr1.keyword' => {
    							'value' => event.get('[rootAttr][childAttr1]')
    						}
    					}
    				},
    				{
              			'term'=> {
    						'rootAttr.childAttr2.division.keyword' => {
    							'value' => event.get('[rootAttr][childAttr2]')
    						}
    					}
    				}
    			]			
    			      		
          		event.set('[query][bool][must]', boolArr )
          		event.set('[script][params][newRootAttr]', event.get('rootAttr') )
          		     		      		      		
    			"
    	}   
      mutate {
        add_field => {
          "[script][lang]" => "painless"
          "[script][source]" => "ctx._source.rootAttr = params.newRootAttr"
        
        }
      }
    }
    output {

      http {
        url => "${elasticsearch_hosts}/index_B/_update_by_query?conflicts=proceed"
        http_method => "post"
        format => "json"
      }
    }

Approach# 2:
Pull data from DB based on greatest value of tracking column from both the tables & re-index docs in Index B

I want to understand what is the best approach out of above 2 approaches from performance and error handling perspective.

Thanks,
Mo

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.