Duplication due to rollover policy in Kibana, data coming from Logstash pipeline

Hello All,

I am ingesting data into a rollover index. The data comes from Perl scripts that run every 15 min, 20 min, 4 hours, 12 hours, 16 hours, etc., and it updates previously ingested events through doc_as_upsert in Logstash.
Each event has a unique field (ex: service name), and I am using this as the document ID.
However, I've recently come across an issue where the same event is listed multiple times, with each copy existing in a separate index.
Is there any way to avoid searching the data in previous rollover indices so that only the current index data is read?

Example of the issue faced:

I have a use case where I track the status of a service and show its count. On the current day the count is correct, but once the index rolls over the count multiplies. Here the upsert is done on the service name.
The Perl script is executed every 30 minutes and the data is always updated with the latest count, but because a new rollover index is created daily, the counts in the Kibana visual keep multiplying. I face this issue in multiple use cases and am not sure how to handle it through the ILM policy or Logstash.

The idea is to maintain a single policy for all our use cases, but because rollover is used along with upsert, data duplication occurs. What is the best way to handle this?

PUT _ilm/policy/cis-monitor-reportorder-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d"
          },
          "set_priority": {
            "priority": null
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          }
        }
      }
    }
  }
}

Logstash Config:

input {
   exec {
      command => '. ../scripts/reportordermonitor/run_reportordermonitor.sh'
      schedule => "0 */2 * * * *"
   }
}

filter {
   if [message] =~ "^\{.*\}[\s\S]*$" {
      json {
         source => "message"
         target => "parsed_json"
         remove_field => "message"
      }

      split {
         field => "[parsed_json][ReportOrderMonitorReponse]"
         target => "reportorder"
         remove_field => [ "parsed_json" ]
      }
   }
   else {
      drop { }
   }
}

output {
   elasticsearch {
      hosts => "http://abc:9200"
      ilm_pattern => "{now/d}-000001"
      ilm_rollover_alias => "cis-monitor-reportorder"
      ilm_policy => "cis-monitor-reportorder-policy"
      doc_as_upsert => true
      document_id => "%{[reportorder][recordUniqueId]}"
   }
}

Guidance and help would be highly appreciated, as I require an early resolution due to delivery constraints.

Many Thanx

The _id is unique only on a per-index basis; different indices can have the same _id.

Since you have rollover enabled, when your index rolls over the write alias will change, and you may index a document with the same _id as one that already exists in a previous index.

If you want to have a unique _id across multiple indices you will need to change the way you index the data: first search all the indices where the _id must be unique to get the index name, and then, if it is found, index the document into that index, along the lines of the sketch below.
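For illustration, that lookup could be a simple ids query against all the backing indices (untested sketch; the index pattern below is taken from the alias in your pipeline and <recordUniqueId> is a placeholder for the id you are about to upsert):

GET cis-monitor-reportorder-*/_search
{
  "_source": false,
  "query": {
    "ids": {
      "values": [ "<recordUniqueId>" ]
    }
  }
}

Each hit in the response carries an _index value, which tells you which backing index already holds that _id and therefore where the update should be written.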

Hello @leandrojmp ,

Thanx for your quick response.
Actually, my issue would be resolved if somehow only the latest rollover index were read and reads from previous rollover indices were avoided.
Initially I thought I would create an ILM policy that rolls over every day and deletes the index one second after rollover, so that when the script runs again the data would land in the newly created index.
The reason not to do the above is that I need consistency: the same policy has to be applied to the indices of all my other use cases.
Here all the documents are getting updated, and hence past data is no longer required.

I am still unclear on how to resolve or best address this issue.

Many Thanx

So you do not want any of the data that you are rolling over? Why are you using rollover at all?

Hello @Badger ,

So I run a few Perl scripts for each of my use cases; some use cases require data to be maintained for, say, X number of days and not afterwards.
Ex: the last 7 days of data should be maintained and the rest is not required.
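Something along these lines is what I have in mind for such a use case (just a sketch mirroring the policy above; the policy name is only an example):

PUT _ilm/policy/cis-monitor-7day-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d"
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}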

Secondly, for some use cases I am using upsert, where the documents themselves get updated, and again this data should be visible for, e.g., 15 days and not required afterwards.
I'm using an ILM policy so that after a specific number of days the data gets deleted and new data is written to a new index again.
The only issue I'm facing is when this upsert logic is used in the Logstash output: as the same documents are present in different rollover indices, there is duplication.

As per the requirement, whether upsert is used or not, only one policy should be maintained across the different use cases. I'm not sure whether, if no policy is made, all the write operations would be done on a single index and I would need to delete the old data manually?

It would be helpful to know your approach to resolving this, taking the example data above into consideration.
The data I receive is mostly statistics data or data from a database that keeps getting updated; depending on the use case, the script runs every few minutes or on an hourly basis each day, and the data is then received by Elasticsearch.

Thanx

This is not possible, but I don't think that it would fix your issue.

You didn't give much context on how you are indexing your data. Are you querying your data in Elasticsearch from your script to know which _id to update? It is not clear.

The main issue here is that _id values are unique per index only; if you are using a rollover policy you will have multiple indices and end up with duplicates, and there is no configuration in the ILM policy that solves this.

If you can't use just one index, you will need to query your indices for each _id and, if it is found, write the data to the correct index. You will need to change your Logstash pipeline and add some conditionals to help you do that, along the lines of the sketch below.
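As an untested sketch of what I mean (the index pattern is assumed to match the backing indices behind your rollover alias, the id value has to be safe to use in a query string, and the lookup adds one extra query per event):

filter {
   # ask Elasticsearch whether this id already exists in any rollover index
   elasticsearch {
      hosts => ["http://abc:9200"]
      index => "cis-monitor-reportorder-*"
      query => "_id:%{[reportorder][recordUniqueId]}"
      docinfo_fields => { "_index" => "[@metadata][existing_index]" }
   }
}

output {
   if [@metadata][existing_index] {
      # the document already exists: update it in place in that index
      elasticsearch {
         hosts => "http://abc:9200"
         index => "%{[@metadata][existing_index]}"
         action => "update"
         doc_as_upsert => true
         document_id => "%{[reportorder][recordUniqueId]}"
      }
   } else {
      # new document: write it through the rollover alias as before
      elasticsearch {
         hosts => "http://abc:9200"
         ilm_rollover_alias => "cis-monitor-reportorder"
         ilm_pattern => "{now/d}-000001"
         ilm_policy => "cis-monitor-reportorder-policy"
         doc_as_upsert => true
         document_id => "%{[reportorder][recordUniqueId]}"
      }
   }
}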

Hello @leandrojmp ,

"Are you querying your data in Elasticsearch from your script to know which _id to update? It is not clear."
YES, I'm querying the data in Elasticsearch through the script, and the document with the unique id "service name" is always updated whenever anything changes, through doc_as_upsert in Logstash.
I understood your point: there is no alternative apart from maintaining a single index to avoid rollover duplication.

Thanx for looking into this.

@gbrown
I read this post of yours where you suggested one solution, but I'm not sure how it can be done.
How can this be configured in the Logstash pipeline, so that updates happen in the rollover index that already holds the document and only new documents go into the present index?

I also have a somewhat similar requirement and have been looking for a solution on the internet for a while. I somehow landed on the thread below:

The solution/suggestion I found in one comment that caught my attention was "making a custom logstash filter that sends a query to related indices by wildcard (logstash-*) for records with that _id, and if there are existing ones that do not belong to the current day's index, it will remove them and insert the current event into the current index instead - a bit clumsy but works".
@Badger, I just want to know whether there is a way the above solution is configurable in Logstash. If yes, then how? I am a bit skeptical of the solution even if it is possible, because looking into indices, searching, and deleting might have consequences for the performance of Logstash. Correct me if I am wrong. I would love to know your take on this.

I do not run elasticsearch, but I think it might work (I do not know if the _delete API accepts a wildcard index). Obviously adding a round-trip to elasticsearch will add cost to the processing of every event.

There is an example of using an http filter to make a call to elasticsearch here.
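If delete by query does accept an index pattern, the raw request that such a filter would need to send might look something like this (untested sketch; the index pattern and the values in angle brackets are placeholders):

POST cis-monitor-reportorder-*/_delete_by_query
{
  "query": {
    "bool": {
      "filter": {
        "ids": { "values": [ "<recordUniqueId>" ] }
      },
      "must_not": {
        "term": { "_index": "<current write index>" }
      }
    }
  }
}

That would remove older copies of the document from previous indices, and the elasticsearch output would then write the current event to the current write index as usual.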
