Why is my latest rollover index collecting data from the beginning to the last entry?

Good day. I'm experiencing this scenario: the rollover index is collecting data from the beginning of the logs up to the newest/last entries. As I understand it, only the newest log lines should be written to the latest rollover index, and the old logs should remain in the old indices. Please advise.

This index (000019) also contains the logs from 2023-09-12. My index template is below, followed by a query that shows this.

{
  "index_templates" : [
    {
      "name" : "hourlyreport-template",
      "index_template" : {
        "index_patterns" : [
          "hourlyreport-*"
        ],
        "template" : {
          "settings" : {
            "index" : {
              "lifecycle" : {
                "name" : "default_policy",
                "rollover_alias" : "hourlyreport-alias"
              },
              "number_of_shards" : "1",
              "number_of_replicas" : "0"
            }
          },
          "mappings" : {
            "properties" : {
              "DATE" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              },
              "Hour" : {
                "type" : "integer"
              },
              "Count" : {
                "type" : "integer"
              },
              "tailed_path" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "ignore_above" : 256,
                    "type" : "keyword"
                  }
                }
              }
            }
          }
        },
        "composed_of" : [ ]
      }
    }
  ]
}
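(For reference, one way to confirm which DATE values a backing index actually holds is a terms aggregation on the DATE.keyword field from the mapping above; a minimal sketch, run from Kibana Dev Tools:)

GET hourlyreport-000019/_search
{
  "size" : 0,
  "aggs" : {
    "dates" : {
      "terms" : { "field" : "DATE.keyword" }
    }
  }
}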

Hello,

Can you provide more context? It is not clear what you mean by 'collecting the data from beginning to last'.

Also, what does your ILM policy look like? After how many days / GB are you rolling over your index?
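If it is easier, you can dump your ILM policies from Kibana Dev Tools with:

GET _ilm/policy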

Thank you for the reply, and sorry for the incomplete details. What I mean by 'collecting data' is that the index also writes the old data/logs. It should not re-write logs that have already been written to the index.

This is my latest test. Below is the original log from the dev server; it has only two log lines.

[screenshot: source log with two lines]

But when the update (Hour: 11) came in, it also wrote the Hour 10 entry to the index again; that is why I have two documents for Hour 10.

[screenshot: index contents showing two documents for Hour 10]

The policy is for testing only (I have multiple indices using this policy):

{
  "default_policy" : {
    "version" : 3,
    "modified_date" : "2023-09-11T02:16:15.408Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "500kb",
              "max_age" : "1h",
              "max_docs" : 500
            }
          }
        },
        "delete" : {
          "min_age" : "1d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      }
    },
    "in_use_by" : {
      "indices" : [
        "infocasthourly-000020"
      ],
      "data_streams" : [ ],
      "composable_templates" : [
        "default-template",
        "hourlyreport-template",
        "infocast-template"
      ]
    }
  }
}
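(As a side note, one way to check what ILM is doing with each backing index is the explain API; a minimal sketch, assuming the hourlyreport-* pattern from the template:)

GET hourlyreport-*/_ilm/explain

(It reports the current phase and action for each index, and whether the rollover conditions have been met.)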

In which backing index was the data marked in yellow written? You didn't share that, so it is a little hard to understand what the issue is.

How are you sending the data to Elasticsearch? Are you using custom IDs?

If the data is getting duplicated, it is probably an issue while sending the data; the rollover just points the alias to a new write index.
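You can see this by checking which backing index the alias currently writes to; only one index behind the alias has is_write_index set to true (alias name taken from your template; a quick check, not a fix):

GET _alias/hourlyreport-alias

When a rollover happens, is_write_index simply moves to the newly created index; nothing is moved or deleted from the older backing indices.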

Also, these settings are extremely inefficient, even for testing. Are you using them like this in production?

          "actions" : {
            "rollover" : {
              "max_size" : "500kb",
              "max_age" : "1h",
              "max_docs" : 500
            }
          }
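For comparison, production rollover conditions are usually on a much larger scale; something like the following is a common starting point (illustrative values only):

          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            }
          }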

I'm using td-agent as the log forwarder, and this rollover config is not used in production; it is just for rollover testing.

In this index (00001), when count 27 came in / was written, it also wrote counts 25 and 26 again. Then when count 28 came in, it also wrote counts 25, 26, and 27 again.

[screenshot: index contents showing the repeated counts]

It should not write the previous data again, because the source log has only 4 lines.

[screenshot: source log with four lines]

From what you shared I don't see any issue with rollover. As I mentioned, rollover has no control over whether the data will be duplicated or not; it just changes which index is the active write index.

It looks like you have an issue with the tool you are using to send the data. I have no experience with td-agent/fluentd/fluentbit, and you didn't provide any config or any information on how you are creating the source file, but it seems that it is reading the source file from the beginning every time.

For example, when you had 2 lines, it read the file and sent those 2 lines; then when you had 3 lines, it read the file from the beginning again and sent all 3 lines, and the same with the 4 lines. If that is what is happening, you would end up with 2 + 3 + 4 = 9 documents from a 4-line file.

You may test this by sending the data to a new test index without rollover configured, or even sending it to another output, like a file, if that is possible.
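For example, assuming a throwaway index name that does not match any of your templates (so it picks up no ILM settings), you could point the td-agent match at it and compare the document count with the number of lines in the source file:

PUT tdagent-test

GET tdagent-test/_count

If the count grows faster than the file (e.g. 1, then 3, then 6 documents for 1, 2, 3 lines), the duplication is happening on the sending side.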

Also, when you add a new line to the source file, are you appending the line to the file or overwriting the whole file?

Thank you for looking into this case :slight_smile:
I also don't see any issue in my template/mappings/rollover. I will try your suggestion of sending the data to a new test index without the rollover config. My source log appends new lines; it does not overwrite the log, it just adds a new line on every update.

Below is my td-agent.conf. I have 2 sources in this one conf, and I don't have any issue with the first config (report_daily.txt); it only writes the new data to the rollover index.

Just to show you the 1st config's result: it does not contain any duplicate values per index.

[screenshot: first index with no duplicate values]

<source>
  @type tail
  read_from_head true
  path <path>/report_daily.txt
  pos_file /tmp/TrnxCount.pos
  path_key tailed_path
  enable_stat_watcher false
  tag TrnxCount
  format /^(?<DATE>[^ ]*) (?<TransactionCount>[^|]*)/
  time_format %d/%b/%Y:%H:%M:%S
</source>

<match TrnxCount>
 @type elasticsearch
 host <IP>
 port 9200
 index_name <index>-alias
 type_name fluentd
 user <username>
 password <passwd>
</match>

<source>
  @type tail
  read_from_head true
  path <path>/Hourly_report.txt
  pos_file /tmp/HourlyReport.pos
  path_key tailed_path
  enable_stat_watcher false
  tag HourlyReport
  format /^(?<Date>[^ ]*)\-(?<Hour>[^ ]*) (?<Count>[^ ]*)$/
  time_format %Y-%m-%d
</source>

<match HourlyReport>
 @type elasticsearch
 host <IP>
 port 9200
 index_name HourlyReport-alias
 type_name fluentd
 user <username>
 password <passwd>
</match>
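(To see where the documents land after each update, the per-index doc counts can be checked from Dev Tools, assuming the hourlyreport-* pattern from the template; docs.count should only grow by the number of newly appended lines:)

GET _cat/indices/hourlyreport-*?v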

UPDATE:
I deleted all config in Kibana (template, index management, and data views) and emptied the source log again. When I updated the source log by adding just the first line, the docs count in Index Management went from 0 to 1. Then I updated the source log again, adding the 2nd line, and I saw a docs count of 3 in Index Management. This test has no template, rollover, or data views yet, just source logs straight to indices. Does this mean I have an issue in my td-agent.conf?
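(A quick way to inspect the duplicates in that test, assuming the throwaway index name from the earlier suggestion:)

GET tdagent-test/_search?filter_path=hits.total,hits.hits._id,hits.hits._source

(Each re-sent line shows up as a separate document with its own auto-generated _id; the fluentd elasticsearch output generates a new _id for every event unless its id_key option is set.)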
