Logstash Sending Old Data to ES

Hi,
My pipeline is: filebeat -> logstash -> ES -> kibana. I use it to read CSV files from a folder and index/mirror any changes or updates in Kibana in near real time.

Problem 1:
My test CSV file has 16 entries, but only 3 are reaching Kibana; I see only 3 entries in Kibana's ES index management section.

Problem 2:
I'm using one of the fields as a fingerprint to guarantee that only one instance of each entry is sent to ES. It hasn't worked so far.

Problem 3:
When I fire up the pipeline, Logstash sends old data to ES, data that has already been indexed before, even though the CSV file is no longer present in the specified folder. This tells me the old data is cached somewhere in the pipeline. Even after I delete everything in Filebeat's 'data' folder, this still happens.
I have read and experimented a lot with how Filebeat and Logstash work, and with how and why the files in the 'data' folder are created.
Why is old data being sent from Logstash to ES?

Question:
I terminate everything in my pipeline with Ctrl+C from the PowerShell window. That writes some files so that Logstash starts from where it left off the next time it is initiated. Could this be the cause of problem 3? Is there a recommended way to terminate them?
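For reference, this is roughly how I reset state between test runs; the paths below are just where my installations happen to live, so treat them as assumptions:

# stop Filebeat and Logstash first (Ctrl+C in their PowerShell windows)
# Filebeat keeps its read offsets (the registry) under its data folder;
# deleting it makes Filebeat re-read every file from the beginning on the next start
Remove-Item -Recurse -Force "C:\filebeat\data\registry"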

Thank you very much.

Please show your full configuration. Without this it is very hard to help.

Hello Christian, thank you for the reply. Below are my configs:

elasticsearch.yml: all commented out.

kibana.yml: all commented out.

filebeat modules: all disabled.

filebeat.yml:
############################
filebeat.inputs:
- type: log
  paths:
    - C:/filebeatTestPath/*.csv

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true

output.logstash:
  hosts: ["localhost:5044"]
#########################

logstash.config:
#########################
input {
  beats {
    port => 5044
  }
}

filter {
  csv {
    columns => ["col name 1","col name 2","col name 3","col name 4"]
    autogenerate_column_names => false
    separator => ";"
    skip_empty_rows => true
    skip_empty_columns => true
    convert => {"col name 2" => "date_time"}
  }
  date {
    match => ["col name 2","YYYY-MM-dd HH:mm:ss"]
    target => "col name 2"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mytestdata-%{+dd.MM.YYYY}"
    template_overwrite => true
    document_id => "%{[col name 1]}"
  }
  stdout {}
}
###############
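As a side note, to count how many events actually reach Logstash from Filebeat while debugging Problem 1, the stdout output can be switched to the rubydebug codec (a standard Logstash codec, shown here only as a debugging aid):

output {
  stdout { codec => rubydebug }
}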
Field mappings json file:
{
  "template" : "mytestdata-%{+dd.MM.YYYY}",
  "version" : 1,
  "mappings" : {
    "default" : {
      "properties" : {
        "@timestamp" : { "type": "date" },
        "col name 2" : { "type": "date", "format": "YYYY-MM-dd HH:mm:ss" },
        "col name 1" : { "type": "keyword" },
        "col name 3" : { "type": "keyword" },
        "col name 4" : { "type": "keyword" }
      }
    }
  }
}
####################
I then add those mappings to ES through:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_template/my_index_template?pretty -d @mytemplate.json
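To confirm the template is actually registered, it can be read back with the standard _template GET endpoint:

curl -XGET http://localhost:9200/_template/my_index_template?pretty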

What does the CSV file look like?

I cannot share the data, but this is the structure:

col name 1;col name 2; col name 3; col name 4
text 1;date;text 3;text 4
text 1;date;text 3;text 4
text 1;date;text 3;text 4

There are no numeric values in the CSV except a date field; everything else is a string/text field. That is why it's critical that each entry is represented by a single ES document (no duplicates from Logstash re-sending the same data to guarantee delivery), and why I'm trying to use a specific field as a fingerprint.

How many unique values are there in the first column? As you are using this as the document ID, only one document per unique ID will be created in Elasticsearch.

I see! So the order of columns is important? My first column has many repeated values; the column with unique values (IDs) is column 3. If I change the order and put it as the first column, would that solve some of the issues?

Rows are processed one by one, so only the number of unique values in the field you use as the document ID will affect the document count. If you remove the document_id parameter from the output, Elasticsearch will assign an ID to each row and all of them should get inserted.
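If what you want is de-duplication of re-sent events rather than collapsing rows, a common pattern is to hash the whole row and use that hash as the document ID. Here is a minimal sketch with the standard fingerprint filter, assuming the column names from your test file:

filter {
  fingerprint {
    # hash the concatenation of the fields that make a row unique
    source => ["col name 1", "col name 2", "col name 3", "col name 4"]
    concatenate_sources => true
    method => "SHA1"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mytestdata-%{+dd.MM.YYYY}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}

That way a row re-sent by Filebeat overwrites its existing document instead of creating a duplicate, while each distinct row still gets its own document.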

I swapped the columns and placed the one with the unique values as the first column, since that is the column I'm using as the document_id for fingerprinting. I also changed the column order in logstash.config accordingly, and it worked!!
So, column order IS important.
Thank you Christian.

Another question: my test data produces only 20 ES documents (entries), yet the index takes up 121.3kb according to the index management section. If I upload the file manually (drag and drop), the index is much smaller. Why is that? How can I reduce the storage consumption?
If I use my full data set with 45,000+ entries, the index grows to 18mb, which is pretty large for a CSV file.

Also, is there a recommended order for starting Logstash and Filebeat, or does it not matter? I tried both orders and they seem to behave the same so far, but I'm asking just in case.

The size your data takes up on disk depends on the mappings used, but also on the number of shards your index has, as compression typically improves as the shard size grows.
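You can see how that breaks down per index with the standard _cat API, for example (adjust the index pattern to whatever your daily indices are called):

curl "http://localhost:9200/_cat/indices/mytestdata-*?v&h=index,pri,rep,docs.count,store.size"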

Thanks for the help Christian, I'll have a look at the link.
We can consider this case closed.
