Logstash Sending Old Data to ES

Hi,
My pipeline is: filebeat -> logstash -> ES -> kibana. I use it to read CSV files from a folder and index/mirror any changes or updates in Kibana in near real time.

Problem 1:
My test CSV file has 16 entries, but only 3 are reaching Kibana; I see only 3 entries in Kibana's ES index management section.

Problem 2:
I'm using one of the fields as a fingerprint to guarantee that only one instance of each entry is sent to ES. It hasn't worked so far.

Problem 3:
When I fire up the pipeline, Logstash sends old data to ES, data that has already been indexed before, even though the CSV file is no longer present in the specified folder. This tells me the old data is cached somewhere in the pipeline. Even after I delete everything in Filebeat's 'data' folder, this still happens.
I have read and experimented a lot with how Filebeat and Logstash work, and with how and why the files in the 'data' folder are created.
Why is old data being sent from Logstash to ES?

Question:
I terminate everything in my pipeline with Ctrl+C from the PowerShell window. That writes some files so that Logstash starts from where it left off the next time it is initiated. Could this be the cause of problem 3? Is there a recommended way to terminate them?
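For reference, this is roughly how I reset state between test runs; the paths below are just where my installations happen to live, so treat them as assumptions:

# stop Filebeat and Logstash first (Ctrl+C in their PowerShell windows)
# Filebeat keeps its read offsets (the registry) under its data folder;
# deleting it makes Filebeat re-read every file from the beginning on the next start
Remove-Item -Recurse -Force "C:\filebeat\data\registry"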

Thank you very much.

Please show your full configuration. Without this it is very hard to help.

Hello Christian, thank you for the reply. Below are my configs:

elasticsearch.yml: all commented out.

kibana.yml: all commented out.

filebeat modules: all disabled.

filebeat.yml:
############################
filebeat.inputs:
- type: log
  paths:
    - C:/filebeatTestPath/*.csv

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true

output.logstash:
  hosts: ["localhost:5044"]
#########################

logstash.config:
#########################
input {
  beats {
    port => 5044
  }
}

filter {
  csv {
    columns => ["col name 1","col name 2","col name 3","col name 4"]
    autogenerate_column_names => false
    separator => ";"
    skip_empty_rows => true
    skip_empty_columns => true
    convert => {"col name 2" => "date_time"}
  }
  date {
    match => ["col name 2","YYYY-MM-dd HH:mm:ss"]
    target => "col name 2"
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mytestdata-%{+dd.MM.YYYY}"
    template_overwrite => true
    document_id => "%{[col name 1]}"
  }
  stdout {}
}
###############
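As a side note, to count how many events actually reach Logstash from Filebeat while debugging Problem 1, the stdout output can be switched to the rubydebug codec (a standard Logstash codec, shown here only as a debugging aid):

output {
  stdout { codec => rubydebug }
}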
Field mappings json file:
{
  "template" : "mytestdata-%{+dd.MM.YYYY}",
  "version" : 1,
  "mappings" : {
    "default" : {
      "properties" : {
        "@timestamp" : { "type": "date" },
        "col name 2" : { "type": "date", "format": "YYYY-MM-dd HH:mm:ss" },
        "col name 1" : { "type": "keyword" },
        "col name 3" : { "type": "keyword" },
        "col name 4" : { "type": "keyword" }
      }
    }
  }
}
####################
I then add those mappings to ES through:
curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_template/my_index_template?pretty -d @mytemplate.json
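To confirm the template is actually registered, it can be read back with the standard _template GET endpoint:

curl -XGET http://localhost:9200/_template/my_index_template?pretty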

What does the CSV file look like?

I cannot share the data, but this is the structure:

col name 1;col name 2; col name 3; col name 4
text 1;date;text 3;text 4
text 1;date;text 3;text 4
text 1;date;text 3;text 4

There are no numeric values in the CSV except a date field; everything else is a string/text field. That is why it's critical that each entry is represented by a single ES document (no duplicates from Logstash re-sending the same data to guarantee delivery), and why I'm trying to use a specific field as a fingerprint.

How many unique values are there in the first column? As you are using this as the document ID, only one document per unique ID will be created in Elasticsearch.

I see! So the order of columns is important? My first column has many repeated values; the column with unique values (IDs) is column 3. If I change the order and put it as the first column, would that solve some of the issues?

Rows are processed one by one, so only the number of unique values in the field you use as the document ID will affect the document count. If you remove the document_id parameter from the output, Elasticsearch will assign an ID to each row and all of them should get inserted.
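If what you want is de-duplication of re-sent events rather than collapsing rows, a common pattern is to hash the whole row and use that hash as the document ID. Here is a minimal sketch with the standard fingerprint filter, assuming the column names from your test file:

filter {
  fingerprint {
    # hash the concatenation of the fields that make a row unique
    source => ["col name 1", "col name 2", "col name 3", "col name 4"]
    concatenate_sources => true
    method => "SHA1"
    target => "[@metadata][fingerprint]"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "mytestdata-%{+dd.MM.YYYY}"
    document_id => "%{[@metadata][fingerprint]}"
  }
}

That way a row re-sent by Filebeat overwrites its existing document instead of creating a duplicate, while each distinct row still gets its own document.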

I swapped the columns and placed the one with the unique values as the first column, since that is the column I'm using as the document_id for fingerprinting. I also changed the column order in logstash.config accordingly, and it worked!!
So, column order IS important.
Thank you Christian.

Another question: my test data produces only 20 ES documents (entries), yet the index takes up 121.3kb according to the index management section. If I upload the file manually (drag and drop), the index is much smaller. Why is that? How can I reduce the storage consumption?
If I use my full data set with 45,000+ entries, the index grows to 18mb, which is pretty large for a CSV file.

Also, is there a recommended order for starting Logstash and Filebeat, or does it not matter? I tried both orders and they seem to behave the same so far, but I'm asking just in case.

The size your data takes up on disk depends on the mappings used, but also on the number of shards your index has, as compression typically improves as the shard size grows.
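You can see how that breaks down per index with the standard _cat API, for example (adjust the index pattern to whatever your daily indices are called):

curl "http://localhost:9200/_cat/indices/mytestdata-*?v&h=index,pri,rep,docs.count,store.size"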

Thanks for the help Christian, I'll have a look at the link.
We can consider this case closed.
