I am extracting data from an HTTP server through the Apache ingest pipeline, specifying the pipeline to use in the Logstash output, and it indexes correctly. On the other hand, I want to enrich the data already obtained, so I have decided to use an Elasticsearch input that collects the data every 2 minutes. My input currently looks like this:
input {
  # pull everything from the indices matching the index pattern within the given date range
  elasticsearch {
    hosts    => ["elk1:9200","elk2:9200","elk3:9200"]
    ssl      => true
    user     => "elastic"
    password => "xxxx"
    ca_file  => '/path/cert.pem'
    index    => "filebeat-7.16.3-mmmmm-http-server-2022*"
    schedule => "*/2 * * * *"
    size     => 10000
    query    => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lte": "now"
          }
        }
      }
    }'
  }
}
But there are some small differences between the two indices. Is there any method to use the pipeline and enrich the data in the same ETL?
Otherwise, what would be the best config for the input?
The first ETL consists of an input and an output with the parameter pipeline => apache-access.
The second has an Elasticsearch input to get the data parsed by ETL 1.
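For reference, a minimal sketch of what ETL 1 looks like in this setup, assuming the events arrive over beats; the port 5044 and the pipeline name apache-access come from this thread, while the hosts and everything else are illustrative:

input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    hosts    => ["elk1:9200","elk2:9200","elk3:9200"]
    # hand each event to the preconfigured Apache ingest pipeline for parsing
    pipeline => "apache-access"
  }
}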
With the data already parsed by the Apache module, I want to extract words from the [url][original] field and digits from the [source][address] field.
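For illustration, an extraction like that could be done with grok filters in the second pipeline, since the parsed ECS fields already exist there; the patterns and the target names url_word and address_digits are assumptions, not the actual config:

filter {
  grok {
    # grab the first word of the original URL (illustrative pattern)
    match => { "[url][original]" => "%{WORD:url_word}" }
  }
  grok {
    # grab the leading digits of the source address (illustrative pattern)
    match => { "[source][address]" => "%{INT:address_digits}" }
  }
}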
My question is whether I could put a filter after the output, or whether there is something similar to the :sql_last_value of the jdbc plugin, or whether there is a more efficient method to do this.
Is there any reason you need such a two-stage pipeline? Filter plugins between the HTTP input and some output seem enough to extract some words or digits from a field.
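A sketch of that single-stage idea, assuming the events carry the raw Apache access line in [message] and arrive over beats; HTTPD_COMBINEDLOG is a stock grok pattern, and the rest is illustrative:

input {
  beats {
    port => 5044
  }
}
filter {
  # parse the raw access line in Logstash itself instead of an ingest pipeline
  grok {
    match => { "message" => "%{HTTPD_COMBINEDLOG}" }
  }
  # ...the word/digit extraction filters sketched above could follow here...
}
output {
  elasticsearch {
    hosts => ["elk1:9200","elk2:9200","elk3:9200"]
  }
}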
Because I don't know how to tell Logstash to use the filebeat-apache-access pipeline before indexing, I need a second pipeline to parse all the fields that I want from the data previously indexed in Elasticsearch.
Can I download the pipeline in a format I can paste into the Logstash pipeline config? Can I put a filter after an output? How can I use the module pipeline and extract all the info that I want in the same pipeline?
Sorry, I'm not familiar with Filebeat and don't understand the filebeat-apache-access pipeline. But I suppose there is a similar Logstash input plugin to access your HTTP server.
Or using the Logstash beats input could be simpler.
Yes, I'm using the Filebeat input via the Logstash port 5044, and at the output I set the pipeline parameter to use the pipeline preconfigured by the Elastic team; that is the first pipeline.
When these parsed fields are indexed into Elasticsearch, I use another pipeline with an Elasticsearch input to pick up this data and parse it again.
Yes, there are small differences. For example, in my first index I could have a count of 101 documents with status code = 200, while in the new one the count is 97. That's just an example.
I tried to correct that difference with the fingerprint plugin, creating an ID for all the documents by concatenating 3 fields ([event][created], [source][address] and [original][url]) and using it as the document_id => "%{fingerprint_id}". It reduced the difference, but did not eliminate it.
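A minimal sketch of that fingerprint setup; the three source fields come from the post above, while the method, key and target name are assumptions:

filter {
  fingerprint {
    # hash the three concatenated fields into one stable document ID
    source              => ["[event][created]", "[source][address]", "[original][url]"]
    concatenate_sources => true
    method              => "SHA256"
    key                 => "fingerprint-key"   # arbitrary HMAC key, assumed
    target              => "fingerprint_id"
  }
}
output {
  elasticsearch {
    hosts       => ["elk1:9200","elk2:9200","elk3:9200"]
    # re-processed events overwrite the same document instead of duplicating it
    document_id => "%{fingerprint_id}"
  }
}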
Part of the reason is that you use a 2-minute span every 2 minutes. There can be gaps and overlaps when a delay occurs somewhere. Even if the pipeline runs strictly every 2 minutes, there is up to about one second of delay before indexed documents become searchable in Elasticsearch, so some documents will be dropped.
You have to either run the 2-minute-span pipeline more frequently, or query a longer span every 2 minutes, and deduplicate the documents by using the fingerprint as you said.
Using the update action reduces the indexing load on the Elasticsearch cluster.
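Sketched out, that suggestion might look like this: an overlapping lookback window (the 5-minute span is an arbitrary example) combined with the fingerprint ID and the update action; doc_as_upsert is assumed so that first-seen documents are still created:

input {
  elasticsearch {
    hosts    => ["elk1:9200","elk2:9200","elk3:9200"]
    index    => "filebeat-7.16.3-mmmmm-http-server-2022*"
    # still runs every 2 minutes, but looks back 5, so windows overlap instead of leaving gaps
    schedule => "*/2 * * * *"
    query    => '{ "query": { "range": { "@timestamp": { "gte": "now-5m", "lte": "now" } } } }'
  }
}
output {
  elasticsearch {
    hosts         => ["elk1:9200","elk2:9200","elk3:9200"]
    document_id   => "%{fingerprint_id}"
    # with the update action, identical re-reads become no-ops instead of full re-indexes
    action        => "update"
    doc_as_upsert => true
  }
}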
I don't know how to do that, because if I increase the span I exceed the size limit, but if I decrease it I think I will not get all the documents. My workflow generates 5,000 documents per minute.
How can I point one pipeline at two indices and balance the outputs?
Sorry for the inconvenience, and thank you very much for the help.
Yes, I have been trying things like the ones we mentioned for several days, but there are always some limitations or problems. I will keep investigating; thanks for the help.