Logstash elasticsearch input error

Hello

I am extracting data from an HTTP server and parsing it with the Apache ingest pipeline, specifying which pipeline to use in the Logstash Elasticsearch output, and it indexes correctly. On top of that, I want to enrich the data already obtained, so I have decided to add an Elasticsearch input that collects the data every 2 minutes. My input currently looks like this:

input {
  # pull everything from the indices matching this pattern between the given dates
  elasticsearch {
    hosts => ["elk1:9200","elk2:9200","elk3:9200"]
    ssl => true
    user => elastic
    password => "xxxx"
    ca_file => '/path/cert.pem'
    index => "filebeat-7.16.3-mmmmm-http-server-2022*"
    schedule => "*/2 * * * *"
    size => 10000
    query => '{
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-2m",
            "lte": "now"
          }
        }
      }
    }'
  }
}

But there are some small differences. Is there any method to use the ingest pipeline and do the enrichment in the same ETL?

Otherwise, what would be the best config for the input?

It would be a great help. Thanks!!!

If you are going through Logstash on the first ingest, then yes, it should be possible to do it in one go.
What exactly are you trying to enrich/do?

Thanks for answering!

The first ETL consists of an input and an output with the parameter pipeline => apache-access.

The second has an Elasticsearch input to get the data parsed by ETL 1.

With the data already parsed by the Apache module, I want to extract words from the [url][original] field or digits from the [source][address] field.

My question is whether I could put a filter after the output, whether I can use something similar to :sql_last_value from the JDBC plugin, or whether there is a more efficient method to do this.

Thank you in advance!

Is there any reason you need such a two-stage pipeline? Filter plugins between the HTTP input and the output seem enough to extract some words or digits from a field.
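For example, a rough sketch of such a filter might look like this (the grok patterns and the target field names are only placeholders to adapt to your data):

filter {
  # hypothetical: copy the first word of url.original into a new field
  grok {
    match => { "[url][original]" => "%{WORD:[url][first_word]}" }
    tag_on_failure => ["_grok_url_failure"]
  }
  # hypothetical: copy the first group of digits of source.address into a new field
  grok {
    match => { "[source][address]" => "%{NUMBER:[source][address_digits]}" }
    tag_on_failure => ["_grok_address_failure"]
  }
}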

Because I don't know how to tell Logstash to use the filebeat-apache-access pipeline before indexing. So I need a second pipeline to parse all the fields that I want from the data previously indexed in Elasticsearch.

Can I download the pipeline in a suitable format to paste it into the Logstash pipeline config? Can I put a filter after an output? How can I use the module's pipeline and extract all the info that I want in the same pipeline?

So you are probably using that as the first pipeline.

Sorry, I'm not familiar with Filebeat and don't understand the filebeat-apache-access pipeline. But I suppose there is a similar Logstash input plugin to access your HTTP server.

Or using the Logstash beats input could be simpler.

What did you mean by "some small differences"?

Yes, I'm using the Beats input on Logstash port 5044, and in the output of that first pipeline I set the pipeline parameter so that the ingest pipeline preconfigured by the Elastic team is used.
When these parsed fields are indexed in Elasticsearch, I use another pipeline with an Elasticsearch input to pick up this data and parse it again.
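In config terms, the first pipeline looks roughly like this (a simplified sketch: the index name and date suffix are approximate, and the connection settings are the same ones shown in the input above):

input {
  beats {
    port => 5044
  }
}
output {
  elasticsearch {
    # same hosts / ssl / credentials as in the Elasticsearch input shown earlier
    hosts => ["elk1:9200","elk2:9200","elk3:9200"]
    # index name is approximate
    index => "filebeat-7.16.3-mmmmm-http-server-%{+YYYY.MM.dd}"
    # apply the preconfigured ingest pipeline at index time in Elasticsearch
    pipeline => "apache-access"
  }
}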

Yes, there are small differences. For example, in my first index I could have a count of 101 documents with status code = 200, while in the new one the count is 97. That is just one example.

I tried to correct that difference with the fingerprint plugin, creating an ID for every document by concatenating three fields ([event][created], [source][address] and [url][original]) and using it as document_id => "%{fingerprint_id}". It reduced the difference, but did not eliminate it.
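The fingerprint part looks roughly like this (simplified; the hash method here is just an example):

filter {
  fingerprint {
    # concatenate the three fields and hash them into a single id
    source => ["[event][created]", "[source][address]", "[url][original]"]
    concatenate_sources => true
    method => "SHA256"
    target => "fingerprint_id"
  }
}
output {
  elasticsearch {
    # connection settings as above
    document_id => "%{fingerprint_id}"
  }
}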

Part of the reason is that you query a 2-minute span every 2 minutes. There can be gaps and overlaps when a delay occurs somewhere. Even if the pipeline runs strictly every 2 minutes, there is up to about one second of delay (the index refresh interval) before indexed documents become searchable in Elasticsearch, so some documents will be missed.

You either have to run the 2-minute-span pipeline more frequently, or query a longer span every 2 minutes, and deduplicate the documents using the fingerprint as you said.

Using the update action reduces the indexing load on the Elasticsearch cluster.
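For example, something along these lines might work (a sketch only; the 3-minute span is an arbitrary choice, and the connection settings and index are the ones you already use):

input {
  elasticsearch {
    # same connection settings and index pattern as your current input
    schedule => "*/2 * * * *"
    # query a span slightly longer than the schedule interval so delayed documents are not missed
    query => '{
      "query": {
        "range": {
          "@timestamp": { "gte": "now-3m", "lte": "now" }
        }
      }
    }'
  }
}
output {
  elasticsearch {
    # same connection settings and index as before
    document_id => "%{fingerprint_id}"
    # update an existing document with the same id instead of indexing a duplicate,
    # and create it if it does not exist yet
    action => "update"
    doc_as_upsert => true
  }
}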

I have it set to 2-minute windows because if I increase the time span I exceed the Elasticsearch limitation of 10,000 documents...

Then run the pipeline more frequently, or connect Filebeat directly to Logstash and output to two indices from Logstash.

I can't increase the 10,000 limit, can I?

I don't know how to do it well, because if I increase the span I exceed the limit, but if I decrease it I think I will not get all the documents. My workflow is generating 5,000 documents per minute, so a 2-minute window is already right at the 10,000 limit.
How can I set one pipeline to write to two indices and balance the output between them?

Sorry for the inconvenience, and thank you very much for the help.

Raising the limit (the index.max_result_window index setting) is a possible solution.

Just use two output plugins, as in this example.
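A minimal sketch of that idea (the [@metadata][bucket] field, the ruby filter, and the index names are all hypothetical; adapt them to your data):

filter {
  # hypothetical: randomly assign each event to one of two buckets
  ruby {
    code => "event.set('[@metadata][bucket]', rand(2))"
  }
}
output {
  if [@metadata][bucket] == 0 {
    elasticsearch {
      # connection settings as before
      index => "my-index-a-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      # connection settings as before
      index => "my-index-b-%{+YYYY.MM.dd}"
    }
  }
}

Each event goes to exactly one of the two outputs, so the documents end up split roughly evenly between the two indices.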

And how can I increase it?

I see that, but I don't understand how to divide the data evenly between two indices.

Right now I am targeting 3 different data nodes.

Have you searched by yourself?

Yes, I have been trying things like the ones we mentioned for several days, but there are always some limitations or problems. I will continue investigating. Thanks for the help.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.