Logstash performance for indexing to ES

I'm trying to index documents into Elasticsearch from Logstash using the Kafka input. Indexing takes a long time, and I'm looking to index around 3 million documents as fast as possible. My Logstash config is:

    input {
      kafka {
        bootstrap_servers => "kafka2.dev:9092"
        topics => ["READY_FOR_INDEX", "INDEX_CSV"]
        codec => json
        consumer_threads => 8
      }
    }

    output {
      stdout { codec => rubydebug }
      if [type] == "fact" or [type] == "dimension" {
        elasticsearch {
          index => "%{index}"
          document_id => "%{id}"
          hosts => "xyz.amazonaws.com:9200"
          flush_size => 100000
        }
      } else {
        elasticsearch {
          index => "%{index}"
          document_id => "%{id}"
          hosts => "xyz.amazonaws.com:9200"
          flush_size => 100000
        }
      }
    }

My Logstash instance runs on an AWS machine with 32 GB of RAM and 8 cores.

Right now it takes around 4 minutes for just 1000 documents. At that rate, indexing 3 million would take extremely long.
I was under the assumption that Logstash did ES bulk indexing.
Please help.

Right now just for 1000 documents it takes around 4 minutes.

Four minutes for 1000 docs? That's ridiculously slow. Logstash itself shouldn't have any problems processing hundreds of events per second.

I was under the assumption that logstash did ES bulk indexing.

It does.

Have you measured what the bottleneck is? What event rate do you get with a kafka input and e.g. a file output? What event rate do you get against ES without using Logstash?
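One minimal way to isolate the Kafka read rate (a sketch, reusing the broker, topics, and consumer settings from the config above; the output path is a hypothetical choice) is to keep the same input but swap the elasticsearch outputs for a file output, then time how long the run takes:

    input {
      kafka {
        bootstrap_servers => "kafka2.dev:9092"
        topics => ["READY_FOR_INDEX", "INDEX_CSV"]
        codec => json
        consumer_threads => 8
      }
    }

    output {
      # Write events to a flat file instead of ES to measure the
      # pure Kafka consumption rate with no indexing involved.
      file {
        path => "/tmp/kafka-rate-test.log"
      }
    }

If this config drains the topics quickly, the bottleneck is on the Elasticsearch side rather than in the Kafka input.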


I think the Kafka input has the enable_metrics field set to true in Logstash by default. But how do I see where the metrics are stored? I guess that's how I find out how fast the events are being processed.

Thanks

You should be able to use Logstash's HTTP API for that.
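For example (assuming the default API host and port, localhost:9600), the node stats endpoint reports pipeline event counts and per-plugin timings:

    curl -s 'http://localhost:9600/_node/stats/pipeline?pretty'

The `events.in` / `events.out` counters in the response, sampled a few seconds apart, give a rough events-per-second rate for the pipeline.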

Does the throughput also depend on how fast we write to Kafka?
For example, if I write to Kafka serially versus in parallel?

Thanks

Unless you have a backlog in Kafka the read rate can obviously not be greater than the write rate. A sufficiently high inflow of new events could also affect the read rate if the server is heavily loaded, but I don't really know enough about Kafka to say if it employs any countermeasures against that (like RabbitMQ's throttling of publishers when the consumers can't keep up).
