Update performance - very low indexing rate

We're using a Redis -> Logstash -> Elasticsearch pipeline.

Our test system is a single-instance installation with 8 CPUs and 12 GB RAM, running on VMware.
Currently we can't separate the components, so no clustering is possible (this will change in the future).

Here’s the Logstash redis input

redis {
    data_type => "list"
    host => "${REDIS_HOST:127.0.0.1}"
    key => "import"
    password => "xxx"
    threads => "2"
    codec => json { charset => "ASCII" }
}

The messages are saved in two indexes (two outputs in Logstash):

single-index collects all messages
summary-index collects special messages; a Groovy script is used to create a summary record per id, which is then updated frequently (5-x times)

Here's the Logstash output for the single-index

elasticsearch {
    index => "single-index"
    hosts => ["127.0.0.1"]
}

Here's the Logstash output for the summary-index

elasticsearch {
    action => "update"
    document_id => "%{uniqueId}"
    index => "summary-index"
    script => "summarize"
    script_lang => "groovy"
    script_type => "file"
    scripted_upsert => true
    retry_on_conflict => 5
    hosts => ["127.0.0.1"]
}

Originally, only a part of the messages was sent to the summary-index.

For this scenario, the indexing rate was fine (max. about 9000/s).

Now we have data that is stored in both indexes, and while it's clear that the performance can't match the mixed scenario, we didn't expect the numbers we got.

Sending 1000 msg/s to Redis (20 messages per id -> 50 summary records/s)

Result:
Only a 1600/s indexing rate (2000/s would be enough to keep pace). The odd thing is that the system has a CPU usage of 50% and a load average of 4, so there seems to be headroom for a higher rate.

With the single-message pipeline deactivated: 700/s

With the Groovy script deactivated as well (single-message pipeline still deactivated): 1000/s

So the question is: how can we improve this performance?
It's clear that the upserts are the bottleneck.

Thanks

What does the script do?

It extracts values from the message into the summary record (firstMessage, lastMessage, computed state, list of IPs, etc.).
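
To make this concrete, here is a minimal sketch of what such a file script could look like. This is not the original script: the field names (message, state, ip) and the event variable binding are assumptions (the Logstash elasticsearch output passes the event under script_var_name, which defaults to "event").

    // config/scripts/summarize.groovy - hypothetical sketch, not the real script
    // Assumes the event is bound as "event" and carries message, state and ip fields.

    if (ctx._source.firstMessage == null) {
        ctx._source.firstMessage = event.message   // remember the first message per uniqueId
    }
    ctx._source.lastMessage = event.message        // always overwrite with the latest message
    ctx._source.state = event.state                // recompute/overwrite the state

    if (ctx._source.ips == null) {
        ctx._source.ips = []
    }
    if (!ctx._source.ips.contains(event.ip)) {
        ctx._source.ips << event.ip                // collect the distinct IPs
    }

Every incoming message for a uniqueId triggers one such scripted upsert, which is why each summary document is rewritten 5-x times.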

Which version of Elasticsearch are you using?

Sorry, forgot to mention: Logstash + Elasticsearch 5.6.1

There was a change that affected certain types of update scenarios in ES 5.x, as outlined in the release notes. This was also discussed in this thread. Does this match how you are doing updates?

OK, yes, thanks. So the best way is to handle very frequent updates at the application level, right? I didn't find a fitting solution for this use case using Logstash (does anybody know of one?), so I'm working on a POC using another service.

One message from the thread you posted:

As of 5.0.0 the get API will issue a refresh if the requested document has been changed since the last refresh but the change hasn’t been refreshed yet. This will also make all other changes visible immediately. This can have an impact on performance if the same document is updated very frequently using a read modify update pattern since it might create many small segments. This behavior can be disabled by passing realtime=false to the get request.

realtime=false

Could that be an option to speed up the current solution (until I develop a new one)? Can I add this to the Logstash output?

thank you very much
