We're using a Redis -> Logstash -> Elasticsearch pipeline
Our test system is a single-instance installation with 8 CPUs and 12 GB RAM, running on VMware
Currently we can't separate the components, so no clustering is possible (this will change in the future)
Here’s the Logstash redis input
redis {
data_type => "list“
host => "${REDIS_HOST:127.0.0.1}“
key => "import“
password => „xxx“
threads => "2"
codec => json { charset => "ASCII“ }
}
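For reference, one input-side knob we haven't touched yet is batch_count, which controls how many events the redis input pulls from the list per round trip. A minimal sketch, assuming the logstash-input-redis version we run supports it (250 is just a placeholder to experiment with, not a recommendation):

redis {
  data_type => "list"
  host => "${REDIS_HOST:127.0.0.1}"
  key => "import"
  password => "xxx"
  threads => "2"
  # fetch events from the Redis list in batches per round trip (placeholder value)
  batch_count => 250
  codec => json { charset => "ASCII" }
}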
The messages are saved in two indexes (two outputs in Logstash):
single-index collects all messages
summary-index collects special messages; a Groovy script is used to create a summary record, which is updated frequently by id (5-x times)
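For context, both outputs shown below sit in the same output section, and a conditional decides which messages reach the summary output. Simplified, with [summary] as a placeholder for our real condition and the full output bodies following below:

output {
  # every message is indexed into single-index
  elasticsearch {
    index => "single-index"
    hosts => ["127.0.0.1"]
  }
  # only the special messages go through the summary upsert output
  if [summary] {
    elasticsearch {
      index => "summary-index"
      # ... scripted upsert settings as shown further below
    }
  }
}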
Here's the Logstash output for the single-index:
elasticsearch {
  index => "single-index"
  hosts => ["127.0.0.1"]
}
Here's the Logstash output for the summary-index:
elasticsearch {
  action => "update"
  document_id => "%{uniqueId}"
  index => "summary-index"
  script => "summarize"
  script_lang => "groovy"
  script_type => "file"
  scripted_upsert => true
  retry_on_conflict => 5
  hosts => ["127.0.0.1"]
}
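For completeness, the 2.x-era elasticsearch output (the versions that still accept Groovy file scripts) also exposes flush_size and workers, which we could experiment with on this output. A sketch with placeholder values, not a tested recommendation:

elasticsearch {
  action => "update"
  document_id => "%{uniqueId}"
  index => "summary-index"
  script => "summarize"
  script_lang => "groovy"
  script_type => "file"
  scripted_upsert => true
  retry_on_conflict => 5
  hosts => ["127.0.0.1"]
  # buffer more events per bulk request (placeholder value)
  flush_size => 2000
  # run several output workers in parallel (placeholder value)
  workers => 4
}

With several workers updating the same document_id concurrently, version conflicts become more likely, which is what retry_on_conflict is there to absorb.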
Originally, only a part of the messages was sent to the summary-index.
For this scenario, the indexing rate was OK (max about 9000/s).
Now we've got data that is stored in both indexes, and while it's clear that the performance can't match the mixed scenario, we didn't expect the numbers we're seeing.
Sending 1000 msg/s to Redis (20 messages per id -> 50 summary records/s)
Result:
Only a 1600/s indexing rate (2000/s would be enough to keep pace, since every message produces one document for the single-index plus one summary update). The odd thing is that the system shows a CPU usage of 50% and a load average of 4, so there seems to be headroom for a higher rate.
By deactivating the single-message pipeline:
700/s
By deactivating the Groovy script (single-message pipeline still deactivated):
1000/s
So the question is: how can we improve this performance?
It’s clear that the upserts are the bottleneck.
thx