How to Increase Indexing rate


(Venkatesh) #1

Stack: Logstash 2.1 --> Elasticsearch 2.1 --> Kibana 4.3.0
Logstash Input: File

Server Spec: RHEL, 8 Core, 16 GB RAM

Indexing Rate:
1 Cluster - 1 Node --> Approx 2500 to 3000 /Sec
1 Cluster - 2 Nodes --> Approx 5000 to 5500 /Sec
1 Cluster - 3 Nodes --> Approx 5800 to 6000 /Sec
Notes:
All the nodes are on the same server.
The documents are of medium size, with at most 8-10 fields.
Logstash was also tried with 2 and 4 worker threads. The filter expressions are simple ones.
The index refresh_interval is set to -1.
CPU utilization goes up to 85%.

To me, an indexing rate of 5000/sec is pretty slow, and the solution wouldn't work at that rate.
Can you please suggest a standard configuration to achieve, say, 50K documents indexed per second?
I know this is a generic, open-ended question. However, I would like to know the standard parameters to consider when sizing the stack.

Thanks


(Aaron Mildenstein) #2

@mvenkat_in, the bottleneck at 3 nodes is not likely to be Elasticsearch, but Logstash. As you can see, the speed increase is no longer linear after 2 nodes. Part of the gain up to that point comes from primary and replica shards being more distributed, so there is more disk I/O available. Why the ingest stops scaling is hard to say, because you haven't shown your Logstash configuration. Heavy filter usage can limit your output. A small Elasticsearch cluster can usually index much faster than a single Logstash instance with a moderate amount of event filtering can feed it.

How many Logstash instances do you have? Are you using a broker, like Redis/Kafka/RabbitMQ? What is your input source? What filters and/or other outputs do you have?

Without answers to these questions, we can only guess why your indexing speed is lower than you'd like.
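
One way to narrow this down in the meantime is to take the file input and the filters out of the picture and push synthetic events straight at Elasticsearch. The following is a minimal sketch, assuming the stock generator input plugin. The sample message, the event count, and the index name "loadtest" are made up for illustration; the host is the one from the configuration posted further down the thread.

```
input {
  generator {
    # Hypothetical sample line; a representative line from the real logs would be better.
    message => "sample log line for load testing"
    # Number of events to emit (assumed value); 0 would generate events forever.
    count => 1000000
  }
}

output {
  elasticsearch {
    hosts => "10.0.148.65:9200"
    # Throwaway index name chosen for this test only.
    index => "loadtest"
  }
}
```

If this reaches a noticeably higher rate than the real pipeline does, the limit is more likely on the Logstash side (input or filters). If it does not, Elasticsearch or the storage underneath is the more likely suspect. A single Logstash instance may itself cap this test, so the result is best read as a lower bound for what Elasticsearch can take.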


(Tin Le) #3

You are maxing out your HW I/O. Is this server using spinning disks or SSDs?

Tin


(Venkatesh) #4

Hi Aaron

I have only one Logstash instance running to read this specific input.
I don't use any broker.
The input source is a file (or a list of files from a directory), each 1.5 GB in size.

How can I run multiple instances of Logstash reading from the same input file, or how else can I increase Logstash throughput?

Or, to isolate whether the problem is in Logstash or Elasticsearch, is there any way to load test Elasticsearch directly (high-load indexing) and see whether I get a better indexing rate?

Below is the Logstash configuration file.


input {
  file {
    path => "/syshome/app/elk_agent/notif/data/ns*.log"
    start_position => "beginning"
    sincedb_path => "notif"
    type => "notif"
  }
}

filter {
  if [type] == "notif" {
    if [message] =~ "ElapsedTime" {
      grok {
        match => {"message" =>"%{DATESTAMP:trans_dtm} %{NOTSPACE} %{LOGLEVEL} %{SPACE} %{NOTSPACE} %{JAVAFILE} %{SPACE} %{NOTSPACE} %{NOTSPACE} %{WORD}:%{WORD:action}-%{WORD}:%{WORD:trans_id}-%{WORD}:%{WORD:msisdn}-%{WORD}:%{WORD:service}-%{WORD}:%{GREEDYDATA:attr_1}-%{WORD}:%{WORD:status}-%{WORD}:%{WORD:response_code}-%{WORD}:%{WORD:attr_2}-%{WORD}:%{WORD:attr_3}-%{WORD}:%{WORD:response_time}"}
      }

      mutate {
        remove_field => [ "message", "path", "type", "host" ]
        gsub => [ "trans_dtm", ",\d{3}$", "" ]
        convert => {
          "response_code" => "integer"
          "response_time" => "integer"
        }
        add_field => [ "response_desc", "%{status}" ]
      }

      grok {
        match => [ "trans_dtm", "^%{MONTHDAY:day}-%{MONTHNUM:month}-%{YEAR:year}" ]
      }

      mutate {
        add_field => { "[@metadata][index]" => "%{year}-%{month}-%{day}" }
        remove_field => [ "year", "month", "day" ]
      }

      mutate { add_field => { "[@metadata][type]" => "notif" } }

      date {
        "match" => [ "trans_dtm", "dd-MM-YYYY HH:mm:ss" ]
        target => "trans_dtm"
      }
    } else {
      drop { }
    }
  }
}

output {
  if "_grokparsefailure" not in [tags] {
    elasticsearch {
      hosts => "10.0.148.65:9200"
      index => "e2e-%{[@metadata][index]}"
      document_type => "%{[@metadata][type]}"
    }
  } else {
    file { path => "/elkapp/elk/app/logstash-2.1.0/logs/parseerr.log" }
  }
}



(Venkatesh) #5

Hi Tin
The disks are on SAN storage.
Below is the sar output.


(Tin Le) #6

Ah yeah, SAN is not recommended. At least you won't get high speed indexing there.

The other issue is that you only have one Logstash instance feeding your cluster. You'll need some form of buffer in the middle so you can spin up multiple Logstash instances as consumers. The buffer can be a message queue such as Redis or Kafka (Kafka is my recommendation).

I've gotten a much higher indexing rate than yours on a 1-node cluster (20x), but I had 25 Logstash instances feeding it. This was in a test environment running ES 1.7.3 on very fast SSDs, and it was very specific to my environment. YMMV.

Tin


(Venkatesh) #7

Hi Tin
Hmm, OK. It is Fibre Channel SAN storage, so I thought it would be faster.
But how can I dig deeper and isolate the problem area, i.e., slowness at the sending end (Logstash) or the receiving end (Elasticsearch)?

Is there any way to monitor Logstash throughput or queue sizes?
How can I run multiple Logstash instances in parallel, say, reading from the same input?

Thanks


(Tin Le) #8

You need to use a queue as a buffer. Something like this:

logfiles -> logstash -> message queue -> logstash (multiple) -> elasticsearch
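
As a rough sketch of that layout, the two Logstash roles could look like the following. This assumes Redis as the queue (Kafka would follow the same shape with its own plugins) and the stock redis input and output plugins. The host "redis.example.local" and the key "notif-logs" are hypothetical. The two sections are meant as separate configuration files, run by different Logstash instances.

```
# shipper.conf: one instance reads the files and pushes raw lines to the queue.
input {
  file {
    path => "/syshome/app/elk_agent/notif/data/ns*.log"
    start_position => "beginning"
    type => "notif"
  }
}
output {
  redis {
    # Hypothetical Redis host and list key.
    host => "redis.example.local"
    data_type => "list"
    key => "notif-logs"
  }
}

# indexer.conf: run several instances of this; each pops events from the same list.
input {
  redis {
    host => "redis.example.local"
    data_type => "list"
    key => "notif-logs"
  }
}
# The grok/mutate/date filter block from the original configuration would go here.
output {
  elasticsearch {
    hosts => "10.0.148.65:9200"
    index => "e2e-%{[@metadata][index]}"
  }
}
```

The point of the split is that the shipper does no parsing at all, so the expensive grok work can be spread across as many indexer instances as the hardware allows.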

Monitoring of Logstash metrics is rather lacking at the moment. The only thing available that I know of is the Logstash metrics filter.

https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html
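
For reference, here is a minimal sketch of how the metrics filter is typically wired in, adapted from the example in the plugin documentation linked above. The meter name "events" and the tag "metric" are arbitrary choices, and the exact field names (such as rate_1m) may differ between plugin versions.

```
filter {
  metrics {
    # Count every event passing through under the meter name "events".
    meter => "events"
    # Tag the periodic metric events so they can be routed separately.
    add_tag => "metric"
  }
}

output {
  if "metric" in [tags] {
    # Print the 1-minute moving average rate (events per second).
    stdout {
      codec => line { format => "rate: %{[events][rate_1m]}" }
    }
  }
}
```

In a real pipeline the existing elasticsearch output would need a matching "metric" not in [tags] condition, so that the metric events are not indexed alongside the log data.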

Tin


(Mike Simos) #9

Can you try using a separate input for each file instead of a wildcard? Each file input runs in a single thread. See if this makes any difference.
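
A sketch of what that might look like, assuming two files named ns1.log and ns2.log (hypothetical names that would match the original ns*.log glob). Separate sincedb_path values are used here on the assumption that each input should track its own read position.

```
input {
  file {
    # Hypothetical file name; one file block per log file.
    path => "/syshome/app/elk_agent/notif/data/ns1.log"
    start_position => "beginning"
    sincedb_path => "notif_ns1"
    type => "notif"
  }
  file {
    # Hypothetical file name.
    path => "/syshome/app/elk_agent/notif/data/ns2.log"
    start_position => "beginning"
    sincedb_path => "notif_ns2"
    type => "notif"
  }
}
```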


(Venkatesh) #10

Hi Mike
I understand your suggestion. However, the results I posted earlier are for a single file only.

