Performance degradation on bulk insert

(swood) #1


I use ElasticSearch to collect application logs, and the log volume is huge. I have 17 physical servers with 8 nodes on each, and each server has 4 disks.
My config looks like this:

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.timeout: 5s
discovery.zen.ping.unicast.hosts: [""]
gateway.expected_nodes: 2
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.type: local
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.number_of_replicas: 2
index.number_of_shards: 4
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.query.warn: 10s
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.warn: 1000ms
network.publish_host: "server_1"
node.data: true
node.master: true
path.data: /var/www/elastic,/var/www/elastic,/var/www/elastic,/var/www/elastic
path.logs: /var/log/elasticsearch
transport.tcp.port: 9300
http.port: 9200
cluster.routing.allocation.disk.watermark.low: 1gb
cluster.routing.allocation.disk.watermark.high: 500mb
cluster.routing.allocation.node_concurrent_recoveries: 4
cluster.routing.allocation.node_initial_primaries_recoveries: 8
indices.recovery.concurrent_streams: 8
indices.recovery.max_bytes_per_sec: 100mb
threadpool.bulk.queue_size: 50000
index.query.default_field: host
index.refresh_interval: 1m
index.translog.interval: 30
index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 512m
indices.memory.index_buffer_size: 30%
index.store.type: mmapfs

Each node uses its own path on all disks.
For inserting data I use Logstash. To transport data from my servers to Logstash I use logstash_forwarder with spool-size=512.
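For context, a logstash_forwarder setup usually consists of a small JSON config plus the spool-size flag on the command line. The host, port, certificate path, and log paths below are illustrative placeholders, not values from this thread:

```
# forwarder.conf (hypothetical example)
{
  "network": {
    "servers": [ "logstash-host:5043" ],
    "ssl ca": "/etc/pki/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    { "paths": [ "/var/log/app/*.log" ], "fields": { "type": "applog" } }
  ]
}
```

Invoked as `logstash-forwarder -config forwarder.conf -spool-size 512`, where spool-size caps how many events are buffered before a flush is forced.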

When the cluster has only a few nodes, everything works fine. But when the cluster grows, Logstash begins to wait on ES with messages like:

"Failed to flush outgoing items", :outgoing_count=>4872, :exception=>java.lang.OutOfMemoryError: Java heap space, :backtrace=>[], :level=>:warn}"

My attempts to increase the memory available to Logstash have not been successful.
Maybe my ElasticSearch configuration is wrong?
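One lever on the Logstash side is the size of the bulk batches the elasticsearch output accumulates before flushing; in Logstash 1.x this was controlled by `flush_size` and `workers`. A sketch, with illustrative values that are assumptions, not settings from this thread:

```
# Logstash output sketch (values are illustrative)
output {
  elasticsearch {
    host => "localhost"
    protocol => "http"
    flush_size => 500   # smaller bulk batches -> less memory held per flush
    workers => 2        # number of parallel output workers
  }
}
```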

(Mark Walkom) #2

If you are getting this error then this is a LS problem, not an ES one.
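Since the OutOfMemoryError is thrown inside Logstash's own JVM, the first knob is the Logstash heap. In Logstash 1.x this is typically set through the LS_HEAP_SIZE environment variable read by the startup scripts; the 2g value below is an illustrative assumption:

```
# e.g. in /etc/default/logstash (or exported before running bin/logstash)
LS_HEAP_SIZE=2g
```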

(swood) #3


Yes, you're right. This is a complex problem.
For now I've changed index.refresh_interval to 15m and increased the number of physical servers to 50.
But I have about 600 clients importing data into LS and ES. Five minutes after starting, the LS process stopped. It is waiting on ES, but I don't know why..
What am I doing wrong?
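Dynamic index settings like refresh_interval can also be changed on a live cluster rather than in the config file. The host and the `_all` target below are placeholders; this assumes a running ES 1.x node:

```
curl -XPUT 'http://localhost:9200/_all/_settings' -d '{
  "index": { "refresh_interval": "15m" }
}'
```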

(Mark Walkom) #4

A few things. Having 4 path.data entries that point to the same location won't do anything. Increasing the thread pools like that will likely cause more problems than it's worth. And setting index.translog.interval to a bare 30 means 30 milliseconds, not seconds.
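If 30 seconds was the intent, the time unit needs to be explicit, since a bare number is read as milliseconds:

```
index.translog.interval: 30s
```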

You're also likely to be running into IO contention with that many nodes on that many disks, which won't help.

How much heap have you assigned to LS, to ES?
How much data do you have in the cluster?
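Both questions can be answered from the _cat APIs on any node; the hostname below is a placeholder and the commands assume a running cluster:

```
curl 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'
curl 'localhost:9200/_cat/indices?v'
```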

(system) #5