Optimizing Resource Usage on Startup, Search and Reindex


Hi All,
We are running a single-node Elasticsearch instance with ~200 indices of 15-20 GB each (populated by Logstash, one index per day) on a server with 40 logical CPUs and 128 GB of RAM. We recently upgraded from ELK Stack version 4 (ES 2.4) to version 5.6 and noticed that the new setup takes significantly longer for reindexing, searches and startup (until all indices are green), while CPU usage and I/O rate stay very low.

The server resources should be sufficient. I assume this because when we hold back the logs for a longer period and then start Logstash, we see a very heavy load on the server (all cores at maximum, I/O writes up to 200 MB/s), and all logs are processed and pushed into ES within a very short time. So we are very satisfied with that part.

What I do not understand is how to tune resource usage for reindexing, for checking indices after a restart, and for searches. When we recently needed to reindex all of the data (changing string fields to keyword/text), the server load stayed low the whole time (only one or two cores at 100% and I/O read/write at most 20 MB/s), and the full reindex took about a week.
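For reference: in 5.x (since 5.1) the _reindex API accepts a `slices` parameter that splits one reindex into several parallel sub-requests, and a plain, unsliced reindex of one index at a time would be consistent with only one or two busy cores. A minimal sketch, assuming Python 3 and only the standard library, that builds (but does not send) such a request. The index names are hypothetical, and the slice count and batch size (`source.size`, 1000 is the documented default) are assumptions, not tuned values.

```python
import json
from urllib.request import Request


def build_reindex_request(source_index, dest_index, slices=8,
                          batch_size=1000, host="http://localhost:9200"):
    """Build (but do not send) a sliced _reindex request.

    slices: number of parallel sub-requests (integer; "auto" only exists in 6.1+).
    wait_for_completion=false makes ES return a task id instead of blocking.
    """
    url = f"{host}/_reindex?slices={slices}&wait_for_completion=false"
    body = {
        # "size" inside "source" is the scroll batch size per request.
        "source": {"index": source_index, "size": batch_size},
        "dest": {"index": dest_index},
    }
    return Request(url,
                   data=json.dumps(body).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="POST")


# Hypothetical daily index names; send with urllib.request.urlopen(req).
req = build_reindex_request("logstash-2018.03.01", "logstash-2018.03.01-v2")
```

Because each daily index is independent, several such requests (one per index) could also run side by side; how many is worth testing against the observed CPU and I/O headroom.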

The same happens when we restart ES: CPU usage and I/O rate stay low, and it takes up to 10 minutes for all indices to become green.
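One setting that may be relevant here: on a single node only primary shards can be assigned, so after a restart every shard is recovered from local disk, and the number of primaries one node recovers at the same time is limited by `cluster.routing.allocation.node_initial_primaries_recoveries` (documented default: 4). With ~200 daily indices at the 5.x default of 5 shards each, that would be about 1000 shards (the actual shard count is not shown in this post). A minimal sketch that builds, but does not send, a persistent cluster-settings update raising that limit; the value 16 is an arbitrary assumption, not a tested recommendation.

```python
import json
from urllib.request import Request


def initial_primaries_request(concurrent_primaries, host="http://localhost:9200"):
    """Build (but do not send) a PUT _cluster/settings request that changes
    how many primary shards a node recovers from local disk in parallel.

    "persistent" is used so the value survives the next full restart.
    """
    body = {
        "persistent": {
            "cluster.routing.allocation.node_initial_primaries_recoveries":
                concurrent_primaries
        }
    }
    return Request(f"{host}/_cluster/settings",
                   data=json.dumps(body).encode("utf-8"),
                   headers={"Content-Type": "application/json"},
                   method="PUT")


req = initial_primaries_request(16)  # 16 is an assumed value for illustration
```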

The same is true for large searches from Kibana (spanning several weeks up to a month): server load (CPU and I/O) is low, and the search takes up to 15 minutes.

I would understand things taking longer if the server resources were maxed out, but nothing I checked is at its limit. So the question is: where do I need to start tuning so that the available resources are used as fully as possible, to reduce the time for reindexing, startup and search?

The second question is: is it recommended to run multiple nodes on one server to get better response times and resource usage?

Thanks
Chris

Here is my node info (abbreviated):

curl -XGET 'http://localhost:9200/_nodes?pretty'
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "elasticsearch",
  "nodes" : {
    "VDvA-RbVQxyQi5Ei6IsF7g" : {
      ...
      "version" : "5.6.8",
      "build_hash" : "688ecce",
      "total_indexing_buffer" : 25696527974,
      ...
      "indices" : {
        "fielddata" : {
          "cache" : {
            "size" : "15%"
          }
        },
        "memory" : {
          "index_buffer_size" : "30%",
          "min_index_buffer_size" : "96mb"
        }
      },
      ...
      "os" : {
        "refresh_interval_in_millis" : 1000,
        "name" : "Linux",
        "arch" : "amd64",
        "version" : "4.4.0-116-generic",
        "available_processors" : 40,
        "allocated_processors" : 32
      },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 36125,
        "mlockall" : true
      },
      ...
        "input_arguments" : [
          "-Xms80g",
          "-Xmx80g",
          "-XX:+UseConcMarkSweepGC",
          "-XX:CMSInitiatingOccupancyFraction=75",
          "-XX:+UseCMSInitiatingOccupancyOnly",
          "-XX:+AlwaysPreTouch",
          "-Xss1m",
          "-Djava.awt.headless=true",
          "-Dfile.encoding=UTF-8",
          "-Djna.nosys=true",
          "-Djdk.io.permissionsUseCanonicalPath=true",
          "-Dio.netty.noUnsafe=true",
          "-Dio.netty.noKeySetOptimization=true",
          "-Dio.netty.recycler.maxCapacityPerThread=0",
          "-Dlog4j.shutdownHookEnabled=false",
          "-Dlog4j2.disable.jmx=true",
          "-Dlog4j.skipJansi=true",
          "-XX:+HeapDumpOnOutOfMemoryError",
          "-Des.path.home=/usr/share/elasticsearch"
        ]
      },
      "thread_pool" : {
        "force_merge" : {
          "type" : "fixed",
          "min" : 1,
          "max" : 1,
          "queue_size" : -1
        },
        "fetch_shard_started" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 64,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "listener" : {
          "type" : "fixed",
          "min" : 10,
          "max" : 10,
          "queue_size" : -1
        },
        "index" : {
          "type" : "fixed",
          "min" : 32,
          "max" : 32,
          "queue_size" : 200
        },
        "refresh" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 10,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "generic" : {
          "type" : "scaling",
          "min" : 4,
          "max" : 128,
          "keep_alive" : "30s",
          "queue_size" : -1
        },
        "warmer" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "search" : {
          "type" : "fixed",
          "min" : 49,
          "max" : 49,
          "queue_size" : 1000
        },
        "flush" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "fetch_shard_store" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 64,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "management" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        },
        "get" : {
          "type" : "fixed",
          "min" : 32,
          "max" : 32,
          "queue_size" : 1000
        },
        "bulk" : {
          "type" : "fixed",
          "min" : 32,
          "max" : 32,
          "queue_size" : 200
        },
        "snapshot" : {
          "type" : "scaling",
          "min" : 1,
          "max" : 5,
          "keep_alive" : "5m",
          "queue_size" : -1
        }
      },
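To compare these values against general guidance, a minimal sketch that summarises one node entry from the GET _nodes response. It assumes the standard layout where the JVM flags live under `jvm.input_arguments` (that key is elided in the dump above), and it takes total RAM as a parameter because _nodes does not report it. The thresholds are general JVM/Elasticsearch guidance rather than values from this post: above roughly 32 GB the JVM can no longer use compressed object pointers, and the usual advice is to give the heap at most about half of RAM so the rest stays available for the filesystem cache. The expected search pool size uses the 5.x default formula (allocated processors × 3 / 2 + 1), which matches the 49 threads shown; the 32 of 40 processors is, as far as I know, the default 5.x cap on `processors`.

```python
import re


def parse_heap_gb(jvm_args):
    """Return the -Xmx value in GB from a list of JVM arguments, or None."""
    for arg in jvm_args:
        match = re.fullmatch(r"-Xmx(\d+)([gGmM])", arg)
        if match:
            size, unit = int(match.group(1)), match.group(2).lower()
            return size if unit == "g" else size / 1024
    return None


def summarize_node(node, ram_gb):
    """Summarise heap and processor figures from one node entry of GET _nodes."""
    heap_gb = parse_heap_gb(node["jvm"]["input_arguments"])
    allocated = node["os"]["allocated_processors"]
    return {
        "heap_gb": heap_gb,
        # Assumed threshold: compressed oops stop working at roughly 32 GB of heap.
        "heap_above_compressed_oops_limit": heap_gb is not None and heap_gb >= 32,
        # Share of physical RAM taken by the heap; the rest is left to the OS cache.
        "heap_fraction_of_ram": None if heap_gb is None else heap_gb / ram_gb,
        "allocated_processors": allocated,
        "available_processors": node["os"]["available_processors"],
        "search_threads": node["thread_pool"]["search"]["max"],
        # ES 5.x default search pool size: allocated * 3 / 2 + 1.
        "expected_search_threads": allocated * 3 // 2 + 1,
    }
```

Fed the values from the dump above (-Xmx80g, 40 available and 32 allocated processors, 49 search threads) with ram_gb=128, it reports a heap of 80 GB above the compressed-oops limit, a heap fraction of 0.625, and an expected search pool of 49.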

