We have recently built out our Elasticsearch cluster and are now running
close to our full production volume through it. Unfortunately, the cluster
seems to run out of steam after a while and can't keep up.
The cluster consists of four physical machines, each with 32 CPUs and 252 GB
of memory. We run three ES instances on each box: a search instance, a
master instance and a data instance. We are currently inserting about
34k-40k documents per second; the documents vary in size but are usually in
the 1 KB-6 KB range (log entries). Our consumers run on separate hardware;
they pull the messages in, format them and send them to Elasticsearch. They
are written in Java and use the transport client plus the bulk API to send
the documents. We are on Elasticsearch 0.90.5 and Java 7u25.
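For reference, this is roughly what each consumer does to send a batch
(cluster name, host, index name and the document are placeholders here, not
our real values, and the real consumer adds many documents per request):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class LogConsumer {
    public static void main(String[] args) {
        // placeholder values -- the real consumer reads these from its config
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "logs-cluster")
                .build();
        TransportClient client = new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host-1", 9300));

        // build one bulk request; the real consumer batches many log entries per request
        BulkRequestBuilder bulk = client.prepareBulk();
        bulk.add(client.prepareIndex("2014.01.03", "logs")
                .setSource("{\"message\":\"example log entry\"}"));

        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        client.close();
    }
}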
We currently keep one index per day with the following settings:
"index.number_of_replicas" : "1",
"index.number_of_shards" : "8",
"index.indexing.slowlog.threshold.index.warn": "10s",
"index.refresh_interval" : "10s"
The issue is that the cluster does great for a while, sometimes up to 30-40
minutes, but then it starts acting as if it can't keep up. Our insert rate
becomes erratic (it should be fairly steady) and new documents take longer
and longer to show up in search results. We had seen requests queuing in the
bulk thread pool, so we tried increasing it to:
"threadpool.bulk.type" : "fixed",
"threadpool.bulk.size" : "1024",
"threadpool.bulk.queue_size" : "2000"
That seemed to get rid of the queueing, but it didn't change the fact that
the cluster stops keeping up. We also noticed that our merge times are all
over the place:
[2014-01-03 08:06:43,486][DEBUG][index.merge.scheduler ] [data] [2014.01.03][21] merge [_22u] done, took [29.2s]
[2014-01-03 08:06:44,520][DEBUG][index.merge.scheduler ] [data] [2014.01.03][13] merge [_1pv] done, took [4.3m]
[2014-01-03 08:06:44,683][DEBUG][index.merge.scheduler ] [data] [2014.01.03][29] merge [_yf] done, took [22.5m]
[2014-01-03 08:06:47,908][DEBUG][index.merge.scheduler ] [data] [2014.01.03][8] merge [_1y0] done, took [1.2m]
[2014-01-03 08:06:48,685][DEBUG][index.merge.scheduler ] [data] [2014.01.03][4] merge [_1w3] done, took [3.2m]
[2014-01-03 08:06:48,785][DEBUG][index.merge.scheduler ] [data] [2014.01.03][12] merge [_1vf] done, took [31.1s]
We are new to Elasticsearch, but we have to assume that merges taking that
long are bad, right?
Is this just a case of our cluster not being able to support our volume, or
are there settings we can tune to get this working properly?
TIA