Hi,
We have a test setup running ES 2.4.0 on a single m4.large. It used to work fine with the Nutch crawler indexing web pages into a content_store index and another worker reading those pages and indexing them into a search index, but we've switched to StormCrawler, which indexes much faster, and now we're running into problems. I've moved the search index to a separate server and upgraded the m4.large to an m4.xlarge with 16 GB of RAM, but that only postpones the problem. We have three clients reading from and writing to the main content_store index plus a few other indices, and they're streaming data constantly.
The heap keeps growing until it hits 99% and the GC starts taking over 30s, so all our clients start timing out with:
[#|2016-12-07T05:22:03.653Z|INFO |elasticsearch[Paralyzer][generic][T#187]|o.e.c.transport |[Paralyzer] failed to get node info for {#transport#-1}{x.x.x.x}{x.x.x.x:9300}, disconnecting...|Log4jESLogger.java:125|#]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][172.31.38.111:9300] cluster:monitor/nodes/liveness] request_id [545363] timed out after [5000ms]
And the Elasticsearch log shows:
[2016-12-07 06:00:36,981][WARN ][monitor.jvm ] [Loa] [gc][old][38612][2214] duration [28s], collections [1]/[28.9s], total [28s]/[6.2h], memory [7.9gb]->[7.8gb]/[7.9gb], all_pools {[young] [247.3mb]->[206.9mb]/[266.2mb]}{[survivor] [0b]->[0b]/[33.2mb]}{[old] [7.6gb]->[7.6gb]/[7.6gb]}
[2016-12-07 05:59:45,617][WARN ][index.indexing.slowlog.index] [Loa][content_store_v3][1] took[29.8s], took_millis[29816],
[2016-12-07 06:00:36,980][WARN ][index.search.slowlog.query] [Loa] [site_configs_v2][1] took[28s], took_millis[28015], types[siteConfig], stats[], search_type[SCAN], total_shards[5], source[{"size":10,"query":{"match_all":{}},"_source":{"includes":[],"excludes":[]}}], extra_source[],
[2016-12-07 06:03:05,321][WARN ][index.search.slowlog.fetch] [Loa] [content_store_v3][2] took[25.4s], took_millis[25465], types[contentStoreItem], stats[], search_type[QUERY_THEN_FETCH], total_shards[5], source[{"from":0,"size":500,"post_filter":{"terms":{"reIndex":["true"]}},"sort":[{"lastModified":{"order":"asc"}}]}], extra_source[],
We don't hit an OOM error, but the node just slows down until it's unusable, especially since the GC pauses are longer than the client timeout of 5s.
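For context, here's roughly how our clients connect (a minimal sketch, not our actual code; the cluster name and address are placeholders). As far as I understand it, the 5s timeout in the client log above is the transport client's default client.transport.ping_timeout, so one stopgap I've considered is raising it, though that just hides the pauses rather than fixing them:

```java
import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class ClientFactory {
    public static TransportClient buildClient() throws Exception {
        Settings settings = Settings.settingsBuilder()
                .put("cluster.name", "content-cluster")        // placeholder cluster name
                // default is 5s; raising it masks long GC pauses, it doesn't cure them
                .put("client.transport.ping_timeout", "30s")
                .build();
        return TransportClient.builder().settings(settings).build()
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("x.x.x.x"), 9300)); // placeholder address
    }
}
```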
What would be the best plan of attack here?
- Are there config options I've missed that could help?
- Is there a way to throttle clients automatically instead of them timing out with exceptions? (I've sketched what I mean below, after this list.)
- Should I reduce the default 5 shards and 1 replica to a single shard?
- Throw more hardware at the problem?
- Change to a multi-node cluster?
- Use doc values for the HTML content instead of "rawHtml": { "type": "string", "index": "no" }?
- Use something other than ES for the content store?
- ?
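On the throttling question, this is the kind of thing I had in mind: a BulkProcessor with limited concurrency and an exponential backoff policy, so the indexing clients back off when the node rejects bulk requests instead of blowing up. A minimal sketch, assuming an existing 2.x TransportClient; the index/type names are the ones from our logs, and I realise the backoff only kicks in on thread pool rejections, not on the ping timeouts above, so it may not be the whole answer:

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class ThrottledIndexer {
    public static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) { }
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) { }
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {
                // log the failure and re-queue the documents as appropriate
            }
        })
        .setBulkActions(500)                                  // flush every 500 docs
        .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB))   // or every 5 MB, whichever comes first
        .setConcurrentRequests(1)                             // at most one bulk request in flight
        // retry with exponential backoff when bulks are rejected, instead of failing outright
        .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 5))
        .build();
    }

    public static void index(BulkProcessor processor, String id, String json) {
        processor.add(new IndexRequest("content_store_v3", "contentStoreItem", id).source(json));
    }
}
```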
Thanks for any pointers and guidance!
Cheers,
Vaughn