So with this configuration you are getting OOM errors or Bulk rejections?
Yes, and I wonder if there's any difference in how bulk requests are handled between v2 and v5, because the front node keeps doing GC and it seems the objects won't be released, so eventually the old-gen GC takes a long time and it ends in OOM! Is there anything I should take care of with bulk requests to get the new version's performance benefits?
So you are getting both? Bulk rejections first, then OOM?
It's super unclear to me.
What do you mean by the front node? Do you have dedicated data: false, master: false coordinating nodes? If not, you should consider it IMO.
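For reference, a dedicated coordinating-only node in 5.x is just a node with every role switched off in its elasticsearch.yml, roughly like this (sketch only; keep your existing network and discovery settings):

# elasticsearch.yml on the coordinating-only node
node.master: false
node.data: false
node.ingest: false

Your client would then send its bulk (and search) requests to that node, which only routes requests and merges responses.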
Are you also running queries while injecting your data? Can you just try to inject without running any search?
I'd like to be sure we are trying to fix the right issue.
Can you also run this and share the output as a gist.github.com link?
curl -XGET localhost:9200/_cluster/settings
> front node

1. I mean the node of the cluster which receives requests and returns responses, because I have set that node's IP as the upload address in my client.
2. I do not know which of OOM and bulk rejections comes first, but the ES cluster won't respond once OOM occurs.
3. Actually, there are no queries: as you can see, the get and search thread_pool numbers are zero; if there were any, they would be related to Kibana.
4. Your advice of data: false, master: false coordinating nodes sounds good! But I do think something is wrong.
5. As you said, I have kept the default settings, so curl -XGET localhost:9200/_cluster/settings returns an empty result like:
{ "persistent": {}, "transient": {} }
Can you reduce the bulk size to something like 50?
Is your client sending multiple bulk requests in parallel or one by one?
I would also recommend spreading the indexing load across all nodes in the cluster instead of overloading a few.
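To illustrate spreading the load, here is just a shell sketch, assuming the bulk payloads already exist as NDJSON files listed in bulk_files.txt and that node1/node2/node3 stand in for your actual node addresses:

# rotate bulk requests across several nodes instead of always hitting one "front node"
HOSTS=(node1:9200 node2:9200 node3:9200)
i=0
while read -r bulkfile; do
  host=${HOSTS[i % ${#HOSTS[@]}]}
  curl -s -XPOST "http://$host/_bulk" --data-binary "@$bulkfile" > /dev/null
  i=$((i + 1))
done < bulk_files.txt

Most clients can do the same thing by configuring a list of hosts instead of a single address.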
Well, the data uploading service is in production and it works with ES 2.1.0; I added an upload branch to it to test ES 5.0, but I can't change it too much. Anyway, there are cases with multiple bulk requests in parallel, but not all cases.
Yes, that may be a solution, but I have to figure this out first; otherwise the upgrade proposal won't be approved.
@small-tomorrow is your index green, i.e. all shards allocated, on your 5.0 cluster (as far as I understand you have a second cluster for testing)?
Yes, I have two clusters, ES 2.1 and ES 5.0, but each cluster is totally independent, and I just post data by sending an HTTP POST request to each cluster.
BTW, both clusters have the same hardware configuration, e.g. number of CPUs, nodes, memory, etc., but the ES 2.1 cluster works fine.
You didn't answer my question: what's the state of the cluster, i.e. what does cluster health return?
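For example:
curl -XGET localhost:9200/_cluster/health?pretty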
It's yellow; there are some unassigned shards, and I have not allocated them.
OK, can you, just for testing, make sure that all shards are allocated, i.e. reduce the number of replicas to 0, at least for the indices you are indexing into, and see if you still run into OOM?
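If you want to see why a particular shard is unassigned, the allocation explain API in 5.x should tell you, e.g.:
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty'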
You mean I should try to turn the cluster status green (all shards allocated) by reducing the number of replicas to 0, and then see the results, is that it?
You can just set the number of replicas to 0; that should turn your cluster green.
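For example, something like this (your_index is a placeholder for the indices you are loading into; repeat per index as needed):
curl -XPUT 'localhost:9200/your_index/_settings' -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'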
I did as you said, and I'll let you know what I observe.
There is no OOM now, but it keeps doing GC, like:
[2016-11-30T18:40:36,849][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1601] overhead, spent [280ms] collecting in the last [1s]
[2016-11-30T18:40:37,905][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1602] overhead, spent [387ms] collecting in the last [1s]
[2016-11-30T18:40:38,057][INFO ][o.e.c.s.ClusterService ] [es-nmg02] added {{es-nmg49}{wX-6uJa2TeOo3-7_a9CcXg}{-Db5ltn1SY6WxA5tpLA3fA}{10.107.26.24}{10.107.26.24:9300},}, reason: zen-disco-receive(from master [master {es-nmg52}{RRGW9EtLRJ6N-F6BjCesyQ}{k-yeny4oR1O7gRXZHczNew}{10.107.26.27}{10.107.26.27:9300} committed version [800]])
[2016-11-30T18:40:38,061][WARN ][o.e.c.NodeConnectionsService] [es-nmg02] failed to connect to node {es-nmg49}{wX-6uJa2TeOo3-7_a9CcXg}{-Db5ltn1SY6WxA5tpLA3fA}{10.107.26.24}{10.107.26.24:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [es-nmg02][10.107.26.24:9300] connect_timeout[30s]
    at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:379) ~[?:?]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:403) ~[elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:377) ~[elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:285) ~[elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:110) [elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.cluster.NodeConnectionsService.connectToAddedNodes(NodeConnectionsService.java:83) [elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:674) [elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: io.netty.channel.AbstractChannel$AnnotatedSocketException: Connection reset by peer: /10.107.26.24:9300
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:347) ~[?:?]
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:627) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:513) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:467) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:437) ~[?:?]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) ~[?:?]
    ... 1 more
[2016-11-30T18:40:38,932][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1603] overhead, spent [465ms] collecting in the last [1s]
[2016-11-30T18:40:40,065][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1604] overhead, spent [308ms] collecting in the last [1.1s]
...
Then, when I checked the health, it had turned from green to red:
> {
> "cluster_name": "uaq-es5.0-online",
> "status": "red",
> "timed_out": false,
> "number_of_nodes": 25,
> "number_of_data_nodes": 25,
> "active_primary_shards": 1054,
> "active_shards": 1056,
> "relocating_shards": 0,
> "initializing_shards": 0,
> "unassigned_shards": 51,
> "delayed_unassigned_shards": 0,
> "number_of_pending_tasks": 0,
> "number_of_in_flight_fetch": 0,
> "task_max_waiting_in_queue_millis": 0,
> "active_shards_percent_as_number": 95.39295392953929
> }
Well, I think you are putting too much pressure on your cluster here, so GC can't keep up? Maybe don't send as many bulks as you are doing today?
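Just as a rough sketch of what I mean, assuming the same kind of pre-built NDJSON bulk files as above (the pause length is arbitrary):

# send the bulks one at a time, with a small pause in between, instead of firing them all in parallel
for f in bulk_*.ndjson; do
  curl -s -XPOST 'localhost:9200/_bulk' --data-binary "@$f" > /dev/null
  sleep 0.5
done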
Our service really does have that much data to upload, and the ES 2.1 cluster works just fine now. You know, I can't decrease the amount of data, but I still want to upgrade the ES version.
just to verify:
- all nodes have identical HW to the 2.1 cluster
- same JVMs, same settings on all the nodes?
- refresh intervals are the same?
- you are using the same durability settings, i.e. no async commit, on the 2.1 cluster?
- can you provide the GET _nodes/stats/jvm,thread_pool?pretty stats from the 2.1 cluster as well? (see the commands after this list)
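If it helps, something like this should show the relevant bits on both clusters (your_index is a placeholder; only settings that were explicitly changed show up):

curl -XGET 'localhost:9200/your_index/_settings?pretty'          # look for index.refresh_interval and index.translog.durability
curl -XGET 'localhost:9200/_nodes/stats/jvm,thread_pool?pretty'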
thanks