Did I hit the upper limit of the _bulk upload API?

So with this configuration, are you getting OOM errors or bulk rejections?

Yes, and I wonder whether bulk requests are handled differently between 2.x and 5.x, because the front node keeps doing GC and the objects never seem to be released, so eventually old-gen GC takes a long time and it ends in an OOM. Is there anything I should take care of with bulk requests to get the new version's performance?

So you are getting both? Bulk rejections first, then OOM?

It's super unclear to me.

What do you mean by the front node? Do you have dedicated coordinating nodes (data: false, master: false)?
If not, you should consider it IMO (see the config sketch at the end of this message).
Are you also running queries while indexing your data? Can you try indexing only, without running any searches?

I'd like to be sure we are trying to fix the right issue.
Can you also run this and share the output as a gist.github.com link?

curl -XGET localhost:9200/_cluster/settings
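
For reference, a coordinating-only node in 5.x is just a node with the other roles switched off in elasticsearch.yml; these are the standard 5.x role settings, the rest of the config stays whatever you already use:

# elasticsearch.yml on the coordinating-only ("front") node
node.master: false
node.data: false
node.ingest: false

Such a node holds no shards and is not master-eligible; it only routes requests and merges responses.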
  1. By "front node" I mean the node of the cluster that receives the requests and returns the responses, because I set that node's IP as the upload address in my client.
  2. I don't know whether the OOM or the bulk rejections come first, but the ES cluster stops responding once the OOM occurs.
  3. Actually, there are no queries; as you can see, the get and search thread_pool counters are zero. If there were any, they would come from Kibana.
  4. Your advice of data: false, master: false coordinating nodes sounds good, but I still think something is wrong.
  5. As you said, I have kept the default settings, so curl -XGET localhost:9200/_cluster/settings returns an empty result like:
{
   "persistent": {},
   "transient": {}
}

Can you reduce the bulk size to something like 50?

Is your client sending multiple bulk requests in parallel or one by one?

I would also recommend spreading the indexing load across all nodes in the cluster instead of overloading a few.
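
For illustration only (the index name, field, file name, and node addresses are placeholders), a smaller batch sent to different data nodes in turn could look like this:

# batch.ndjson: at most ~50 action/source line pairs, newline-delimited, e.g.
#   { "index": { "_index": "myindex", "_type": "mytype" } }
#   { "field1": "value1" }
curl -XPOST 'http://<node-1>:9200/_bulk' --data-binary @batch.ndjson
curl -XPOST 'http://<node-2>:9200/_bulk' --data-binary @batch.ndjson

Note --data-binary rather than -d: _bulk needs the newlines preserved, including the trailing one.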

Well, the data-upload service is in production and works with ES 2.1.0; I added an upload branch to it to test ES 5.0, but I can't touch it too much. Anyway, some cases do send multiple bulk requests in parallel, but not all of them.

Yes, that may be a solution, but I have to figure out the root cause; otherwise the upgrade plan won't be approved.

@small-tomorrow is your index green, i.e. are all shards allocated on your 5.0 cluster (as far as I understand, you have a second cluster for testing)?

Yes, I have two clusters, ES 2.1 and ES 5.0. Each cluster is totally independent, and I just post data to each of them with an HTTP POST request.

BTW, both clusters have the same hardware configuration (number of CPUs, nodes, memory, etc.), but the ES 2.1 cluster works fine.

You didn't answer my question: what's the state of the cluster, i.e. what does cluster health return?
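
For example, from any node of the 5.0 cluster:

curl -XGET 'localhost:9200/_cluster/health?pretty'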

It's yellow; there are some unassigned shards, and I have not allocated them.

OK, just for testing, can you make sure that all shards are allocated, i.e. reduce the number of replicas to 0 at least for the indices you are indexing into, and see if you still run into OOM?

You mean I should try to turn the cluster status green (all shards allocated), then reduce the number of replicas to 0 and see the results, is that it?

You can just set the number of replicas to 0; that should turn your cluster green.
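
For example (replace <your-index> with the indices you are writing to, or a wildcard pattern that matches them):

curl -XPUT 'localhost:9200/<your-index>/_settings' -d '{ "index": { "number_of_replicas": 0 } }'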

I did as you said, and I'll let you know what I observe.

There's no OOM now, but it keeps doing GC, like this:

[2016-11-30T18:40:36,849][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1601] overhead, spent [280ms] collecting in the last [1s]
[2016-11-30T18:40:37,905][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1602] overhead, spent [387ms] collecting in the last [1s]
[2016-11-30T18:40:38,057][INFO ][o.e.c.s.ClusterService   ] [es-nmg02] added {{es-nmg49}{wX-6uJa2TeOo3-7_a9CcXg}{-Db5ltn1SY6WxA5tpLA3fA}{10.107.26.24}{10.107.26.24:9300},}, reason: zen-disco-receive(from master [master {es-nmg52}{RRGW9EtLRJ6N-F6BjCesyQ}{k-yeny4oR1O7gRXZHczNew}{10.107.26.27}{10.107.26.27:9300} committed version [800]])
[2016-11-30T18:40:38,061][WARN ][o.e.c.NodeConnectionsService] [es-nmg02] failed to connect to node {es-nmg49}{wX-6uJa2TeOo3-7_a9CcXg}{-Db5ltn1SY6WxA5tpLA3fA}{10.107.26.24}{10.107.26.24:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [es-nmg02][10.107.26.24:9300] connect_timeout[30s]
       	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:379) ~[?:?]
       	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:403) ~[elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:377) ~[elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:285) ~[elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:110) [elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.cluster.NodeConnectionsService.connectToAddedNodes(NodeConnectionsService.java:83) [elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:674) [elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:894) [elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:237) [elasticsearch-5.0.0.jar:5.0.0]
       	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:200) [elasticsearch-5.0.0.jar:5.0.0]
       	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_45]
       	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_45]
       	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: io.netty.channel.AbstractChannel$AnnotatedSocketException: Connection reset by peer: /10.107.26.24:9300
       	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
       	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
       	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:347) ~[?:?]
       	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
       	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:627) ~[?:?]
       	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:513) ~[?:?]
       	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:467) ~[?:?]
       	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:437) ~[?:?]
       	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) ~[?:?]
       	... 1 more
[2016-11-30T18:40:38,932][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1603] overhead, spent [465ms] collecting in the last [1s]
[2016-11-30T18:40:40,065][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02] [gc][1604] overhead, spent [308ms] collecting in the last [1.1s]

...

Then, when I checked the health, it had turned from green to red:
{
   "cluster_name": "uaq-es5.0-online",
   "status": "red",
   "timed_out": false,
   "number_of_nodes": 25,
   "number_of_data_nodes": 25,
   "active_primary_shards": 1054,
   "active_shards": 1056,
   "relocating_shards": 0,
   "initializing_shards": 0,
   "unassigned_shards": 51,
   "delayed_unassigned_shards": 0,
   "number_of_pending_tasks": 0,
   "number_of_in_flight_fetch": 0,
   "task_max_waiting_in_queue_millis": 0,
   "active_shards_percent_as_number": 95.39295392953929
}

Well, I think you are putting too much pressure on your cluster here, so GC can't keep up? Maybe don't send as many bulk requests as you are doing today?
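
If you want to watch whether bulk requests are being rejected while you throttle the client, the bulk thread pool exposes those counters, e.g.:

curl -XGET 'localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected'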

Our service really does have that much data to upload, and the ES 2.1 cluster handles it just fine now. I can't decrease the amount of data, but I still want to upgrade the ES version.

just to verify:

  • all nodes have identical hardware to the 2.1 cluster?
  • same JVMs, same settings on all the nodes?
  • refresh intervals are the same?
  • you are using the same durability settings, i.e. no async commit on the 2.1 cluster?
  • can you provide the GET _nodes/stats/jvm,thread_pool?pretty output from the 2.1 cluster as well?
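
That is, against any node of the 2.1 cluster, something like:

curl -XGET 'localhost:9200/_nodes/stats/jvm,thread_pool?pretty'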

thanks