Did I hit the upper limit of the /bulk upload API?

After upgrading from v2.1.0 to v5.0.0, it seems very easy to hit an OOM error when using the /bulk API to upload data. A few minutes after my service starts, I receive this: '[gc][91065] overhead, spent [16.5s] collecting in the last [16.5s]
java.lang.OutOfMemoryError: Java heap space'. What is the upper limit of the /bulk API?
I have given 32 GB to ES, and my total memory size is 90 GB. Part of my elasticsearch.yml is:

script.engine.groovy.inline.aggs: on

indices.fielddata.cache.size: 40%
indices.breaker.fielddata.limit: 60%
indices.breaker.request.limit: 40%
indices.breaker.total.limit: 70%
thread_pool.bulk.queue_size: 5000

However, I have about 1 billion items (documents) to upload every day. Any advice?

You are storing up to 5000 bulk requests in memory here?
I'm guessing you are consuming too much memory.

Don't change pool settings unless you really know what you are doing.

What is your hardware? What are your index settings?
How many nodes?

**I have 25 nodes; each machine has 100 GB of memory and a 2 TB SSD.**
The index _settings API returns:

"mobile_1234648_20161119": {
"settings": {
"index": {
"creation_date": "1479537360203",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "38jRv1wwTx2JTsohl7EnJQ",
"version": {
"created": "5000099"
},
"provided_name": "mobile_1234648_20161119"
}
}
}

However, there are many fields of type keyword in the mapping, which were string in my template. Below is the thread_pool configuration, if it helps:

"ZntwknM8Tn-J3rLGiNXA6A": {
"name": "es-nmg02-vs-image160",
"transport_address": "10.107.29.28:9300",
"host": "10.107.29.28",
"ip": "10.107.29.28",
"version": "5.0.0",
"build_hash": "253032b",
"roles": [
"master",
"data",
"ingest"
],
"thread_pool": {
"watcher": {
"type": "fixed",
"min": 60,
"max": 60,
"queue_size": 1000
},
"force_merge": {
"type": "fixed",
"min": 1,
"max": 1,
"queue_size": -1
},
"fetch_shard_started": {
"type": "scaling",
"min": 1,
"max": 24,
"keep_alive": "5m",
"queue_size": -1
},
"listener": {
"type": "fixed",
"min": 6,
"max": 6,
"queue_size": -1
},
"index": {
> "type": "fixed",
> "min": 12,
> "max": 12,
> "queue_size": 200
> },
"refresh": {
"type": "scaling",
"min": 1,
"max": 6,
"keep_alive": "5m",
"queue_size": -1
},
"generic": {
"type": "scaling",
"min": 4,
"max": 128,
"keep_alive": "30s",
"queue_size": -1
},
"warmer": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"search": {
"type": "fixed",
"min": 19,
"max": 19,
"queue_size": 1000
},
"flush": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"fetch_shard_store": {
"type": "scaling",
"min": 1,
"max": 24,
"keep_alive": "5m",
"queue_size": -1
},
"management": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"get": {
"type": "fixed",
"min": 12,
"max": 12,
"queue_size": 1000
},
"bulk": {
> "type": "fixed",
> "min": 12,
> "max": 12,
> "queue_size": 5000
> },
"snapshot": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
}
}
}

What is the indexing rate on a single empty node with 1 shard and 0 replicas? How many docs can you hold in this single index before the response time becomes unacceptable?

Please do the tests with all default unmodified settings.

What is the number of docs you need to index per second?

The next test is IMO to increase the number of shards (up to 25 as you have 25 data nodes) and see how it goes.
Also try to play with your bulk size (on the client).

Check that your tool which is generating data is fast enough as well.
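
For the bulk-size suggestion, here is a minimal sketch of the kind of client-side test I mean, assuming the official Python client (elasticsearch-py); the host, index name, and document generator are placeholders:

    # Sketch only: tune chunk_size / max_chunk_bytes on the client and measure docs/s.
    # Host, index name, and document generator below are placeholders.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import streaming_bulk

    es = Elasticsearch(["localhost:9200"])  # the single test node

    def gen_docs(n):
        for i in range(n):
            yield {"_index": "bulk_test", "_type": "doc", "_source": {"value": i}}

    indexed = 0
    for ok, _ in streaming_bulk(es, gen_docs(1000000),
                                chunk_size=500,                      # docs per bulk request
                                max_chunk_bytes=10 * 1024 * 1024):   # cap each request at ~10 MB
        indexed += ok
    print("indexed %d docs" % indexed)

Run it a few times with different chunk_size values (e.g. 500, 1000, 5000) and keep whatever gives the best sustained rate without rejections.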

You mean that I have to figure out the limit of one node, and then figure out the total limit by testing? To see whether I should add more nodes, or something else?

I am also confused about how indexing is routed. When I post data to one of the nodes, that node hits an OOM later, but the other nodes are all fine. Will ES reroute or do something to balance the index requests? Or is there a best practice for keeping them balanced?

I have a few questions:

1. How large are your documents?
2. How large are your bulk requests?
3. How many concurrent indexing threads/connections do you have?
4. Are you distributing bulk requests evenly across all data nodes?
5. What indexing throughput are you seeing, especially if you set the bulk indexing queue size back to its default value?
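
On question 4, a minimal sketch of spreading requests over several nodes, assuming elasticsearch-py (the host names are placeholders):

    # Sketch only: give the client several (or all) data node addresses so bulk
    # requests are spread round-robin instead of always hitting the same node.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        ["data-node-1:9200", "data-node-2:9200", "data-node-3:9200"],
        sniff_on_start=True,            # discover the other nodes in the cluster
        sniff_on_connection_fail=True,  # stop sending to nodes that fail
        sniffer_timeout=60,             # refresh the node list every 60s
    )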

Yes.

You still have OOM after removing your threadpool settings?

You are trying to solve a problem here, but going down the wrong path.
I mean that you apparently have an indexing-rate problem, so you increased the bulk queue size. This is the wrong solution IMO.

The ES service was developed by other people, and it is already online anyway (I have set the bulk indexing queue back to the default). The answers to your questions are:
1. Bulk request size: 500
2. On average, the concurrency is 1/s, but the peak value depends on how much data is posted by user clients
3. The requests are not distributed across all data nodes
4. The index rate is about 20,000 per second according to Kibana monitoring
5. The document size is about 20kb

Yes, still OOM after removing the thread_pool settings. You are right that increasing the queue size is not the solution, but I wonder what the results of the official performance tests are.

It's said that ES 5.0 has greatly improved performance, so how can the bulk requests crash the service when the same workload works fine with ES 2.1.0?

If I'm not mistaken, you have:

  • document size: 20kb
  • bulk size: 500, so 500 * 20kb = 10mb
  • bulk queue: 5000, so 5000 * 10mb = 50gb

So again, my guess is that you are hitting OOM issues because of that.
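
To make that arithmetic explicit, a back-of-the-envelope worst case using the numbers from this thread:

    # Worst-case memory held by queued bulk requests, numbers from this thread.
    doc_size_kb = 20
    docs_per_bulk = 500
    queue_size = 5000

    bulk_request_mb = doc_size_kb * docs_per_bulk / 1024.0  # ~9.8 MB per bulk request
    full_queue_gb = bulk_request_mb * queue_size / 1024.0   # ~47.7 GB if the queue fills up
    print("per request: %.1f MB, full queue: %.1f GB" % (bulk_request_mb, full_queue_gb))
    # A full queue alone would not fit in the 32 GB heap.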

I have tried a bulk queue size of 1000, and it's the same problem. The JVM keeps collecting garbage, and after it logs "[gc][392] overhead, spent [251ms] collecting in the last [1s]" for a while, an OOM occurs. I think the requests are too frequent for my service, but it works fine when I change the upload URL to ES 2.1.0.
Should I decrease the request frequency? There must be some difference between v2.1 and v5.0 here.

And why not keep the default value for the queue size?

I mean, can you use a default, untouched version of Elasticsearch 5.0, send the bulk requests as you did previously, and see what happens?

If you have bulk rejections, then let's work on fixing that instead of increasing the queue size.
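
For example, something along these lines on the client (a sketch only, assuming elasticsearch-py; the host is a placeholder and `actions` is whatever your code already builds): resend only the items that come back with HTTP status 429, after a backoff, instead of growing the server-side queue.

    # Sketch: retry only the bulk items rejected with HTTP 429 instead of
    # enlarging the server-side queue. `actions` is a list of
    # (action_json_line, source_json_line) string pairs.
    import time
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["data-node-1:9200"])  # placeholder host

    def bulk_with_backoff(actions, max_retries=5):
        for attempt in range(max_retries + 1):
            body = "".join(a + "\n" + s + "\n" for a, s in actions)
            resp = es.bulk(body=body)
            # items come back in the same order they were sent
            rejected = [pair for pair, item in zip(actions, resp["items"])
                        if list(item.values())[0].get("status") == 429]
            if not rejected:
                return
            actions = rejected
            time.sleep(2 ** attempt)  # exponential backoff before resending
        raise RuntimeError("bulk items still rejected after %d retries" % max_retries)

Recent versions of the elasticsearch-py bulk helpers can do a similar retry for you (see the max_retries / initial_backoff options of streaming_bulk), if you prefer not to hand-roll it.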

Yes, I did as you said: I kept the default queue_size and the OOM is still there. However, I checked the circuit breakers and found this:

"breakers": {
"request": {
"limit_size_in_bytes": 13709003980,
"limit_size": "12.7gb",
"estimated_size_in_bytes": 84059360,
"estimated_size": "80.1mb",
"overhead": 1,
"tripped": 0
},
"fielddata": {
"limit_size_in_bytes": 20563505971,
"limit_size": "19.1gb",
"estimated_size_in_bytes": 25376,
"estimated_size": "24.7kb",
"overhead": 1.03,
"tripped": 0
},
"in_flight_requests": {
"limit_size_in_bytes": 34272509952,
"limit_size": "31.9gb",
"estimated_size_in_bytes": 0,
"estimated_size": "0b",
"overhead": 1,
"tripped": 0
},
"parent": {
"limit_size_in_bytes": 23990756966,
"limit_size": "22.3gb",
"estimated_size_in_bytes": 84084736,
"estimated_size": "80.1mb",
"overhead": 1,
"tripped": 0
}
}

1. Does it mean the breakers consume too much memory, since the limits add up to 86 GB = 12.7 + 19.1 + 31.9 + 22.3 GB? Also, there is no in_flight_requests breaker in ES 2.1.0.
2. It just keeps GCing like below, and sooner or later it hits an OOM:

[2016-11-23T10:20:13,983][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][332] overhead, spent [265ms] collecting in the last [1s]

Here is the log I got with the default queue_size:

[2016-11-23T12:50:36,222][INFO ][o.e.h.HttpServer ] [es-nmg02-jpaas212] publish_address {10.107.49.21:8201}, bound_addresses {10.107.49.21:8201}
[2016-11-23T12:50:36,223][INFO ][o.e.n.Node ] [es-nmg02-jpaas212] started
[2016-11-23T12:50:38,750][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][7] overhead, spent [340ms] collecting in the last [1s]
[2016-11-23T12:50:40,789][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][9] overhead, spent [305ms] collecting in the last [1s]
[2016-11-23T12:51:50,672][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][78] overhead, spent [298ms] collecting in the last [1s]
[2016-11-23T12:52:05,616][INFO ][o.e.c.s.ClusterService ] [es-nmg02-jpaas212] removed {{es-nmg02-mco-bu-cloud-storage8}{9FS_jJV5TyeFu3PVSOMa4w}{h6MXXwu0Tni-TC_166ki_g}{10.107.133.24}{10.107.133.24:9300},}, reason: zen-disco-receive(from master [master {es-nmg02-ecom-jpaas049}{wX-6uJa2TeOo3-7_a9CcXg}{93NMOOgERdO8Soai8-2uRQ}{10.107.26.24}{10.107.26.24:9300} committed version [427]])
[2016-11-23T12:52:06,789][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][94] overhead, spent [282ms] collecting in the last [1s]
[2016-11-23T12:52:10,342][INFO ][o.e.c.s.ClusterService ] [es-nmg02-jpaas212] added {{es-nmg02-mco-bu-cloud-storage8}{9FS_jJV5TyeFu3PVSOMa4w}{h6MXXwu0Tni-TC_166ki_g}{10.107.133.24}{10.107.133.24:9300},}, reason: zen-disco-receive(from master [master {es-nmg02-ecom-jpaas049}{wX-6uJa2TeOo3-7_a9CcXg}{93NMOOgERdO8Soai8-2uRQ}{10.107.26.24}{10.107.26.24:9300} committed version [431]])
[2016-11-23T12:53:19,692][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][166] overhead, spent [288ms] collecting in the last [1.1s]
[2016-11-23T12:53:42,929][WARN ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][181] overhead, spent [8.6s] collecting in the last [9.1s]
[2016-11-23T12:53:43,930][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][182] overhead, spent [351ms] collecting in the last [1s]
[2016-11-23T12:54:01,352][WARN ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][191] overhead, spent [8.5s] collecting in the last [9.4s]
[2016-11-23T12:54:01,459][ERROR][o.e.x.m.AgentService ] [es-nmg02-jpaas212] exception when exporting documents
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:148) ~[x-pack-5.0.0.jar:5.0.0]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.close(ExportBulk.java:77) ~[x-pack-5.0.0.jar:5.0.0]
at org.elasticsearch.xpack.monitoring.exporter.Exporters.export(Exporters.java:194) ~[x-pack-5.0.0.jar:5.0.0]
at org.elasticsearch.xpack.monitoring.AgentService$ExportingWorker.run(AgentService.java:208) [x-pack-5.0.0.jar:5.0.0]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:114) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:62) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:145) ~[?:?]
... 4 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:121) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:111) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:62) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:145) ~[?:?]
... 4 more
[2016-11-23T12:54:10,491][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][old][192][21] duration [8.8s], collections [1]/[9.1s]
......

And again, it will hit an OOM later.
Btw, according to the v5.0 reference, the default queue_size is 50, and I have confirmed that. It is too small IMO.

Can you run:

GET _nodes/stats/jvm,thread_pool?pretty

And please format your code using the </> icon. It will make your post more readable.
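
For reference, a quick way to pull the interesting numbers (heap usage and bulk rejections per node) out of that response, assuming the elasticsearch-py client (the host is a placeholder):

    # Summarise heap usage and bulk thread_pool rejections per node from
    # GET _nodes/stats/jvm,thread_pool.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["data-node-1:9200"])  # placeholder host
    stats = es.nodes.stats(metric="jvm,thread_pool")

    for node_id, node in stats["nodes"].items():
        bulk = node["thread_pool"]["bulk"]
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print("%s heap_used_percent=%s bulk.rejected=%s bulk.queue=%s"
              % (node["name"], heap_pct, bulk["rejected"], bulk["queue"]))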

"8m1coAQ8TrKSUWOsQExSgg": {
         "timestamp": 1479901823669,
         "name": "es-nmg02",
         "transport_address": "10.107.49.59:9300",
         "host": "10.107.49.59",
         "ip": "10.107.49.59:9300",
         "roles": [
            "master",
            "data",
            "ingest"
         ],
         "jvm": {
            "timestamp": 1479901823669,
            "uptime_in_millis": 4891493,
            "mem": {
               "heap_used_in_bytes": 1389320536,
               "heap_used_percent": 4,
               "heap_committed_in_bytes": 34272509952,
               "heap_max_in_bytes": 34272509952,
               "non_heap_used_in_bytes": 106952112,
               "non_heap_committed_in_bytes": 114794496,
               "pools": {
                  "young": {
                     "used_in_bytes": 194574376,
                     "max_in_bytes": 697958400,
                     "peak_used_in_bytes": 697958400,
                     "peak_max_in_bytes": 697958400
                  },
                  "survivor": {
                     "used_in_bytes": 59550832,
                     "max_in_bytes": 87228416,
                     "peak_used_in_bytes": 87228416,
                     "peak_max_in_bytes": 87228416
                  },
                  "old": {
                     "used_in_bytes": 1135195328,
                     "max_in_bytes": 33487323136,
                     "peak_used_in_bytes": 1135195328,
                     "peak_max_in_bytes": 33487323136
                  }
               }
            },
            "threads": {
               "count": 102,
               "peak_count": 155
            },
            "gc": {
               "collectors": {
                  "young": {
                     "collection_count": 382,
                     "collection_time_in_millis": 30488
                  },
                  "old": {
                     "collection_count": 1,
                     "collection_time_in_millis": 125
                  }
               }
            },
            "buffer_pools": {
               "direct": {
                  "count": 173,
                  "used_in_bytes": 428336975,
                  "total_capacity_in_bytes": 428336974
               },
               "mapped": {
                  "count": 984,
                  "used_in_bytes": 47922081250,
                  "total_capacity_in_bytes": 47922081250
               }
            },
            "classes": {
               "current_loaded_count": 11741,
               "total_loaded_count": 11741,
               "total_unloaded_count": 0
            }
         },
"thread_pool": {
            "bulk": {
               "threads": 12,
               "queue": 0,
               "active": 1,
               "rejected": 816,
               "largest": 12,
               "completed": 41086
            },
            "fetch_shard_started": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 24,
               "completed": 518
            },
            "fetch_shard_store": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 24,
               "completed": 469
            },
            "flush": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 5,
               "completed": 82
            },
            "force_merge": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "generic": {
               "threads": 13,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 13,
               "completed": 14142
            },
            "get": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "index": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "listener": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "management": {
               "threads": 3,
               "queue": 0,
               "active": 1,
               "rejected": 0,
               "largest": 3,
               "completed": 5586
            },
            "refresh": {
               "threads": 6,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 6,
               "completed": 232032
            },
            "search": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 1,
               "completed": 1
            },
            "snapshot": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "warmer": {
               "threads": 2,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 5,
               "completed": 1570
            },
            "watcher": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            }
         }
      }

Above is the JVM and thread_pool result from one node of the cluster. So many rejected requests!