Did I hit the upper limit of the /bulk upload API?

After upgrading from v2.1.0 to v5.0.0, it seems very easy to hit an OOM error when using the /bulk API to upload data. A few minutes after my service starts, I receive this: '[gc][91065] overhead, spent [16.5s] collecting in the last [16.5s]
java.lang.OutOfMemoryError: Java heap space'. What is the upper limit of the /bulk API?
I have given 32 GB to ES, and my total memory size is 90 GB. Part of my elasticsearch.yml is:

script.engine.groovy.inline.aggs: on

indices.fielddata.cache.size: 40%
indices.breaker.fielddata.limit: 60%
indices.breaker.request.limit: 40%
indices.breaker.total.limit: 70%
thread_pool.bulk.queue_size: 5000

However, I have about 1 billion items (documents) to upload every day. Any advice?

You are storing up to 5000 bulk requests in memory here?
I'm guessing you are consuming too much memory.

Don't change pool settings unless you really know what you are doing.

What is your hardware? What are your index settings?
How many nodes?

**I have 25 nodes; each machine has 100 GB of memory and a 2 TB SSD.**
The index _settings API returns:

"mobile_1234648_20161119": {
"settings": {
"index": {
"creation_date": "1479537360203",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "38jRv1wwTx2JTsohl7EnJQ",
"version": {
"created": "5000099"
},
"provided_name": "mobile_1234648_20161119"
}
}
}

However, there are many fields of type keyword in the mapping, which were string in my template. Below is the thread_pool configuration, if it helps:

"ZntwknM8Tn-J3rLGiNXA6A": {
"name": "es-nmg02-vs-image160",
"transport_address": "10.107.29.28:9300",
"host": "10.107.29.28",
"ip": "10.107.29.28",
"version": "5.0.0",
"build_hash": "253032b",
"roles": [
"master",
"data",
"ingest"
],
"thread_pool": {
"watcher": {
"type": "fixed",
"min": 60,
"max": 60,
"queue_size": 1000
},
"force_merge": {
"type": "fixed",
"min": 1,
"max": 1,
"queue_size": -1
},
"fetch_shard_started": {
"type": "scaling",
"min": 1,
"max": 24,
"keep_alive": "5m",
"queue_size": -1
},
"listener": {
"type": "fixed",
"min": 6,
"max": 6,
"queue_size": -1
},
"index": {
> "type": "fixed",
> "min": 12,
> "max": 12,
> "queue_size": 200
> },
"refresh": {
"type": "scaling",
"min": 1,
"max": 6,
"keep_alive": "5m",
"queue_size": -1
},
"generic": {
"type": "scaling",
"min": 4,
"max": 128,
"keep_alive": "30s",
"queue_size": -1
},
"warmer": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"search": {
"type": "fixed",
"min": 19,
"max": 19,
"queue_size": 1000
},
"flush": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"fetch_shard_store": {
"type": "scaling",
"min": 1,
"max": 24,
"keep_alive": "5m",
"queue_size": -1
},
"management": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
},
"get": {
"type": "fixed",
"min": 12,
"max": 12,
"queue_size": 1000
},
"bulk": {
> "type": "fixed",
> "min": 12,
> "max": 12,
> "queue_size": 5000
> },
"snapshot": {
"type": "scaling",
"min": 1,
"max": 5,
"keep_alive": "5m",
"queue_size": -1
}
}
}

What is the indexing rate on a single empty node with 1 shard and 0 replicas? How many docs can you hold in this single index before the response time becomes unacceptable?

Please do the tests with all default unmodified settings.

What is the number of docs you need to index per second?

The next test is IMO to increase the number of shards (up to 25 as you have 25 data nodes) and see how it goes.
Also try to play with your bulk size (on the client).

Check that your tool which is generating data is fast enough as well.
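
For the bulk-size suggestion, here is a minimal sketch of the kind of client-side test I mean, assuming the official Python client (elasticsearch-py); the host, index name, and document generator are placeholders:

    # Sketch only: tune chunk_size / max_chunk_bytes on the client and measure docs/s.
    # Host, index name, and document generator below are placeholders.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import streaming_bulk

    es = Elasticsearch(["localhost:9200"])  # the single test node

    def gen_docs(n):
        for i in range(n):
            yield {"_index": "bulk_test", "_type": "doc", "_source": {"value": i}}

    indexed = 0
    for ok, _ in streaming_bulk(es, gen_docs(1000000),
                                chunk_size=500,                      # docs per bulk request
                                max_chunk_bytes=10 * 1024 * 1024):   # cap each request at ~10 MB
        indexed += ok
    print("indexed %d docs" % indexed)

Run it a few times with different chunk_size values (e.g. 500, 1000, 5000) and keep whatever gives the best sustained rate without rejections.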

You mean that I have to figure out the limit of one node, and then figure out the total limit by testing? To see whether I should add more nodes, or something else?

I am also confused about how indexing is routed. When I post data to one of the nodes, that node hits an OOM later, but the other nodes are all fine. Will ES reroute or do something to balance the index requests? Or is there a best practice for keeping them balanced?

I have a few questions:

1. How large are your documents?
2. How large are your bulk requests?
3. How many concurrent indexing threads/connections do you have?
4. Are you distributing bulk requests evenly across all data nodes?
5. What indexing throughput are you seeing, especially if you set the bulk indexing queue size back to its default value?
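
On question 4, a minimal sketch of spreading requests over several nodes, assuming elasticsearch-py (the host names are placeholders):

    # Sketch only: give the client several (or all) data node addresses so bulk
    # requests are spread round-robin instead of always hitting the same node.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        ["data-node-1:9200", "data-node-2:9200", "data-node-3:9200"],
        sniff_on_start=True,            # discover the other nodes in the cluster
        sniff_on_connection_fail=True,  # stop sending to nodes that fail
        sniffer_timeout=60,             # refresh the node list every 60s
    )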

Yes.

You still have OOM after removing your threadpool settings?

You are trying to solve a problem here, but going down the wrong path.
I mean that you apparently have an indexing-rate problem, so you increased the bulk queue size. This is the wrong solution IMO.

The ES service was developed by other people, and it is already online anyway (I have set the bulk indexing queue back to the default). The answers to your questions are:
1. Bulk request size: 500
2. On average, the concurrency is 1/s, but the peak value depends on how much data is posted by user clients
3. The requests are not distributed across all data nodes
4. The index rate is about 20,000 per second according to Kibana monitoring
5. The document size is about 20kb

Yes, still OOM after removing the thread_pool settings. You are right that increasing the queue size is not the solution, but I wonder what the results of the official performance tests are.

It's said that ES 5.0 has greatly improved performance, so how can the bulk requests crash the service when the same workload works fine with ES 2.1.0?

If I'm not mistaken, you have:

  • document size: 20kb
  • bulk size: 500, so 500 * 20kb = 10mb
  • bulk queue: 5000, so 5000 * 10mb = 50gb

So again, my guess is that you are hitting OOM issues because of that.
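
To make that arithmetic explicit, a back-of-the-envelope worst case using the numbers from this thread:

    # Worst-case memory held by queued bulk requests, numbers from this thread.
    doc_size_kb = 20
    docs_per_bulk = 500
    queue_size = 5000

    bulk_request_mb = doc_size_kb * docs_per_bulk / 1024.0  # ~9.8 MB per bulk request
    full_queue_gb = bulk_request_mb * queue_size / 1024.0   # ~47.7 GB if the queue fills up
    print("per request: %.1f MB, full queue: %.1f GB" % (bulk_request_mb, full_queue_gb))
    # A full queue alone would not fit in the 32 GB heap.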

I have tried a bulk queue size of 1000, and it's the same problem. The JVM keeps collecting garbage, and after it logs "[gc][392] overhead, spent [251ms] collecting in the last [1s]" for a while, an OOM occurs. I think the requests are too frequent for my service, but it works fine when I change the upload URL to ES 2.1.0.
Should I decrease the request frequency? There must be some difference between v2.1 and v5.0 here.

And why not keep the default value for the queue size?

I mean, can you use a default, untouched version of Elasticsearch 5.0, send the bulk requests as you did previously, and see what happens?

If you have bulk rejections, then let's work on fixing that instead of increasing the queue size.
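
For example, something along these lines on the client (a sketch only, assuming elasticsearch-py; the host is a placeholder and `actions` is whatever your code already builds): resend only the items that come back with HTTP status 429, after a backoff, instead of growing the server-side queue.

    # Sketch: retry only the bulk items rejected with HTTP 429 instead of
    # enlarging the server-side queue. `actions` is a list of
    # (action_json_line, source_json_line) string pairs.
    import time
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["data-node-1:9200"])  # placeholder host

    def bulk_with_backoff(actions, max_retries=5):
        for attempt in range(max_retries + 1):
            body = "".join(a + "\n" + s + "\n" for a, s in actions)
            resp = es.bulk(body=body)
            # items come back in the same order they were sent
            rejected = [pair for pair, item in zip(actions, resp["items"])
                        if list(item.values())[0].get("status") == 429]
            if not rejected:
                return
            actions = rejected
            time.sleep(2 ** attempt)  # exponential backoff before resending
        raise RuntimeError("bulk items still rejected after %d retries" % max_retries)

Recent versions of the elasticsearch-py bulk helpers can do a similar retry for you (see the max_retries / initial_backoff options of streaming_bulk), if you prefer not to hand-roll it.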

Yes, I did as you said: I kept the default queue_size and the OOM is still there. However, I checked the circuit breakers and found this:

"breakers": {
"request": {
"limit_size_in_bytes": 13709003980,
"limit_size": "12.7gb",
"estimated_size_in_bytes": 84059360,
"estimated_size": "80.1mb",
"overhead": 1,
"tripped": 0
},
"fielddata": {
"limit_size_in_bytes": 20563505971,
"limit_size": "19.1gb",
"estimated_size_in_bytes": 25376,
"estimated_size": "24.7kb",
"overhead": 1.03,
"tripped": 0
},
"in_flight_requests": {
"limit_size_in_bytes": 34272509952,
"limit_size": "31.9gb",
"estimated_size_in_bytes": 0,
"estimated_size": "0b",
"overhead": 1,
"tripped": 0
},
"parent": {
"limit_size_in_bytes": 23990756966,
"limit_size": "22.3gb",
"estimated_size_in_bytes": 84084736,
"estimated_size": "80.1mb",
"overhead": 1,
"tripped": 0
}
}

1. Does it mean the breakers consume too much memory, since the limits add up to 86 GB = 12.7 + 19.1 + 31.9 + 22.3 GB? Also, there is no in_flight_requests breaker in ES 2.1.0.
2. It just keeps GCing like below, and sooner or later it hits an OOM:

[2016-11-23T10:20:13,983][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][332] overhead, spent [265ms] collecting in the last [1s]

Here is the log I got with the default queue_size:

[2016-11-23T12:50:36,222][INFO ][o.e.h.HttpServer ] [es-nmg02-jpaas212] publish_address {10.107.49.21:8201}, bound_addresses {10.107.49.21:8201}
[2016-11-23T12:50:36,223][INFO ][o.e.n.Node ] [es-nmg02-jpaas212] started
[2016-11-23T12:50:38,750][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][7] overhead, spent [340ms] collecting in the last [1s]
[2016-11-23T12:50:40,789][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][9] overhead, spent [305ms] collecting in the last [1s]
[2016-11-23T12:51:50,672][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][78] overhead, spent [298ms] collecting in the last [1s]
[2016-11-23T12:52:05,616][INFO ][o.e.c.s.ClusterService ] [es-nmg02-jpaas212] removed {{es-nmg02-mco-bu-cloud-storage8}{9FS_jJV5TyeFu3PVSOMa4w}{h6MXXwu0Tni-TC_166ki_g}{10.107.133.24}{10.107.133.24:9300},}, reason: zen-disco-receive(from master [master {es-nmg02-ecom-jpaas049}{wX-6uJa2TeOo3-7_a9CcXg}{93NMOOgERdO8Soai8-2uRQ}{10.107.26.24}{10.107.26.24:9300} committed version [427]])
[2016-11-23T12:52:06,789][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][94] overhead, spent [282ms] collecting in the last [1s]
[2016-11-23T12:52:10,342][INFO ][o.e.c.s.ClusterService ] [es-nmg02-jpaas212] added {{es-nmg02-mco-bu-cloud-storage8}{9FS_jJV5TyeFu3PVSOMa4w}{h6MXXwu0Tni-TC_166ki_g}{10.107.133.24}{10.107.133.24:9300},}, reason: zen-disco-receive(from master [master {es-nmg02-ecom-jpaas049}{wX-6uJa2TeOo3-7_a9CcXg}{93NMOOgERdO8Soai8-2uRQ}{10.107.26.24}{10.107.26.24:9300} committed version [431]])
[2016-11-23T12:53:19,692][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][166] overhead, spent [288ms] collecting in the last [1.1s]
[2016-11-23T12:53:42,929][WARN ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][181] overhead, spent [8.6s] collecting in the last [9.1s]
[2016-11-23T12:53:43,930][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][182] overhead, spent [351ms] collecting in the last [1s]
[2016-11-23T12:54:01,352][WARN ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][191] overhead, spent [8.5s] collecting in the last [9.4s]
[2016-11-23T12:54:01,459][ERROR][o.e.x.m.AgentService ] [es-nmg02-jpaas212] exception when exporting documents
org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:148) ~[x-pack-5.0.0.jar:5.0.0]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.close(ExportBulk.java:77) ~[x-pack-5.0.0.jar:5.0.0]
at org.elasticsearch.xpack.monitoring.exporter.Exporters.export(Exporters.java:194) ~[x-pack-5.0.0.jar:5.0.0]
at org.elasticsearch.xpack.monitoring.AgentService$ExportingWorker.run(AgentService.java:208) [x-pack-5.0.0.jar:5.0.0]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_45]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulk [default_local]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:114) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:62) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:145) ~[?:?]
... 4 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:121) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.doFlush(LocalBulk.java:111) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk.flush(ExportBulk.java:62) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.doFlush(ExportBulk.java:145) ~[?:?]
... 4 more
[2016-11-23T12:54:10,491][INFO ][o.e.m.j.JvmGcMonitorService] [es-nmg02-jpaas212] [gc][old][192][21] duration [8.8s], collections [1]/[9.1s]
......

And again, it will hit an OOM later.
Btw, according to the v5.0 reference, the default queue_size is 50, and I have confirmed that. It is too small IMO.

Can you run:

GET _nodes/stats/jvm,thread_pool?pretty

And please format your code using the </> icon. It will make your post more readable.
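
For reference, a quick way to pull the interesting numbers (heap usage and bulk rejections per node) out of that response, assuming the elasticsearch-py client (the host is a placeholder):

    # Summarise heap usage and bulk thread_pool rejections per node from
    # GET _nodes/stats/jvm,thread_pool.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["data-node-1:9200"])  # placeholder host
    stats = es.nodes.stats(metric="jvm,thread_pool")

    for node_id, node in stats["nodes"].items():
        bulk = node["thread_pool"]["bulk"]
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        print("%s heap_used_percent=%s bulk.rejected=%s bulk.queue=%s"
              % (node["name"], heap_pct, bulk["rejected"], bulk["queue"]))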

"8m1coAQ8TrKSUWOsQExSgg": {
         "timestamp": 1479901823669,
         "name": "es-nmg02",
         "transport_address": "10.107.49.59:9300",
         "host": "10.107.49.59",
         "ip": "10.107.49.59:9300",
         "roles": [
            "master",
            "data",
            "ingest"
         ],
         "jvm": {
            "timestamp": 1479901823669,
            "uptime_in_millis": 4891493,
            "mem": {
               "heap_used_in_bytes": 1389320536,
               "heap_used_percent": 4,
               "heap_committed_in_bytes": 34272509952,
               "heap_max_in_bytes": 34272509952,
               "non_heap_used_in_bytes": 106952112,
               "non_heap_committed_in_bytes": 114794496,
               "pools": {
                  "young": {
                     "used_in_bytes": 194574376,
                     "max_in_bytes": 697958400,
                     "peak_used_in_bytes": 697958400,
                     "peak_max_in_bytes": 697958400
                  },
                  "survivor": {
                     "used_in_bytes": 59550832,
                     "max_in_bytes": 87228416,
                     "peak_used_in_bytes": 87228416,
                     "peak_max_in_bytes": 87228416
                  },
                  "old": {
                     "used_in_bytes": 1135195328,
                     "max_in_bytes": 33487323136,
                     "peak_used_in_bytes": 1135195328,
                     "peak_max_in_bytes": 33487323136
                  }
               }
            },
            "threads": {
               "count": 102,
               "peak_count": 155
            },
            "gc": {
               "collectors": {
                  "young": {
                     "collection_count": 382,
                     "collection_time_in_millis": 30488
                  },
                  "old": {
                     "collection_count": 1,
                     "collection_time_in_millis": 125
                  }
               }
            },
            "buffer_pools": {
               "direct": {
                  "count": 173,
                  "used_in_bytes": 428336975,
                  "total_capacity_in_bytes": 428336974
               },
               "mapped": {
                  "count": 984,
                  "used_in_bytes": 47922081250,
                  "total_capacity_in_bytes": 47922081250
               }
            },
            "classes": {
               "current_loaded_count": 11741,
               "total_loaded_count": 11741,
               "total_unloaded_count": 0
            }
         },
"thread_pool": {
            "bulk": {
               "threads": 12,
               "queue": 0,
               "active": 1,
               "rejected": 816,
               "largest": 12,
               "completed": 41086
            },
            "fetch_shard_started": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 24,
               "completed": 518
            },
            "fetch_shard_store": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 24,
               "completed": 469
            },
            "flush": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 5,
               "completed": 82
            },
            "force_merge": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "generic": {
               "threads": 13,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 13,
               "completed": 14142
            },
            "get": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "index": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "listener": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "management": {
               "threads": 3,
               "queue": 0,
               "active": 1,
               "rejected": 0,
               "largest": 3,
               "completed": 5586
            },
            "refresh": {
               "threads": 6,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 6,
               "completed": 232032
            },
            "search": {
               "threads": 1,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 1,
               "completed": 1
            },
            "snapshot": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            },
            "warmer": {
               "threads": 2,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 5,
               "completed": 1570
            },
            "watcher": {
               "threads": 0,
               "queue": 0,
               "active": 0,
               "rejected": 0,
               "largest": 0,
               "completed": 0
            }
         }
      }

Above is the JVM and thread_pool result from one node of the cluster. So many rejected requests!