I get the following exception while trying to insert data using the ES-Hive Hadoop jar. I am currently inserting around 60 million records.
Following is the error I get:

```
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [31820340712/29.6gb], which is larger than the limit of [31621696716/29.4gb], real usage: [31818275608/29.6gb], new bytes reserved: [2065104/1.9mb], usages [inflight_requests=105523462/100.6mb, request=0/0b, fielddata=0/0b, eql_sequence=0/0b, model_inference=0/0b]
```
Elasticsearch.log shows only the following:
```
[2022-05-25T04:13:01,315][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [] GC did bring memory usage down, before [31636006904], after [30392212248], allocations [19], duration [137]
[2022-05-25T04:13:04,065][INFO ][o.e.m.j.JvmGcMonitorService] [] [gc][159765] overhead, spent [300ms] collecting in the last [1s]
[2022-05-25T04:13:06,886][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [] attempting to trigger G1GC due to high heap usage [32059682216]
[2022-05-25T04:13:07,061][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [] GC did bring memory usage down, before [32059682216], after [30969353816], allocations [58], duration [175]
[2022-05-25T04:13:12,337][INFO ][o.e.i.b.HierarchyCircuitBreakerService] [] attempting to trigger G1GC due to high heap usage [31690774104]
```
1. My shard allocation is 1.
2. Replica is 1.
3. The JVM heap allocated is already at the maximum, i.e. 31 GB.
Stats below:
```
name   id  node.role  heap.current  heap.percent  heap.max
xxxxx  xx  xxx        27.5gb        88            31gb
```
What can I do to fix this other than adding another node to the cluster?
But I am using the Hive Hadoop jar, which uses the bulk API internally; can you please help me understand how I can reduce the size of the index requests in that case?
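For context, the load is basically an INSERT from a Hive source table into an external table backed by the ES-Hadoop storage handler, roughly like the sketch below (table and column names are placeholders, not my actual schema):

```sql
-- Placeholder names; the real job moves ~60 million rows from a Hive-managed
-- source table into an external table stored via the EsStorageHandler.
INSERT OVERWRITE TABLE my_es_table
SELECT id, name
FROM my_hive_source_table;
```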
Also, when I move the data from Hive, its size on disk is ~35 GB, but once it is in Elasticsearch the index shows a disk size of 500 GB. Why is this happening? Is this expected from the conversion?
Thanks!! Any idea on this?
I added a new node to the ES cluster with 3 TB of space, but I am still stuck on the error:
```
Caused by: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: org.elasticsearch.hadoop.rest.EsHadoopRemoteException: circuit_breaking_exception: [parent] Data too large, data for [<http_request>] would be [31656845952/29.4gb], which is larger than the limit of [31621696716/29.4gb], real usage: [31654767576/29.4gb], new bytes reserved: [2078376/1.9mb], usages [eql_sequence=0/0b, fielddata=32168/31.4kb, request=0/0b, inflight_requests=333681778/318.2mb, model_inference=0/0b]
```
But at the end it also gives this error:
```
, Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:1004, Vertex vertex_1654243481653_2108_13_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
Closing: 0: jdbc:hive2://datanode0..com:2181,datanode..com:2181,master..com:2181,master010..com:2181,master010..com:81/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
```
Is this error due to a space crunch on my cluster, or is it due to the ES circuit breaker? Do you have any idea on that, because even after adding the new node the error doesn't seem to go away.
I found this documentation for the ES-Hadoop jar. Could it be helpful if I try reducing the batch entry size / batch size in bytes?
```
es.batch.size.bytes (default 1mb)
Size (in bytes) for batch writes using Elasticsearch bulk API. Note the bulk size is allocated per task instance. Always multiply by the number of tasks within a Hadoop job to get the total bulk size at runtime hitting Elasticsearch.

es.batch.size.entries (default 1000)
Size (in entries) for batch writes using Elasticsearch bulk API - (0 disables it). Companion to es.batch.size.bytes, once one matches, the batch update is executed. Similar to the size, this setting is per task instance; it gets multiplied at runtime by the total number of Hadoop tasks running.
```
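If those are the right knobs, I assume they would be set on the ES-backed table itself, roughly like this (the table name, index, host, and the values below are placeholders, just guesses at something smaller than the defaults):

```sql
-- Placeholder table/index/host; the es.batch.* values are only
-- smaller-than-default guesses (defaults: 1000 entries, 1mb).
CREATE EXTERNAL TABLE my_es_table (
  id   BIGINT,
  name STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'           = 'my_index',
  'es.nodes'              = 'es-node:9200',
  'es.batch.size.entries' = '500',
  'es.batch.size.bytes'   = '512kb'
);

-- Or, for an existing ES-backed table:
ALTER TABLE my_es_table SET TBLPROPERTIES (
  'es.batch.size.entries' = '500',
  'es.batch.size.bytes'   = '512kb'
);
```

Since the docs say these are per-task limits, my understanding is that the total bulk payload hitting the cluster at any moment is roughly (number of concurrent Hive/Tez tasks) × es.batch.size.bytes, so if the ~1000 tasks from the Tez error above are all writing, even the 1mb default could mean close to 1 GB in flight against a heap that is already nearly full.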