I'm trying to read a single-line JSON data source of around 26M records, apply some logic (two filters and then a select to keep the 6 fields I need) and save the result to ES... so far so good.
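Roughly, the pipeline looks like this (a sketch only; the field and filter names are made up, and I'm assuming the JavaEsSparkSQL API from elasticsearch-hadoop):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().appName("json-to-es").getOrCreate();

// Single-line JSON, ~26M records
Dataset<Row> df = spark.read().json("/path/to/source.json");

// Two filters plus a select down to the 6 fields I keep (all names hypothetical)
Dataset<Row> slim = df
        .filter(col("status").equalTo("active"))
        .filter(col("amount").gt(0))
        .select("id", "name", "status", "amount", "created", "updated");

// "myindex/mytype" is a placeholder resource
JavaEsSparkSQL.saveToEs(slim, "myindex/mytype");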
The problem seems to be EsSparkSQL.saveToEs(), which always raises a circuit breaker exception. If this connector is responsible for doing whatever it does (I don't really know why it takes such a large number of tasks/jobs to save already formatted data, except that it is not JSON) and then saving to ES, why is this exception raised? Shouldn't it be smart enough to check the maximum request size and flush the bulk before that limit is exceeded? (See the bulk-size sketch after the stack trace below.)
This is the exact exception (the job bails out):
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: circuit_breaking_exception: [parent] Data too large, data for [<transport_request>] would be [259268254/247.2mb], which is larger than the limit of [254332108/242.5mb], real usage: [258560280/246.5mb], new bytes reserved: [707974/691.3kb], usages [request=0/0b, fielddata=38332/37.4kb, in_flight_requests=1395602/1.3mb, accounting=5724890/5.4mb]
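In case the answer is "no, tune it yourself": my understanding is that each Spark task keeps its own bulk request in flight, so the parent breaker can trip when several of them pile up on one node. Here is a sketch of what I could try, using the documented es.batch.* settings of elasticsearch-hadoop (the index name and the exact values are guesses, and slim is the Dataset from the sketch above):

import java.util.HashMap;
import java.util.Map;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

Map<String, String> cfg = new HashMap<>();
cfg.put("es.batch.size.bytes", "512kb");    // halve the default 1mb per-task bulk
cfg.put("es.batch.size.entries", "500");    // or flush after 500 docs, whichever comes first (default 1000)
cfg.put("es.batch.write.retry.count", "6"); // retry rejected bulk items a few more times (default 3)
JavaEsSparkSQL.saveToEs(slim, "myindex/mytype", cfg);

Fewer simultaneous writers should also lower the in-flight total, so coalescing to fewer partitions (e.g. slim.coalesce(8)) might help as well.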
I've also tried raising the breaker limit to 99%, but with the same result. KO.
// Bump the parent circuit breaker via the cluster settings API
// (Apache Commons HttpClient 3.x; host and CLUSTER_SETTINGS_ENDPOINT are defined elsewhere)
HttpClient httpClient = new HttpClient();
String payload = "{\"persistent\" : {\"indices.breaker.total.limit\" : \"99%\"}}";
StringRequestEntity requestEntity = new StringRequestEntity(payload, "application/json", "UTF-8");
PutMethod putMethod = new PutMethod(host + CLUSTER_SETTINGS_ENDPOINT);
putMethod.setRequestEntity(requestEntity);
int statusCode = httpClient.executeMethod(putMethod);
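For what it's worth, the 242.5mb limit in the trace looks like roughly 95% of a ~256 MB heap, which makes me suspect the 99% setting never actually took effect. A quick sanity check (a sketch reusing the same client; GetMethod comes from the same commons-httpclient package, and include_defaults is a standard query parameter of the cluster settings API):

// Read the settings back to confirm the 99% limit was applied
GetMethod getMethod = new GetMethod(host + CLUSTER_SETTINGS_ENDPOINT + "?include_defaults=true");
int getStatus = httpClient.executeMethod(getMethod);
System.out.println(getMethod.getResponseBodyAsString()); // look for indices.breaker.total.limit

And since the breaker limit is a percentage of the ES JVM heap, raising the heap itself (e.g. -Xmx in jvm.options) would raise the absolute limit far more than going from 95% to 99% ever could.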
If the answer is no, how can I fix this?
Thanks.