Setup:
----All machines hosted on AWS EC2----
3 Dedicated Masters (15 GB RAM)
3 D/I Nodes (64 GB RAM, 30 GB dedicated to heap)
1 Network Load Balancer
Scenario:
We are moving our Elasticsearch cluster from a hosted service to an internally managed cluster. In doing so, we decided to re-architect some of our indexes into "per customer indexes", so we now have approx. 500 indexes with 4 shards and 1 replica each. I have been migrating our data from the old cluster into the new one and have successfully migrated 1.2 billion of the 1.7 billion documents. Up to this point there were only minor issues, which were easily resolved by simple script refactoring.
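As a rough check on what this layout means for the data nodes, the total shard count can be worked out from the numbers above. This is only a sketch: it assumes all of the approx. 500 indexes use the same 4 shards and 1 replica, and that shard copies spread evenly over the 3 D/I nodes. The variable names are illustrative.

```python
# Figures taken from the setup and scenario above; "approx. 500" is treated as exactly 500.
index_count = 500
primaries_per_index = 4
replicas_per_primary = 1
data_nodes = 3

# Every primary shard has one replica copy, so each index holds 4 * (1 + 1) shard copies.
shard_copies_per_index = primaries_per_index * (1 + replicas_per_primary)
total_shard_copies = index_count * shard_copies_per_index

# Assumption: the cluster balances shard copies evenly across the data nodes.
shards_per_node = total_shard_copies / data_nodes

print(f"total shard copies: {total_shard_copies}")
print(f"shard copies per data node: {shards_per_node:.0f}")
```

Each shard is a separate Lucene index with its own memory overhead, so a per-node count in this range may be worth keeping in mind when looking at heap usage.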
Issue:
I cannot index any more documents without getting a "Circuit Breaking Exception". I have read all the documentation on circuit breaking exceptions but have not found a solution yet. I have set the fielddata cache size to 50%, the fielddata breaker limit to 60%, and the total limit to 70%. The problem persists.
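For reference, the two breaker limits mentioned above correspond to dynamic cluster settings, while the fielddata cache size is a static node setting that lives in elasticsearch.yml. The sketch below only builds the request body for the cluster settings API. Whether the values were applied as persistent or transient is not stated, so `persistent` is an assumption, and `settings_body` is just an illustrative name.

```python
import json

# Dynamic breaker limits, using the percentages described above.
# Assumption: applied as "persistent"; "transient" would be built the same way.
settings_body = {
    "persistent": {
        "indices.breaker.fielddata.limit": "60%",
        "indices.breaker.total.limit": "70%",
    }
}

# The fielddata cache size is not a dynamic setting; it would go in elasticsearch.yml as:
#   indices.fielddata.cache.size: 50%

print(json.dumps(settings_body, indent=2))
```

With elasticsearch-py, a body like this could be sent with `es.cluster.put_settings(body=settings_body)`. Since 60% and 70% appear to be the default values for these two breakers in this version line, setting them explicitly would probably not change the behaviour.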
Possibly helpful information:
- I ran `GET _stats/_all` before running the script and during the script (before it crashes). The text files are too large to put in here, so I uploaded them to Google Drive, in case they provide some valuable insight:
  https://drive.google.com/open?id=11gTj3F_A24_6MgwbtZvL1GLjfhIseYwl
  I included the `_all` section and the index being written to at this stage in the migration.
- The cluster is not currently in use; it is only being written to, with the occasional query to check the status of the migration.
- The circuit breaking exception states that the data would be 20.9 GB, which exceeds the limit of 20.9 GB. Since 20.9 GB is approx. 70% of the available heap space, I believe it is the parent circuit breaker that is tripping, as it defaults to 70% of heap.
- (For those of you familiar with the Python bulk API:) This is the bulk helper I have used to insert the first 1.2 billion documents without failure:
  helpers.bulk(es, generator(account_id), chunk_size=100000, max_retries=3)
- I have tried reducing the chunk size, and that did not solve it.
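Since `chunk_size` only caps the number of documents per bulk request, one thing that could still be tried is capping each request by size in bytes as well. In elasticsearch-py, `helpers.bulk` forwards `max_chunk_bytes` to `streaming_bulk` for this purpose. The sketch below shows the same idea in plain Python so it can be run without a cluster; `chunk_actions` and the limits used are hypothetical, and this may well not be the root cause given that smaller chunks did not help.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def chunk_actions(
    actions: Iterable[Dict[str, Any]],
    max_docs: int = 1000,
    max_bytes: int = 10 * 1024 * 1024,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of actions capped by document count and approximate JSON size."""
    chunk: List[Dict[str, Any]] = []
    size = 0
    for action in actions:
        # Approximate the wire size of the action by its serialized JSON length.
        action_bytes = len(json.dumps(action).encode("utf-8"))
        # Close the current chunk if adding this action would break either cap.
        if chunk and (len(chunk) >= max_docs or size + action_bytes > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(action)
        size += action_bytes
    if chunk:
        yield chunk


# Illustrative input only; real actions would come from generator(account_id).
docs = [{"_index": "example", "n": i} for i in range(2500)]
for batch in chunk_actions(docs, max_docs=1000):
    print(len(batch))
```

With the real helper, the equivalent would be passing a smaller `chunk_size` together with `max_chunk_bytes` to `helpers.bulk`, keeping `max_retries` as before.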
Exception:
[2018-03-15T00:23:54,871][WARN ][o.e.a.b.TransportShardBulkAction] [es-data-1] [[583cac778e80276912b44300-breadcrumbs_v1][2]] failed to perform indices:data/write/bulk[s] on replica [583cac778e80276912b$
org.elasticsearch.transport.RemoteTransportException: [es-data-2][172.31.3.107:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [22500182661/20.9gb], which is larger than the limit of [2249976709...
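
The byte count in the exception can be checked against the 70% assumption with a little arithmetic. This is a sketch under stated assumptions: the "30 GB" of heap is taken to mean 30 GiB, the parent breaker is assumed to be at 70%, and only the untruncated "would be" value from the log is used.

```python
GIB = 1024 ** 3

# Taken from the exception text: the size the request "would be".
would_be_bytes = 22_500_182_661

# Taken from the setup: 30 GB of heap per D/I node (assumed to mean 30 GiB).
configured_heap_gib = 30

# Assumption: the parent breaker limit is 70% of heap.
parent_fraction = 0.70

would_be_gib = would_be_bytes / GIB
nominal_limit_gib = configured_heap_gib * parent_fraction

# The request was rejected, so the real limit is below would_be_bytes. Dividing by the
# fraction gives an upper bound on the heap size the breaker is working from.
heap_upper_bound_gib = would_be_bytes / parent_fraction / GIB

print(f"would be:              {would_be_gib:.3f} GiB")
print(f"70% of 30 GiB:         {nominal_limit_gib:.3f} GiB")
print(f"heap upper bound:      {heap_upper_bound_gib:.3f} GiB")
```

The "would be" value sits just under 70% of a full 30 GiB, which would be consistent with the parent breaker being the one that trips and with the JVM reporting slightly less usable heap than the configured size. That second point is an inference from the numbers, not something the log states.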