1.7 Billion Document Migration - Circuit Breaking Exception


(kurtis) #1

Setup:
----All machines hosted on AWS EC2----
3 Dedicated Masters (15 GB RAM)
3 D/I (data/ingest) Nodes (64 GB RAM, 30 GB dedicated to heap)
1 Network Load Balancer

Scenario:

We are moving our Elasticsearch cluster from a hosted service to an internally managed cluster. As part of the move we re-architected some of our indices into "per-customer indices", so we now have approx. 500 indices, each with 4 shards and 1 replica. I have been migrating our data from the old cluster into the new one, and I have successfully migrated 1.2 billion of the 1.7 billion documents. Up to this point there were only minor issues that were easily resolved by simple script refactoring.

Issue:

I cannot index any more documents without getting a CircuitBreakingException. I have read all the documentation around circuit breaker exceptions, but have not yet found a solution. I have set the fielddata cache size to 50%, the fielddata breaker limit to 60%, and the total (parent) breaker limit to 70%. The problem persists.
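For reference, this is roughly how I applied the two breaker limits, via the dynamic cluster settings API (a sketch; the fielddata cache size itself, `indices.fielddata.cache.size`, is a static node setting that I set in `elasticsearch.yml` instead, since it cannot be changed dynamically):

```json
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "60%",
    "indices.breaker.total.limit": "70%"
  }
}
```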

Possibly helpful information:

  • I ran GET _stats/_all before running the script and again while it was running (before it crashed). The text files are too large to paste here, so I uploaded them to Google Drive. This may provide some valuable insight:
    https://drive.google.com/open?id=11gTj3F_A24_6MgwbtZvL1GLjfhIseYwl
    I included the _all section and the stats for the index at this stage in the migration.

  • The cluster is not currently in use; it is only being written to, with the occasional query to check the status of the migration.

  • The circuit breaking exception states that the data would be 20.9 GB, which exceeds the limit of 20.9 GB. Since 20.9 GB is approx. 70% of the available heap space, I believe it is the parent circuit breaker that is tripping, as it defaults to 70% of heap.

  • (for those of you familiar with the Python bulk API): This is the bulk helper call I used to insert the first 1.2 billion documents without failure:
    helpers.bulk(es, generator(account_id), chunk_size=100000, max_retries=3)

  • I have tried reducing the chunk size, and that did not solve it.
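To sanity-check the parent-breaker theory from the bullets above, the arithmetic can be verified in a few lines (a sketch; the byte figure is copied from the exception in the log below, and the heap size is the 30 GB per data node from the setup):

```python
# Sanity check: is the rejected request size consistent with the
# parent circuit breaker sitting at ~70% of a 30 GB heap?
HEAP_BYTES = 30 * 1024**3            # 30 GB heap per data node (from the setup)
PARENT_LIMIT = 0.70 * HEAP_BYTES     # parent breaker at 70% of heap, ~21 GB

# Figure taken from the CircuitBreakingException in the log:
# "data for [<transport_request>] would be [22500182661/20.9gb]"
would_be = 22_500_182_661

print(f"70% of heap:       {PARENT_LIMIT / 1024**3:.2f} GB")
print(f"request rejected:  {would_be / 1024**3:.2f} GB")
```

The two values agree to within about 0.2%, which supports the idea that it is the parent breaker tripping. Worth noting on the bulk side: `chunk_size` in the Python helpers counts documents, not bytes; if I understand the elasticsearch-py docs correctly, `helpers.bulk` also accepts a `max_chunk_bytes` parameter that caps the request body size directly, which may be a better lever than document count when document sizes vary.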

Exception:

`[2018-03-15T00:23:54,871][WARN ][o.e.a.b.TransportShardBulkAction] [es-data-1] [[583cac778e80276912b44300-breadcrumbs_v1][2]] failed to perform indices:data/write/bulk[s] on replica [583cac778e80276912b$
org.elasticsearch.transport.RemoteTransportException: [es-data-2][172.31.3.107:9300][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [<transport_request>] would be [22500182661/20.9gb], which is larger than the limit of [2249976709...

(kurtis) #2

If I have left anything out or you require any more information at all, please let me know. I have been working on this project for a week now, and I would really like to finish this migration :smiley:


(kurtis) #3

Shameless self-bump. If I have left out any information that would assist anyone, please let me know. I am desperate to solve this problem.


(David Pilato) #4

Read this and specifically the "Also be patient" part.


(kurtis) #5

Note: this isn't intended to offend or be rude, just justifying my actions:

While I wouldn't normally reply in a passive-aggressive manner, I feel that you may want to look over the guidelines you provided, since they specifically say that a reminder ping is welcome.


(David Pilato) #6

Sure. It's fine after 2 or 3 days (not including weekends), but not after 5 hours, IMO.


(kurtis) #7

Ok, I will take your 2-3 day rule of thumb into consideration next time. Thanks for everything!


(system) closed #8

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.