We are using this template to ingest data from BigQuery into Elasticsearch once a day.
It creates a Dataflow job with the following relevant parameters:
"usePartialUpdate": "true", "batchSizeBytes": "5242880", "bulkInsertMethod": "INDEX", "maxNumWorkers": "30", "workerMachineType": "n1-standard-1"
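For context, a sketch of how such a job might be launched with the Google-provided BigQuery to Elasticsearch flex template. The project, dataset, index name, connection URL, and API key below are placeholders, not our real values:

```shell
# Hypothetical launch command; only the last five parameters are the
# ones quoted above, everything else is a placeholder.
gcloud dataflow flex-template run "bq-to-es-daily" \
  --project="my-project" \
  --region="europe-west1" \
  --template-file-gcs-location="gs://dataflow-templates-europe-west1/latest/flex/BigQuery_to_Elasticsearch" \
  --parameters="inputTableSpec=my-project:my_dataset.my_table,\
connectionUrl=https://my-deployment.es.europe-west1.gcp.cloud.es.io,\
apiKey=REDACTED,\
index=my-index,\
usePartialUpdate=true,\
batchSizeBytes=5242880,\
bulkInsertMethod=INDEX,\
maxNumWorkers=30,\
workerMachineType=n1-standard-1"
```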
Total Index size: 70 million rows.
Job updates daily: 7 million rows.
Index refreshes every 30 minutes.
Index snapshotting happens once a day, outside the ingestion time.
By only changing the parameter `usePartialUpdate` to `true`, we see write throughput drop sharply from around 7,000 records/second.

How come sending an update to one field of a record is slower than sending the entire record to overwrite it?
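To make the comparison concrete, here is what I understand the two modes to send to the `_bulk` API (index name and document ID are made up for illustration). My understanding is that an `update` action is internally a read-modify-write: Elasticsearch fetches the existing `_source`, merges the changed fields, and reindexes the whole document, so less data on the wire does not mean less work per document:

```shell
# Full-document index action (usePartialUpdate=false):
# the document is replaced wholesale, no prior fetch needed.
cat <<'EOF'
{ "index": { "_index": "my-index", "_id": "1" } }
{ "field_a": "new value", "field_b": "unchanged value" }
EOF

# Partial update action (usePartialUpdate=true):
# the cluster must get the current source, merge "doc" into it,
# then reindex the merged document.
cat <<'EOF'
{ "update": { "_index": "my-index", "_id": "1" } }
{ "doc": { "field_a": "new value" } }
EOF
```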
Elastic cluster size: 180 GB storage | 4 GB RAM | up to 8 vCPU, single zone. Elasticsearch version: 8.4.3, managed through the GCP Marketplace.
BigQuery, Dataflow, and Elasticsearch are all in the same GCP region (europe-west1).
I found a few references to similar issues from a few years ago, which I hoped had been fixed by now.