We are using this template to ingest data from BigQuery into Elasticsearch once a day.
It creates a Dataflow job with the following relevant parameters:
"usePartialUpdate": "true",
"batchSizeBytes": "5242880",
"bulkInsertMethod": "INDEX",
"maxNumWorkers": "30",
"workerMachineType": "n1-standard-1"
Total index size: 70 million rows.
Rows updated by the daily job: 7 million.
Index refresh interval: 30 minutes.
Index snapshots run once a day, outside the ingestion window.
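
(The 30-minute refresh cadence is an index setting; a minimal sketch of how it is applied with the Python client, where the endpoint, API key, and index name are placeholders:)

```python
# Sketch: setting the 30-minute refresh interval on the index.
# Endpoint, API key, and index name are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-cluster.es.io:9243", api_key="REDACTED")
es.indices.put_settings(index="my-index", settings={"refresh_interval": "30m"})
```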
By changing only the usePartialUpdate parameter from false to true, we see write throughput drop from around 7,000 records/second to around 1,500 records/second.
Why is sending an update to a single field of a record slower than sending the entire record as an overwrite?
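
My understanding, which may be wrong, is that usePartialUpdate switches the bulk action from index (full overwrite) to update (partial doc), roughly like this sketch with the Python client; the endpoint, index name, ids, and fields are placeholders, not the template's actual payloads:

```python
# Sketch of the two bulk shapes I believe the template sends.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-cluster.es.io:9243", api_key="REDACTED")  # placeholder

# usePartialUpdate=false: "index" action, full document overwrite.
full_overwrite = [
    {"_op_type": "index", "_index": "my-index", "_id": "1",
     "field_a": "new", "field_b": "unchanged"},
]

# usePartialUpdate=true: "update" action carrying only the changed field.
# Since Elasticsearch internally fetches the existing source, merges, and
# reindexes the whole document anyway, I expected comparable throughput,
# not a 4-5x drop.
partial_update = [
    {"_op_type": "update", "_index": "my-index", "_id": "1",
     "doc": {"field_a": "new"}},
]

helpers.bulk(es, full_overwrite)
helpers.bulk(es, partial_update)
```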
Elasticsearch cluster: 180 GB storage | 4 GB RAM | up to 8 vCPU | single zone.
Elasticsearch version: 8.4.3.
The Elasticsearch deployment is managed through the GCP Marketplace.
BigQuery, Dataflow, and Elasticsearch are all in the same GCP region (europe-west1).
I found a few references to this behavior from a few years ago, which I hoped had been fixed by now.