Hi forum,
I'm on AWS, trying to write ~1.2 million documents from an AWS Glue (Python/PySpark) job to a managed Elasticsearch cluster.
To be fair, I should mention that I'm actually using AWS OpenSearch 1.2, but my question is more or less about the Elasticsearch-Hadoop integration in general.
The issue is that after a while I get "429 Too Many Requests":
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: [PUT] on [MY/doc/_bulk] failed; server[https://SOME_ENDPOINT_HERE] returned [429|Too Many Requests:]
From what I've read so far, this is mostly a configuration issue: indexing requests need to be throttled on the client side to give the server more time to work through its queue. That's exactly what I tried, but somehow the configuration on the Hadoop connector has no effect for me.
So I tried to send smaller batches of documents to Elasticsearch and increased the retry wait time, setting 'es.batch.size.entries' to 100 and 'es.batch.write.retry.wait' to 30s:
(df.write
    .mode('overwrite')
    .format('org.elasticsearch.spark.sql')
    # connection settings
    .option('es.nodes', 'SOME_ENDPOINT_HERE')
    .option('es.port', '443')
    .option('es.net.ssl', 'true')
    .option('es.net.http.auth.user', 'SOME_USER_NAME_HERE')
    .option('es.net.http.auth.pass', 'SOME_PASS_HERE')
    # managed cluster is only reachable through a single endpoint,
    # so no node discovery
    .option('es.nodes.wan.only', 'true')
    .option('es.nodes.discovery', 'false')
    # index / document settings
    .option('es.resource', 'SOME_NAME_HERE')
    .option('es.index.auto.create', 'true')
    .option('es.mapping.id', 'SOME_FIELD_HERE')
    .option('es.write.operation', 'index')
    # throttling: smaller bulks, retry indefinitely, wait 30s between retries
    .option('es.batch.size.entries', '100')
    .option('es.batch.write.retry.policy', 'simple')
    .option('es.batch.write.retry.count', '-1')
    .option('es.batch.write.retry.limit', '-1')
    .option('es.batch.write.retry.wait', '30s')
    .save())
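I'm also wondering whether the writer options are picked up at all, or whether I should set them globally on the SparkConf instead. As far as I understand, any es.* setting can be passed there with a 'spark.' prefix; a minimal sketch of what I mean (untested on my side):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Sketch: the same throttling settings, but set globally on the SparkConf
# (with the 'spark.' prefix) instead of as per-writer options.
conf = SparkConf() \
    .set('spark.es.batch.size.entries', '100') \
    .set('spark.es.batch.write.retry.wait', '30s')

spark = SparkSession.builder.config(conf=conf).getOrCreate()

Would that behave any differently from the .option() calls above?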
I've already set the 'org.elasticsearch.hadoop.rest' logger to DEBUG level.
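I did that roughly like this, from inside the job via the JVM gateway (assuming the log4j 1.x setup that Spark 2.4 ships with):

# Raise the ES-Hadoop REST logger to DEBUG through the JVM gateway.
log4j = spark.sparkContext._jvm.org.apache.log4j
log4j.LogManager.getLogger('org.elasticsearch.hadoop.rest').setLevel(log4j.Level.DEBUG)

With DEBUG enabled, this is what the bulk flushes look like: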
Bulk Flush #[12653715211658247214404]: Sending batch of [34000] bytes/[1000] entries
Bulk Flush #[12653715211658247214404]: Response received
Bulk Flush #[12653715211658247214404]: Completed. [1000] Original Entries. [1] Attempts. [1000/1000] Docs Sent. [0/1000] Docs Skipped. [0/1000] Docs Aborted.
From what I understand, the Hadoop connector is still sending batches of 1000 entries (which is the default for 'es.batch.size.entries'), not the 100 from my config. Also, I can't see any wait time being applied.
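If the batch settings simply aren't applied, would reducing write parallelism be a reasonable workaround? As far as I know, each Spark task keeps its own bulk buffer, so fewer concurrent tasks should mean fewer concurrent bulk requests hitting the cluster. Something like this (the partition count of 4 is just a made-up number):

(df.coalesce(4)  # fewer partitions -> fewer concurrent writer tasks
    .write
    .mode('overwrite')
    .format('org.elasticsearch.spark.sql')
    # ... same options as above ...
    .save())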
My current setup on AWS is:
Spark: 2.4.3
Python: 3.7
Elasticsearch: 7.10 / OpenSearch 1.2
Elasticsearch Hadoop: 7.13.4
Any hints or ideas on my setup?
Many Thanks,
Matthias