Hi @danielyahn, thanks for posting your questions on here. I'll try to answer them in order
Is it recommended to set index.refresh_interval to -1, when
It depends (tm).
index.refresh_interval is the Elasticsearch setting that controls how often an index is refreshed for search purposes. A refresh typically closes out the underlying Lucene writer for a segment and makes that segment available to the search system. The default is 1 second, and there are optimizations that skip refreshes for indices that aren't searched often until traffic picks back up.

If you have SLAs around how quickly indexed data must become visible to search operations, you would usually set this value close to or below that time. For high rates of ingestion, we suggest setting this value much higher. How much higher? It depends (tm) again. This is normally tune-and-test territory to find the right amount of time.
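As a sketch, the interval is a dynamic setting and can be changed at any time through the index settings API (the index name and interval value here are illustrative):

```shell
# Raise the refresh interval on an existing index from the 1s default
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```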
The es.batch.write.refresh setting is indeed part of ES-Hadoop, and it tells the connector whether or not it should perform a refresh operation after its tasks have completed. If you are tuning the refresh interval for an index, it is usually advised to disable this setting (es.batch.write.refresh=false), since ES-Hadoop will otherwise submit refresh operations to the cluster without regard for the index's refresh interval.
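A hedged sketch of how that might look when submitting a Spark job; ES-Hadoop also picks up its settings from SparkConf when they are prefixed with spark., and the class and jar names here are made up for illustration:

```shell
# Pass ES-Hadoop settings through Spark's configuration.
# ES-Hadoop reads "es.*" settings given to SparkConf as "spark.es.*".
spark-submit \
  --conf spark.es.nodes=localhost \
  --conf spark.es.batch.write.refresh=false \
  --class com.example.IngestJob \
  ingest-job.jar
```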
When running Spark job, how often does refresh API get called? Is it per Spark job, stage or task? Or something smaller?
If enabled, ES-Hadoop performs a refresh operation on all indices it is writing to at the conclusion of each task that performs a write. For Spark, this is usually at the task level.
Is the refresh API call per index (or per shard)?
The refresh call is per index, and it refreshes all shards under that index. This makes the refresh operation somewhat costly, which is why we suggest turning it off if you are tuning the refresh interval at all.
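For illustration, the call ES-Hadoop issues is equivalent to hitting the refresh API yourself, which operates on a whole index (index name is illustrative):

```shell
# Refresh a single index; every shard under it is refreshed
curl -X POST "localhost:9200/my-index/_refresh"
```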
When multiple tasks finish around the same time, is it possible that the code above is calling refresh API multiple times at the same time? Could it cause performance issue on ES side?
This is correct. Since multiple tasks may finish around the same time, it is often the case that multiple refresh operations are fired off in rapid succession. For most batch workflows this is not a problem, since the job has concluded and a refresh is usually appropriate, but for workflows that expect to continue writing after a task completes it is suboptimal. Refreshing too often causes Elasticsearch to close out write operations with fewer documents, thus incurring a higher IO cost as it writes smaller files to disk.
The default for this setting in ES-Hadoop is informed more by the desire to be easy to start working with and tunable later; thus we set it to true by default.
Is there a known performance issue with ES-Hadoop's refresh call when there's a large parallelism?
As mentioned above, rapid calls to refresh while indexing is ongoing incur higher IO costs as Elasticsearch writes files to make documents available for search. In high-parallelism environments, we suggest disabling automatic refresh in ES-Hadoop and configuring the refresh interval on the index to a reasonable amount of time.
I'm running into the following problem. I'm seeing "Read" timeouts with Flush API calls when ingesting to ES using Spark with the default setting es.batch.write.refresh=true. It doesn't seem to be related to tuning the bulk request size, because when I set index.refresh_interval='30s' I don't get the timeout. What's the best practice for configuring refresh when writing a Spark ingest job?
If it is not failing, usually the defaults are fine. If you notice a decrease in speed due to higher IO load on Elasticsearch, we suggest tuning the refresh rate and automatic refresh at that point. Since your job is failing due to timeouts, you should definitely disable the ES-Hadoop automatic refresh and rely on the index's refresh interval to perform the refresh operation. This makes even more sense when you are ingesting via Spark Streaming, since its architecture is microbatch-based: Elasticsearch can batch up documents for refresh on its own across multiple microbatches, rather than receiving constant refresh operations.
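Putting both pieces together, a minimal sketch of that configuration (the index name, interval value, and jar name are illustrative, not prescriptive):

```shell
# 1) Let the index refresh on its own schedule instead of per task
curl -X PUT "localhost:9200/my-index/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'

# 2) Stop ES-Hadoop from issuing its own refresh after each task
spark-submit --conf spark.es.batch.write.refresh=false ingest-job.jar
```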