Bulk API parameters

Ashish_Mishra · July 28, 2014, 11:02pm

I'm uploading documents using syntax like the following.

curl -XPOST 'http://localhost:9200/test/type1/_bulk' -d '
{ "index" : { "_id" : "i1", "version": 3, "version_type": "external", "replication": "async", "timeout": "5m" } }
{ "fields": "values etc." }
{ "index" : { "_id" : "i2", "version": 1, "version_type": "external", "replication": "async", "timeout": "5m" } }
{ "fields": "values etc." }
'

A couple of questions. First, there's a fair bit of redundancy in the
action line. It feels wasteful when sending tens of MB and thousands of
requests per API call.
Can I roll default version_type / replication / timeout parameters into the
top-level _bulk URL? I've seen a few resolved issues suggesting this, but
it's not mentioned in the documentation at
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html

Second, in the response I occasionally see errors like
{"index":"test","_type":"type1","_id":"123","status":503,"error":"UnavailableShardsException[[test][98] [3] shardIt, [3] active : Timeout waiting for [0s], request: org.elasticsearch.action.bulk.BulkShardRequest@36d185a1]"}

The "[0s]" part is surprising. The available-shard-timeout is 1m by
default, and I explicitly requested 5m. Does this get overridden somewhere?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ba4bde17-2668-42c4-9d14-0923571044d5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · July 29, 2014, 7:21am

You can use "version" and "version_type" per doc, of course.

The parameters "replication" and "timeout" per doc are ignored when using
bulk mode. They must be set at bulk request level.

Each bulk request is split and forwarded to the relevant shards. The
splitting is very fast: it searches for delimiters in the request chunk,
sorts the actions that belong to one shard, and forwards them as new
packets. For these packets, the bulk request level parameters
"replication" and "timeout" should work.
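
The split-and-forward step described above can be pictured with a small sketch. It is only an illustration: the CRC32 hash, the shard count of 5, and the names shard_for and split_bulk_by_shard are assumptions made for the example, not Elasticsearch's routing code, and only "index" actions (an action line followed by a source line) are handled.

```python
import json
import zlib
from collections import defaultdict

NUM_SHARDS = 5  # hypothetical primary shard count, chosen for the example


def shard_for(doc_id, num_shards=NUM_SHARDS):
    # Stand-in hash: Elasticsearch uses its own routing hash, not CRC32.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def split_bulk_by_shard(body, num_shards=NUM_SHARDS):
    """Group (action line, source line) pairs by their target shard."""
    lines = [line for line in body.split("\n") if line.strip()]
    groups = defaultdict(list)
    # Assumes every action is an "index" action followed by one source line.
    for action_line, source_line in zip(lines[0::2], lines[1::2]):
        doc_id = json.loads(action_line)["index"]["_id"]
        groups[shard_for(doc_id, num_shards)].append((action_line, source_line))
    return dict(groups)
```

Each group could then be sent to its shard as its own packet, which is why request-level settings such as "timeout" apply to the whole group rather than to single documents.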

Although the request format looks heavy, it is most appropriate for
distributed processing.

Jörg


Ashish_Mishra · July 29, 2014, 9:25pm

Just to be sure I understand -- are you suggesting the following syntax:

curl -XPOST 'http://localhost:9200/test/type1/_bulk?replication=async&timeout=5m' -d '
{ "index" : { "_id" : "i1", "version": 3, "version_type": "external" } }
{ "fields": "values etc." }
'

For my use case, version_type is always "external" for all documents in the
request. But I get the motivation for specifying it per-doc.
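
For illustration, a request of this shape could be assembled programmatically as sketched below. The helper names bulk_url and bulk_body are hypothetical, and the sketch only builds the URL and the newline-delimited body; it does not send anything. The request-level parameters go into the query string, while "version" and "version_type" stay on each action line.

```python
import json
from urllib.parse import urlencode


def bulk_url(base, index, doc_type, **params):
    """Return the _bulk URL with request-level parameters in the query string."""
    url = f"{base}/{index}/{doc_type}/_bulk"
    return f"{url}?{urlencode(params)}" if params else url


def bulk_body(docs, version_type="external"):
    """Serialize (doc_id, version, source) tuples as newline-delimited JSON."""
    lines = []
    for doc_id, version, source in docs:
        action = {"index": {"_id": doc_id, "version": version,
                            "version_type": version_type}}
        lines.append(json.dumps(action))  # one action per line, no wrapping
        lines.append(json.dumps(source))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


url = bulk_url("http://localhost:9200", "test", "type1",
               replication="async", timeout="5m")
body = bulk_body([("i1", 3, {"fields": "values etc."})])
```

Generating the action lines in code keeps the per-document repetition out of hand-written requests, even though it still travels over the wire.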

You said that timeout per-doc is ignored in bulk mode. So the
Elasticsearch default timeout, i.e. [1m] should have been applied to my
original requests.
Do you know why a [0s] timeout was applied instead? This was from a
response:

{"index":"test","_type":"type1","_id":"123","status":503,"error":"UnavailableShardsException[[test][98] [3] shardIt, [3] active : Timeout waiting for [0s], request: org.elasticsearch.action.bulk.BulkShardRequest@36d185a1]"}


jprante · July 29, 2014, 11:08pm

Yes, this is the syntax I suggest.

The "0s" timeout is a glitch in the bulk operations. The replica shard level
operations use TransportRequestOptions (transportOptions in
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java?source=c#L685)
for the timeout, but the bulk request timeout is not propagated into this
class. It is just a null value, which means that no extra timeout handling
is set up.

A fix would be to add the timeout to TransportRequestOptions in BulkAction
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkAction.java#L49
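
Independent of any server-side fix, a client can at least pick out the items that came back with status 503 so they can be resent. The sketch below is one possible approach, not something proposed in this thread: the function name unavailable_items is hypothetical, and it assumes the bulk response wraps per-item results in an "items" list where each entry is keyed by the action name.

```python
import json


def unavailable_items(response_text, status=503):
    """Return the _id of every bulk item that came back with the given status."""
    response = json.loads(response_text)
    failed = []
    for item in response.get("items", []):
        # Each item is assumed to be a one-key object, e.g. {"index": {...}}.
        for result in item.values():
            if result.get("status") == status:
                failed.append(result.get("_id"))
    return failed
```

The returned ids could then be used to rebuild a smaller bulk request containing only the documents that were rejected.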

Jörg
