Bulk API parameters


(Ashish Mishra) #1

I'm uploading documents using syntax like the following.

curl -XPOST 'http://localhost:9200/test/type1/_bulk' -d '
{ "index" : { "_id" : "i1", "version": 3, "version_type": "external", "replication": "async", "timeout": "5m" } }
{ "fields": "values etc." }
{ "index" : { "_id" : "i2", "version": 1, "version_type": "external", "replication": "async", "timeout": "5m" } }
{ "fields": "values etc." }
'

A couple of questions. First, there's a fair bit of redundancy in the
action line. It feels wasteful when sending tens of MB / thousands of
requests per API call.
Can I roll default version_type / replication / timeout parameters into the
top-level _bulk URL? I've seen a few resolved issues suggesting this, but
it's not mentioned in the documentation at
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html

Second, in the response I occasionally see errors like
{"index":"test","_type":"type1","_id":"123","status":503,"error":"UnavailableShardsException[[test][98] [3] shardIt, [3] active : Timeout waiting for [0s], request: org.elasticsearch.action.bulk.BulkShardRequest@36d185a1]"}

The "[0s]" part is surprising. The available-shard-timeout is 1m by
default, and I explicitly requested 5m. Does this get overridden somewhere?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ba4bde17-2668-42c4-9d14-0923571044d5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

You can use "version" and "version_type" per doc, of course.

The parameters "replication" and "timeout" per doc are ignored when using
bulk mode. They must be set at bulk request level.

Each bulk request is split and forwarded to the relevant shards. The
splitting is very fast: it searches for delimiters in the request chunk,
sorts the actions that belong to one shard, and forwards them as new
packets. For these packets, the bulk request level parameters "replication"
and "timeout" should work.

Although the request format looks heavy, it is most appropriate for
distributed processing.
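
The splitting described above can be sketched roughly as follows. This is only an illustration, not Elasticsearch's code: the function name `split_bulk_by_shard` is made up, the CRC32 hash is a stand-in for whatever routing hash Elasticsearch really uses, and only "index" actions (an action line followed by a source line) are handled.

```python
import json
import zlib
from collections import defaultdict


def split_bulk_by_shard(body: str, num_shards: int) -> dict:
    """Group (action, source) pairs from a bulk body by a stand-in shard number."""
    # The bulk body is newline-delimited JSON; blank lines are skipped.
    lines = [line for line in body.split("\n") if line.strip()]
    groups = defaultdict(list)
    # Assumption: every action is an "index" action, so lines come in pairs.
    for action_line, source_line in zip(lines[0::2], lines[1::2]):
        action = json.loads(action_line)
        doc_id = action["index"]["_id"]
        # Stand-in routing: CRC32 of the id modulo the shard count.
        # Elasticsearch uses its own hash function; this is illustrative only.
        shard = zlib.crc32(doc_id.encode("utf-8")) % num_shards
        groups[shard].append((action, json.loads(source_line)))
    return dict(groups)


BULK_BODY = (
    '{ "index" : { "_id" : "i1", "version": 3, "version_type": "external" } }\n'
    '{ "fields": "values etc." }\n'
    '{ "index" : { "_id" : "i2", "version": 1, "version_type": "external" } }\n'
    '{ "fields": "values etc." }\n'
)

# Print which document ids would travel together in one per-shard packet.
for shard, items in sorted(split_bulk_by_shard(BULK_BODY, num_shards=3).items()):
    print(shard, [action["index"]["_id"] for action, _ in items])
```

Because each action line carries its own metadata, the grouping needs no context beyond the line pair itself, which is presumably why the format suits distributed processing.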

Jörg

In reply to Ashish Mishra's message of Tue, Jul 29, 2014 at 1:02 AM.



(Ashish Mishra) #3

Just to be sure I understand -- are you suggesting the following syntax:

curl -XPOST 'http://localhost:9200/test/type1/_bulk?replication=async&timeout=5m' -d '
{ "index" : { "_id" : "i1", "version": 3, "version_type": "external" } }
{ "fields": "values etc." }
'

For my use case, version_type is always "external" for all documents in the
request. But I get the motivation for specifying it per-doc.
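
For anyone generating these requests from a script, here is a minimal sketch of the same layout: "replication" and "timeout" go in the URL once, while "version" and "version_type" stay on each action line. The helper name `build_bulk_request` and its arguments are hypothetical, and nothing is sent over the network; it only builds the URL and body.

```python
import json
from urllib.parse import urlencode


def build_bulk_request(base_url, docs, **request_params):
    """Return (url, body) for a bulk index request without sending it."""
    url = base_url + "/_bulk"
    if request_params:
        # Request-level parameters are given once, in the query string.
        url += "?" + urlencode(request_params)
    lines = []
    for doc_id, version, source in docs:
        # Per-document metadata stays on the action line.
        action = {"index": {"_id": doc_id, "version": version,
                            "version_type": "external"}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # Each JSON object sits on its own line, and the body ends with a newline.
    return url, "\n".join(lines) + "\n"


url, body = build_bulk_request(
    "http://localhost:9200/test/type1",
    [("i1", 3, {"fields": "values etc."}), ("i2", 1, {"fields": "values etc."})],
    replication="async",
    timeout="5m",
)
print(url)
print(body, end="")
```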

You said that timeout per-doc is ignored in bulk mode. So the
Elasticsearch default timeout, i.e. [1m] should have been applied to my
original requests.
Do you know why a [0s] timeout was applied instead? This was from a
response:

{"index":"test","_type":"type1","_id":"123","status":503,"error":"UnavailableShardsException[[test][98] [3] shardIt, [3] active : Timeout waiting for [0s], request: org.elasticsearch.action.bulk.BulkShardRequest@36d185a1]"}
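
To spot these failures in a large response, something like the following sketch could be used. It assumes the response items have already been parsed into Python dicts; the function name `unavailable_items` and the sample data are made up for illustration, with the error strings shortened.

```python
def unavailable_items(items):
    """Return the ids of bulk response items that failed with status 503."""
    failed = []
    for item in items:
        # Some responses nest the result under the action name,
        # e.g. {"index": {...}}; unwrap that case before checking.
        if len(item) == 1:
            (value,) = item.values()
            if isinstance(value, dict):
                item = value
        if item.get("status") == 503:
            failed.append(item.get("_id"))
    return failed


# Hypothetical sample items, not taken from a real response.
SAMPLE_ITEMS = [
    {"_type": "type1", "_id": "123", "status": 503,
     "error": "UnavailableShardsException[...]"},
    {"_type": "type1", "_id": "124", "status": 201},
    {"index": {"_type": "type1", "_id": "125", "status": 503, "error": "..."}},
]

print(unavailable_items(SAMPLE_ITEMS))
```

The returned ids could then be resubmitted in a later bulk request.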

In reply to Jörg Prante's message of Tuesday, July 29, 2014 12:21:21 AM UTC-7.



(Jörg Prante) #4

Yes, this is the syntax I suggest.

The timeout of "0s" is a glitch in the bulk operations. The replica shard
level operations use TransportRequestOptions (transportOptions in
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/support/replication/TransportShardReplicationOperationAction.java?source=c#L685)
for the timeout, but the bulk request timeout is not propagated into this
class. It is just a null value, which means that extra timeout handling is
not set up.

A fix would be to add the timeout to TransportRequestOptions in BulkAction:
https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/action/bulk/BulkAction.java#L49

Jörg

In reply to Ashish Mishra's message of Tue, Jul 29, 2014 at 11:25 PM.


