What's the most efficient indexing for a river?

A question first: Are rivers in-process wrt. indexing? Or do they need
to cross an HTTP boundary to do the indexing?

(If the answer to this question is that indexing from a river always
has to cross an HTTP boundary, then the answer to the question in
the Subject is "bulk".)

Existing rivers like the RSS river, and the streaming JSON river, use
bulk indexing, so I suspect the answer is that bulk is the most
efficient...?

I'm writing a simple river where I'm doing HTTP GET to get chunks of
documents. The body of the GET response is a JSON array, with one JSON
object per line, i.e. something like this:
[{"a": 1, "b": "c", "d": 3.14},
{"a": 2, "b": "lala", "d": 2.78},
{"a": 3, "b": "fdsf", "d": 6.023e+23}]

If the river is in-process and communicates in-process with the index,
then what will probably(*) be most efficient is to parse as little
as possible and allocate as little as possible, i.e. read in each line
and strip leading and trailing brackets, commas and whitespace.
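
The line-stripping idea could be sketched roughly as follows. This is only an illustration, using plain JDK classes; `LineSplitter`, `stripLine` and `readObjects` are hypothetical names, not part of Elasticsearch. It assumes the response really does carry exactly one complete JSON object per line, as in the sample above. A pretty-printed object spanning several lines would break it.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an Elasticsearch class): one JSON object per input line. */
final class LineSplitter {

    /** Returns the line without the array's brackets, separators and padding, or null if nothing is left. */
    static String stripLine(String line) {
        int start = 0;
        int end = line.length();
        // Skip leading whitespace, commas and the array's opening bracket.
        while (start < end && isLeadingNoise(line.charAt(start))) {
            start++;
        }
        // Skip trailing whitespace, commas and the array's closing bracket.
        while (end > start && isTrailingNoise(line.charAt(end - 1))) {
            end--;
        }
        return start == end ? null : line.substring(start, end);
    }

    private static boolean isLeadingNoise(char c) {
        return c == '[' || c == ',' || Character.isWhitespace(c);
    }

    private static boolean isTrailingNoise(char c) {
        return c == ']' || c == ',' || Character.isWhitespace(c);
    }

    /** Reads the response body line by line and collects the raw JSON objects without parsing them. */
    static List<String> readObjects(Reader body) {
        List<String> docs = new ArrayList<>();
        new BufferedReader(body).lines().forEach(line -> {
            String doc = stripLine(line);
            if (doc != null) {
                docs.add(doc);
            }
        });
        return docs;
    }
}
```

Each returned string could then be handed to an index request as its source without ever building an object tree, at the price of trusting the sender's formatting.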

Or alternatively, if it's possible, pull the tokens from an
XContentParser and put them directly into an XContentBuilder (I haven't
checked whether this kind of token copying is possible).

Thanks!

  • Steinar

(*) Only actual measurement can tell, of course

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

A river uses the Java API. Its requests don't go through the REST layer,
but are already in binary format and get sent through the transport layer,
port 9300 by default (TCP).
Also, the client obtained within a river is a client that points to the
node where the river is running. Thus your client is cluster-aware (like
the node client using the Java API, even outside of elasticsearch, but this
one most likely holds data too), which means that it knows where the shards
are and where the requests need to go. Bulk is always best, as it tries to
group the requests per shard and minimizes the network roundtrips.
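
To make the "group per shard" point concrete, here is a rough, self-contained sketch of the idea. It is not Elasticsearch code: `ShardGrouping`, `shardFor` and `groupByShard` are hypothetical names, and `String.hashCode()` is only a stand-in for whatever routing hash the cluster actually uses. The point is that a batch of N documents collapses into at most one request per shard instead of N separate roundtrips.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: groups document ids by an assumed shard number. */
final class ShardGrouping {

    /** Stand-in routing: NOT Elasticsearch's real hash function, just hashCode modulo shard count. */
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /** Splits one batch of ids into per-shard groups, so each shard needs a single request. */
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String id : docIds) {
            groups.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return groups;
    }
}
```

With 1000 documents and 5 shards, for example, this yields at most 5 groups, which is where the saving in network roundtrips comes from.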

(In reply to Steinar Bang's message of Wednesday, November 13, 2013 10:11:27 AM UTC+1, above.)

Luca Cavanna cavannaluca@gmail.com:

> A river uses the Java API. Its requests don't go through the REST
> layer, but are already in binary format and get sent through the
> transport layer, port 9300 by default (TCP).

Ah, ok. Somewhere in between the two possibilities I envisioned... :-)

> Also, the client obtained within a river is a client that points to
> the node where the river is running. Thus your client is cluster-aware
> (like the node client using the Java API, even outside of
> elasticsearch, but this one most likely holds data too), which means
> that it knows where the shards are and where the requests need to go.
> Bulk is always best, as it tries to group the requests per shard and
> minimizes the network roundtrips.

Yes, as long as there is a network crossing involved, then bulk will
undoubtedly be the best.

Thanks!
