A question first: are rivers in-process w.r.t. indexing, or do they need
to cross an HTTP boundary to do the indexing?
(If the answer is that indexing from a river always has to cross an HTTP
boundary, then the answer to the question in the Subject is "bulk".)
Existing rivers like the RSS river and the streaming JSON river use
bulk indexing, so I suspect the answer is that bulk is the most
efficient...?
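
For concreteness, this is roughly what I mean by bulk indexing from inside
the river. Only a sketch: it assumes the river gets the usual injected
Client, and the index/type names and flush threshold are made up.

import java.util.List;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.Requests;

// Sketch: collect documents into a bulk request and flush in batches.
// `lines` holds the JSON objects from the GET body, one document per string.
void indexChunk(Client client, List<String> lines) {
    BulkRequestBuilder bulk = client.prepareBulk();
    for (String line : lines) {
        bulk.add(Requests.indexRequest("myindex").type("mytype").source(line));
        if (bulk.numberOfActions() >= 1000) {        // arbitrary flush threshold
            BulkResponse response = bulk.execute().actionGet();
            if (response.hasFailures()) {
                // log / retry failed items
            }
            bulk = client.prepareBulk();
        }
    }
    if (bulk.numberOfActions() > 0) {
        bulk.execute().actionGet();
    }
}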
I'm writing a simple river where I do an HTTP GET to fetch chunks of
documents. The body of the GET response is a JSON array with one JSON
object per line, i.e. something like this:
[{"a": 1, "b": "c", "d": 3.14},
{"a": 2, "b": "lala", "d": 2.78},
{"a": 3, "b": "fdsf", "d": 6.023e+23}]
If the river is in-process and communicates in-process with the index,
then the most efficient approach will probably(*) be to parse as little
and allocate as little as possible, i.e. read each line and strip the
leading and trailing brackets, commas and whitespace.
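
Something along these lines is what I have in mind (again only a sketch,
and it assumes the server really does emit exactly one object per line;
the HttpURLConnection and the bulk builder from the sketch above are
passed in):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Requests;

// Treat each line of the response body as one document and trim the
// surrounding array punctuation instead of parsing the whole array.
void indexLines(HttpURLConnection connection, BulkRequestBuilder bulk) throws IOException {
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), "UTF-8"));
    String line;
    while ((line = reader.readLine()) != null) {
        String doc = line.trim();
        if (doc.startsWith("[")) doc = doc.substring(1);                   // opening bracket
        if (doc.endsWith("]"))  doc = doc.substring(0, doc.length() - 1);  // closing bracket
        if (doc.endsWith(","))  doc = doc.substring(0, doc.length() - 1);  // separating comma
        doc = doc.trim();
        if (doc.isEmpty()) continue;
        bulk.add(Requests.indexRequest("myindex").type("mytype").source(doc));
    }
}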
Or alternatively, if it's possible, pull the tokens from an
XContentParser and put them directly into an XContentBuilder (I haven't
checked yet what's possible along those lines).
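
Roughly this kind of thing, assuming copyCurrentStructure() can be used
this way (I haven't verified it, and `responseBody` plus the index/type
names are made up):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Requests;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentType;

// Stream tokens over the whole array and copy each object's structure
// into its own builder, so each document becomes one bulk item.
void indexFromParser(String responseBody, BulkRequestBuilder bulk) throws Exception {
    XContentParser parser = XContentFactory.xContent(XContentType.JSON)
            .createParser(responseBody);
    parser.nextToken();                                     // START_ARRAY
    while (parser.nextToken() != XContentParser.Token.END_ARRAY) {
        // the current token should now be START_OBJECT for the next document
        XContentBuilder doc = XContentFactory.jsonBuilder();
        doc.copyCurrentStructure(parser);                   // copy the object's tokens
        bulk.add(Requests.indexRequest("myindex").type("mytype").source(doc));
    }
}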
Thanks!
- Steinar
(*) Only actual measurement can tell, of course