Options to index data in ES


(hari) #1

Based on my understanding, ES has four out-of-the-box options to index
data, and I understand in theory how they work. I'm wondering if someone
has done a real-world implementation and comparison of what works best
for large volumes of data (100K records per hour) arriving at regular
intervals, where error handling is important in case indexing fails.

curl -XPUT - This is perhaps the simplest way to index a document: you
just perform a PUT on a REST endpoint. It is best seen as a development
option for quickly indexing documents for testing.
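For illustration, the same single-document PUT can be built
programmatically. The index name, type, id, and document below are
made-up placeholders; actually sending the request would need a running
ES node on localhost:9200, so here we only construct and inspect it:

```python
import json
import urllib.request

# Hypothetical document; "books", "book", and "1" are placeholder
# index / type / id values, not anything from the thread.
doc = {"title": "Elasticsearch basics", "pages": 120}

req = urllib.request.Request(
    url="http://localhost:9200/books/book/1",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",  # same operation as `curl -XPUT`
)

# urllib.request.urlopen(req) would send it; here we just inspect it.
print(req.get_method())  # PUT
```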

HTTP Bulk API - Push approach to indexing: an external application
consolidates the data periodically and formats it as JSON to be indexed.
This is much more reliable than a UDP bulk import, because you get an
acknowledgement of each index operation and can take corrective steps
based on the response.
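As a sketch, the bulk payload is newline-delimited JSON with an action
line before each document, and the bulk response carries a per-item
status you can inspect for error handling. The index/type names and the
sample response below are illustrative, not from a real cluster:

```python
import json

docs = [{"id": 1, "msg": "first"}, {"id": 2, "msg": "second"}]

# Build the NDJSON body: one action line, then one source line, per doc.
lines = []
for d in docs:
    lines.append(json.dumps(
        {"index": {"_index": "logs", "_type": "log", "_id": d["id"]}}))
    lines.append(json.dumps(d))
body = "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline

# A made-up bulk response; real ones come back from POST /_bulk.
response = {
    "took": 5,
    "items": [
        {"index": {"_index": "logs", "_id": "1", "status": 201}},
        {"index": {"_index": "logs", "_id": "2", "status": 503,
                   "error": "UnavailableShardsException"}},
    ],
}

# Corrective step: collect the documents whose index operation failed,
# e.g. to retry them in a later batch.
failed = [item["index"]["_id"] for item in response["items"]
          if item["index"]["status"] >= 300]
print(failed)  # ['2']
```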

UDP Bulk API - Connectionless datagram protocol. This is faster but less
reliable, since there is no acknowledgement.
E.g. cat bulk.txt | nc -w 0 -u localhost 9700
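The same push can be sketched as a raw UDP send (port 9700 matches the
nc example above). Note the fire-and-forget nature: sendto() succeeds as
soon as the OS accepts the datagram, with no confirmation that anything
was received, let alone indexed:

```python
import socket

payload = b'{"index":{"_index":"logs","_type":"log"}}\n{"msg":"hello"}\n'

# Connectionless: sendto() returns once the datagram is handed to the
# OS -- there is no acknowledgement, which is why this path is unreliable.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sent = sock.sendto(payload, ("127.0.0.1", 9700))
sock.close()
print(sent == len(payload))  # True even if nothing is listening on 9700
```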

River plugin - Pull approach; runs within an ES node and can pull data
from any data source. Useful when we are expecting a constant stream of
data changes that needs to be indexed and we don't want to write another
external application to push data into ES.

A river plugin can also import using the Bulk API, which is useful when
the plugin wants to accumulate data up to a certain threshold before
performing an import / indexing run.
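The accumulate-then-flush behaviour described above can be sketched as a
simple buffer. The threshold value is arbitrary, and flush_fn stands in
for whatever call performs the actual bulk request; a real river would
typically also flush on a time threshold:

```python
class BulkBuffer:
    """Accumulate index actions and flush them in batches."""

    def __init__(self, threshold=100, flush_fn=None):
        self.threshold = threshold
        self.flush_fn = flush_fn or (lambda batch: None)  # e.g. POST /_bulk
        self.pending = []
        self.flushes = 0

    def add(self, action):
        self.pending.append(action)
        if len(self.pending) >= self.threshold:
            self.flush()

    def flush(self):
        if self.pending:
            self.flush_fn(self.pending)  # one bulk request per batch
            self.flushes += 1
            self.pending = []

batches = []
buf = BulkBuffer(threshold=3, flush_fn=batches.append)
for i in range(7):
    buf.add({"_id": i})
buf.flush()  # drain the remainder on shutdown
print(buf.flushes, [len(b) for b in batches])  # 3 [3, 3, 1]
```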

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Nik Everett) #2

On Tue, Nov 5, 2013 at 10:40 AM, Hariharan Vadivelu hariinfo@gmail.com wrote:

HTTP Bulk API - Push approach to index data, if you have an external
application that consolidates the data in a timely manner
and then formats it to JSON to be indexed. This is much more reliable as
compared to UDP bulk import as you get an acknowledgement
of index operation and can take corrective steps based on the response.

We have pretty good success with this. We're seeing ~600 big documents
per second using multi-process producers with a reasonably tiny
Elasticsearch cluster.

We like push over pull because it gives us a lot of control and we have
good tools around other push processes.

Nik



(Jörg Prante) #3

How big are the records? 100k records per hour is not much if a record
is just 1k average size - that's only ~100 MB per hour, or under 30 KB/sec.

You have forgotten the Java TransportClient bulk indexing method; it's
the method I prefer: you can connect from remote, connect to the cluster
nodes you configure, index with multiple threads, and save a bit of HTTP
overhead by using the native ES protocol. With the Java TransportClient,
I can saturate the network interface (~10-11 MB/sec) for hours while
indexing...
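Whatever the transport, multi-threaded bulk indexing amounts to
splitting the record stream into chunks and submitting them
concurrently. A minimal sketch of that pattern (chunk size and worker
count are arbitrary, and send_bulk is a stand-in for the real bulk call,
whether HTTP POST /_bulk or the Java TransportClient's bulk API):

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(records, size):
    for i in range(0, len(records), size):
        yield records[i:i + size]

def send_bulk(batch):
    # Stand-in for the real bulk call; returns how many docs it "sent".
    return len(batch)

records = [{"_id": i} for i in range(1000)]

# Index with multiple threads: each worker sends one bulk batch at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    indexed = sum(pool.map(send_bulk, chunks(records, 250)))

print(indexed)  # 1000
```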

Jörg


