Hi,
I'm currently using the elasticsearch-hadoop component to load data into my
ES cluster. Currently, the ESOutputFormat will accept a Map<Writable,
Writable> or a Text that is already in JSON format. My question: Is there
a performance advantage to using one over the other?
There is, but in the big picture it doesn't make any difference. If the data is already in JSON format, es-hadoop can
stream it directly without doing any conversion. With a Map<Writable,Writable>, the map has to be
converted into JSON first. Note that this conversion is quite efficient and uses the same amount of memory no matter the
number of documents/maps.
Considering Hadoop's batch nature, I would not worry about choosing one over the other but rather focus on ease of use.
If the data is already in JSON or you want ultimate control over what is sent to Elasticsearch, then JSON is the way to go:
the data is streamed as is.
If you don't use JSON and have data in various formats readable through Hadoop, then pick the Map<Writable,Writable>.
It gives you maximum interoperability and you don't have to worry about transforming data into an intermediate format.
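To make the difference concrete, here is a rough sketch in plain Java of what each path involves. It is only an illustration: PayloadSketch, toJson and passThrough are hypothetical names, a plain java.util.Map stands in for Map<Writable,Writable>, and the tiny serializer is not es-hadoop's actual code. As far as I know, the JSON path in es-hadoop is enabled through the es.input.json setting, but check the documentation for your version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PayloadSketch {

    // Map path: every document is turned into JSON before it is sent.
    // Hypothetical stand-in for that step; handles only strings, numbers
    // and booleans, and writes any other value as a quoted string.
    static String toJson(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v instanceof Number || v instanceof Boolean) {
                sb.append(v);
            } else {
                sb.append('"').append(escape(String.valueOf(v))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    // Escapes backslashes and double quotes only (control characters are
    // not handled in this sketch).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // JSON path: the document is already JSON, so it is handed over untouched.
    static String passThrough(String json) {
        return json;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("user", "brian");
        doc.put("visits", 3);

        System.out.println(toJson(doc));
        System.out.println(passThrough("{\"user\":\"brian\",\"visits\":3}"));
    }
}
```

Both calls end up with the same payload. The Map path costs one small conversion per document, while the JSON path sends exactly the bytes you provide, which is why it gives you full control over what reaches Elasticsearch.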
Hope this helps,
On 3/14/2014 4:46 PM, Brian Stempin wrote:
Hi,
I'm currently using the elasticsearch-hadoop component to load data into my ES cluster. Currently, the ESOutputFormat
will accept a Map<Writable, Writable> or a Text that is already in JSON format. My question: Is there a performance
advantage to using one over the other?