[hadoop] Performance in using Text vs. MapWritable


(Brian Stempin) #1

Hi,
I'm using the elasticsearch-hadoop component to load data into my
ES cluster. Currently, the ESOutputFormat will accept either a Map<Writable,
Writable> or a Text that is already in JSON format. My question: Is there
a performance advantage to using one over the other?

Thanks,
Brian

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/20302cc7-799f-4723-89db-3b050123d2bd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Costin Leau) #2

Hey,

There is, but in the big picture it doesn't make a meaningful difference. If the data is already in JSON format, es-hadoop
can stream it directly without doing any conversion. With a Map<Writable,Writable>, the map has to be converted into
JSON first - note that this conversion is quite efficient and uses the same amount of memory no matter the number
of documents/maps.
Considering Hadoop's batch nature, I would not worry about choosing one over the other for performance but rather focus
on ease of use.

If the data is in JSON or you want ultimate control over what data is sent to Elasticsearch, then JSON is the way to go -
the data is streamed as is.
If you don't use JSON and have data in various formats readable through Hadoop, then pick the Map<Writable,Writable> -
it gives you maximum interoperability and you don't have to worry about transforming data into an intermediate format.
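
To make the difference between the two paths concrete, here is a minimal sketch in plain Java (standard library only). It is not es-hadoop's actual code: the class name JsonVsMapSketch and the methods mapToJson and passThrough are hypothetical, a plain java.util.Map stands in for Map<Writable,Writable>, and nested maps or arrays are not handled. It only shows that the map path needs a per-document serialization step, while the JSON path hands the text over unchanged.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonVsMapSketch {

    // Escape the characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Map path (hypothetical stand-in): every document is serialized to JSON
    // before it can be sent. Only flat maps of scalars are supported here.
    static String mapToJson(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null || v instanceof Number || v instanceof Boolean) {
                sb.append(v); // null, numbers and booleans are written without quotes
            } else {
                sb.append('"').append(escape(v.toString())).append('"');
            }
        }
        return sb.append('}').toString();
    }

    // JSON path (hypothetical stand-in): the document is already serialized,
    // so it is handed over exactly as received.
    static String passThrough(String json) {
        return json;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("user", "brian");
        doc.put("count", 3);
        System.out.println(mapToJson(doc));
        System.out.println(passThrough("{\"user\":\"brian\",\"count\":3}"));
    }
}
```

Both calls in main print the same document. Because the map path works on one document at a time, its memory use does not grow with the number of documents. In a real job es-hadoop performs the serialization itself; as far as I know, the already-JSON path is selected with the es.input.json setting, but check the configuration documentation for your version.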

Hope this helps,

On 3/14/2014 4:46 PM, Brian Stempin wrote:

Hi,
I'm currently using the elasticsearch-hadoop component to load data into my ES cluster. Currently, the ESOutputFormat
will accept a Map<Writable, Writable> or a Text that is already in JSON format. My question: Is there a performance
advantage to using one over the other?

Thanks,
Brian

--
Costin


(Brian Stempin) #3

It does, thanks.

Brian

On Fri, Mar 14, 2014 at 11:29 AM, Costin Leau costin.leau@gmail.com wrote:

Hey,

There is, but in the big picture it doesn't make a meaningful difference. If
the data is already in JSON format, es-hadoop can stream it directly
without doing any conversion. With a Map<Writable,Writable>, the map has to
be converted into JSON first - note that this conversion is quite
efficient and uses the same amount of memory no matter the number of
documents/maps.
Considering Hadoop's batch nature, I would not worry about choosing one over
the other for performance but rather focus on ease of use.

If the data is in JSON or you want ultimate control over what data is sent
to Elasticsearch, then JSON is the way to go - the data is streamed as is.
If you don't use JSON and have data in various formats readable through
Hadoop, then pick the Map<Writable,Writable> - it gives you maximum
interoperability and you don't have to worry about transforming data into
an intermediate format.

Hope this helps,

--
Costin

