ElasticSearch built-in Jackson stream parser is fastest way to extract fields


(Brian Yoder) #1

Just an FYI... Start with, for example, the following JSON document (all on
one line for the _bulk API, but pretty printed below). This follows my
basic document structure: An array of field names, with each of those
fields taking either a single value or an array of heterogeneous values.
Nothing more complex than a Map<String,Object> can represent, in which
Object is either a single type (String, Boolean, and so on) or an
Array. A subset of the "throw any JSON document into ES", but still
a very useful subset that far exceeds any database engine I've ever used:

{
"_index" : "twitter" ,
"_type" : "tweet" ,
"_id" : "3" ,
"_score" : 1.0 ,
"_source" : {
"user" : "bbthing68" ,
"postDate" : "2012-11-15T14:12:12" ,
"altitude" : 45767 ,
"dst" : true ,
"prefix" : null ,
"counts" : [ 1 , 2 , 3.14149 , "11.1" , "13" ] ,
"vdst" : [ true , false , true ] ,
"message" : [ 2 , "Just trying this out" , "With one/two multivalued fields" ]
}
}

Both the SearchHit.getSourceAsString and the GetResponse.getSourceAsString
methods return the following JSON string (again, it's on one line, but it's
pretty printed here only for this post):

{
"user" : "bbthing68" ,
"postDate" : "2012-11-15T14:12:12" ,
"altitude" : 45767 ,
"dst" : true ,
"prefix" : null ,
"counts" : [ 1 , 2 , 3.14149 , "11.1" , "13" ] ,
"vdst" : [ true , false , true ] ,
"message" : [ 2 , "Just trying this out" , "With one/two multivalued fields" ]
}

I was using the getSourceAsMap methods, which return a Map<String,Object>.
But when I use the JsonParser in stream parsing mode (as supplied directly
by ElasticSearch; no need to fetch the full Jackson jar file), I can
directly stream parse that source so very much faster. My overall
response times are now much lower. And it's also much easier and faster for
me to just parse the source and pull out only the subset of the fields I
want instead of trying to tell ES which subset of fields I want.

Oh, and when I store the fields from my stream parsing process, I put them
into a LinkedHashMap<String,Object>. That little bit of overhead keeps the
keys (field names) in the exact same order as they appear in the source.
Which is really awesomely cool. No more jumbled order of field names when
displaying results during testing!
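
To illustrate the ordering point, here is a minimal sketch that uses only the JDK. The class and method names (FieldOrderDemo, keysInInsertionOrder) are made up for this post and are not part of any ES or Jackson API. A LinkedHashMap iterates its keys in the order they were inserted, so if fields are stored as the parser encounters them, they come back out in source order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldOrderDemo {
    /** Stores the field names in a LinkedHashMap and returns the keys in iteration order. */
    static List<String> keysInInsertionOrder(String... fieldNames) {
        Map<String, Object> fields = new LinkedHashMap<>();
        for (String name : fieldNames) {
            fields.put(name, null); // the values do not matter for this demonstration
        }
        return new ArrayList<>(fields.keySet());
    }

    public static void main(String[] args) {
        // Field names taken from the sample _source above, in source order.
        System.out.println(keysInInsertionOrder(
                "user", "postDate", "altitude", "dst", "prefix", "counts", "vdst", "message"));
        // prints [user, postDate, altitude, dst, prefix, counts, vdst, message]
    }
}
```

A plain HashMap makes no such guarantee, which is where the jumbled field order comes from.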

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

To shed some light on this, the code behind getSourceAsMap() uses some
format detection and decompressing in XContentHelper. Did you disable
source compression? It is enabled by default.

Jörg

On 12.03.13 16:26, InquiringMind wrote:

I was using the getSourceAsMap methods, which return a
Map<String,Object>. But when I use the JsonParser in stream parsing
mode (as supplied directly by ElasticSearch; no need to fetch the full
Jackson jar file), I can directly stream parse that source /so very
much faster/



(Brian Yoder) #3

I completely ignored any settings related to source compression, letting it
default to whatever value it has in 19.4, 19.10, and 20.4 (the ES versions
I've used).

I originally prototyped my stream parsing of getSourceAsString because I
wanted the field order preserved using a LinkedHashMap.

To my surprise, my stream parser is much faster.

Of course, I use the JsonParser.nextValue method, not its nextToken method,
for my stream parsing implementation. It seems to greatly simplify the
parsing code, and Jackson is rather fast. Because ES only made the stream
parser available, I was more or less forced to stream-parse instead of
relying on Jackson's Tree Model or OJM. I don't know whether they have any
performance overhead relative to what I have to do when stream parsing, but
I've adapted and it works very well.
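
For readers wondering what a nextValue-based loop might look like, here is a rough sketch. It is not the code described in this post. It assumes the standalone jackson-core jar (2.2 or later, package com.fasterxml.jackson.core); a copy bundled inside Elasticsearch may live under a different package name. The class name SourceFieldExtractor and its methods are hypothetical. It only handles the flat structure described in this thread: each field is a scalar or an array of scalars.

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class SourceFieldExtractor {
    private static final JsonFactory FACTORY = new JsonFactory();

    /** Stream-parses a flat _source object and keeps only the wanted fields, in source order. */
    static Map<String, Object> extract(String sourceJson, Set<String> wanted) throws IOException {
        Map<String, Object> fields = new LinkedHashMap<>();
        try (JsonParser parser = FACTORY.createParser(sourceJson)) {
            if (parser.nextToken() != JsonToken.START_OBJECT) {
                throw new IOException("expected a JSON object");
            }
            // nextValue() steps over the FIELD_NAME token and lands on the value;
            // the field name is then available from getCurrentName().
            JsonToken token;
            while ((token = parser.nextValue()) != null && token != JsonToken.END_OBJECT) {
                String name = parser.getCurrentName();
                if (!wanted.contains(name)) {
                    parser.skipChildren(); // no-op for scalars, skips a whole array or object
                    continue;
                }
                fields.put(name, token == JsonToken.START_ARRAY ? readArray(parser) : scalar(parser, token));
            }
        }
        return fields;
    }

    /** Reads the elements of an array of scalars; the parser must be positioned on START_ARRAY. */
    private static List<Object> readArray(JsonParser parser) throws IOException {
        List<Object> values = new ArrayList<>();
        JsonToken token;
        while ((token = parser.nextValue()) != null && token != JsonToken.END_ARRAY) {
            values.add(scalar(parser, token));
        }
        return values;
    }

    /** Converts the current scalar token to a Java value; nested structures are rejected. */
    private static Object scalar(JsonParser parser, JsonToken token) throws IOException {
        switch (token) {
            case VALUE_STRING:       return parser.getText();
            case VALUE_NUMBER_INT:   return parser.getLongValue();
            case VALUE_NUMBER_FLOAT: return parser.getDoubleValue();
            case VALUE_TRUE:         return Boolean.TRUE;
            case VALUE_FALSE:        return Boolean.FALSE;
            case VALUE_NULL:         return null;
            default: throw new IOException("unsupported token: " + token);
        }
    }
}
```

Passing the string from getSourceAsString() together with a set such as {"user", "counts"} would return a map holding just those two fields, in the order they appear in the source. Fields that are not wanted are skipped without building any objects for them.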



(Jörg Prante) #4

There is almost nothing faster in Java JSON parsing than Jackson in
streaming mode since it uses a highly optimized parser. Note, if you
just use plain JSON (and not SMILE or compressed JSON) you can add
Jackson libs to your project and also use TreeModel or ObjectMapper.
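
As a small illustration of those two alternatives, here is a sketch that assumes the standalone jackson-databind jar (and its jackson-core and jackson-annotations dependencies) is on the classpath. The class name SourceBinding and its method names are hypothetical.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class SourceBinding {
    // ObjectMapper is thread-safe once configured, so one shared instance is enough.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    /** Tree Model: parse the whole document into JsonNode objects and navigate by field name. */
    static String userViaTree(String sourceJson) throws IOException {
        JsonNode root = MAPPER.readTree(sourceJson);
        return root.path("user").asText(); // path() yields a "missing" node instead of null
    }

    /** Data binding: let ObjectMapper build an insertion-ordered map of the whole document. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> asOrderedMap(String sourceJson) throws IOException {
        return MAPPER.readValue(sourceJson, LinkedHashMap.class);
    }
}
```

Both approaches materialize the entire document in memory, whereas a streaming parser can skip the fields it does not need.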

Jörg



(Klaus Brunner) #5

On Tuesday, 12 March 2013 16:26:47 UTC+1, InquiringMind wrote:

I was using the getSourceAsMap methods, which return a Map<String,Object>.
But when I use the JsonParser in stream parsing mode (as supplied directly
by ElasticSearch; no need to fetch the full Jackson jar file), I can
directly stream parse that source so very much faster. My overall
response times are now much lower. And it's also much easier and faster for
me to just parse the source and pull out only the subset of the fields I
want instead of trying to tell ES which subset of fields I want.

Sounds interesting, but I'm not quite sure what you're doing (and why it's
faster). Do you get a BytesReference via SearchHit.sourceRef() first, and
then let a JsonParser operate on the return value of its .streamInput()?

I currently use SearchHit.source() - which returns a byte[] via
BytesReference.bytes() - and then parse that with Jackson, but if there's a
way to save an array copy in the whole process, it would be nice. Needs to
be compression-safe though.

Klaus



(Brian Yoder) #6

Sounds interesting, but I'm not quite sure what you're doing (and why it's
faster).

It's almost 3 times as fast as using the getSourceAsMap method.

I had tried to limit and extract by individual field back when I started
with 19.4, but just couldn't get it to work. I've since mastered enough of
the index settings and mappings, but have stayed with extracting the
_source.

Do you get a BytesReference via SearchHit.sourceRef() first, and then let
a JsonParser operate on the return value of its .streamInput()?

No, I get a String via SearchHit.getSourceAsString() first, and then let a
JsonParser operate on it. I wasn't sure what the BytesReference was:
Compressed? Not compressed? UTF-8? It didn't seem deterministic enough for
me based on the documentation. And when parsing the source as either a
String (or as the originally used Map<String,Object> returned via
getSourceAsMap), I never had any problems. Even all of my Chinese
characters came out perfectly.

I currently use SearchHit.source() - which returns a byte[] via
BytesReference.bytes() - and then parse that with Jackson, but if there's a
way to save an array copy in the whole process, it would be nice. Needs to
be compression-safe though.

I agree. I just wasn't sure about what I might need to do with compression.
I assume that getSourceAsString already knows what to do based on how the
_source was stored. At least, that's been my experience so far. And when
migrating from 19.4 to 19.10 and now to 20.4, I've seen no issues at all
with 20.4 using databases built with 19.4 and updated with 19.4, 19.10, and
20.4. Migration has been painless and smooth.



(Brian Yoder) #7

One more thing I forgot to mention: When I parse the JSON from
getSourceAsString myself, I store the fields I want into a LinkedHashMap
instead of a simple HashMap.

That tiny amount of overhead ensures that I can iterate across the map and
show the fields in the exact same order that they are stored in the
_source. Using the map returned by the getSourceAsMap method, the fields
are jumbled and it makes it more difficult to verify results by eye.

And a small correction to the last paragraph of my previous post: I've seen
no issues at all with 20.4 using databases built with ES 19.4 and updated
with ES versions 19.4, 19.10, and 20.4. Migration has been painless and
smooth.



(Swati Jain) #8

Hi Brian,

I am new to Elasticsearch and am currently working with version 1.5.1. I
am trying to parse a JSON string returned by the getSourceAsString() method
into an object. The JsonParser class in Elasticsearch (import
org.elasticsearch.common.jackson.core.JsonParser) doesn't allow me to
create an instance of it so that I can use its readValueAs(class name)
method.

I would like to use the Elasticsearch APIs as much as I can instead of
including the Jackson jar separately in my code. Can you please show me the
JsonParser code you used to convert a string into an object?

Thanks,
Swati



(Brian Yoder) #9

Swati,

Well, I tend not to use the built-in Jackson parser anymore. The only
advantage I've seen to stream parsing is that I can dynamically adapt to
different objects in my own code. But I can't release the code since it's
owned by my employer. And for most tasks these days, I use the Jackson jar
files and the data binding model. By the way, here are the only additional
JAR files that I use, alongside the Elasticsearch jars, in my
Elasticsearch-based tools:

For the full Jackson support. There are later versions but these work for
now until the rest of the company moves to Java 8:

jackson-annotations-2.2.3.jar
jackson-core-2.2.3.jar
jackson-databind-2.2.3.jar

This gives me the full Netty server (got tired of looking for it buried
inside ES, and found this to be very simple and easy to use). Again, there
are later versions but this one works well enough:

netty-3.5.8.Final.jar

And this is the magic that brings Netty to life. My front end simply
publishes each incoming Netty MessageEvent to the LMAX Disruptor ring
buffer. Then I can predefine a fixed number of background WorkHandler
threads to consume the MessageEvent objects, handling each one and
responding back to its client. No matter how much load is slammed into the
front end, the number of Netty threads stays small since they only publish
and they're done. And so, the total thread count stays small even when
intense bursts of clients slam the server:

disruptor-3.2.0.jar
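
The hand-off pattern described above can be sketched with plain JDK classes. To be clear, this is a stand-in and not the Disruptor or Netty API: an ArrayBlockingQueue plays the role of the ring buffer, a fixed array of threads plays the role of the WorkHandler pool, and strings play the role of MessageEvent objects. The class name FixedWorkerPool and its methods are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class FixedWorkerPool {
    // Sentinel compared by identity, so it can never collide with a real message.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> buffer;
    private final Thread[] workers;
    private final AtomicInteger handled = new AtomicInteger();

    /** Starts a fixed number of worker threads that consume from a bounded buffer. */
    FixedWorkerPool(int workerCount, int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
        workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(this::consume, "worker-" + i);
            workers[i].start();
        }
    }

    /** Called by the front-end threads: hand the message off and return. */
    void publish(String message) throws InterruptedException {
        buffer.put(message); // blocks only while the buffer is full
    }

    private void consume() {
        try {
            while (true) {
                String message = buffer.take();
                if (message == STOP) {
                    return;
                }
                handled.incrementAndGet(); // placeholder for handling the request and replying
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Queues one stop marker per worker, waits for all workers to finish, returns the handled count. */
    int shutdown() throws InterruptedException {
        for (int i = 0; i < workers.length; i++) {
            buffer.put(STOP);
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return handled.get();
    }
}
```

The number of worker threads is set once at construction and does not grow with load; a burst of incoming messages only fills the buffer.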

I hope this helps. I'd love to publish more details but this is about all I
can do for now.

Brian



(Swati Jain) #10

Thanks Brian for your reply. I have started using the jars that you
suggested here and they are easy to work with. Thank you.

Swati


