Is JSON too verbose for high performance?


(Alex) #1

JSON is quite verbose. Google introduced Protocol Buffers many years ago, and one of the key features was to reduce the number of bytes that need to go "over the wire". Newer attempts to optimize for big data use formats like Parquet (parquet.apache.org), and there are even Thrift bindings for Parquet now. It seems a step backwards to go back to JSON. Am I missing something basic here?

I suppose that I could push (using a Kafka publisher) to a server (that subscribes to the relevant Kafka topic) using a compressed format, and then do the expansion to JSON just before inserting into Elasticsearch. Can you comment on the best way to take data from servers distributed worldwide and get it into Elasticsearch?
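The compress-before-publish idea above can be sketched without any Kafka machinery. Here is a minimal Python sketch (the document shape is invented, and the surrounding producer/consumer calls are left out) of the gzip-then-expand roundtrip that would wrap the publish on one side and the Elasticsearch insert on the other:

```python
import gzip
import json

def to_wire(doc: dict) -> bytes:
    """Producer side: serialize to compact JSON, then gzip before publishing."""
    raw = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)

def from_wire(payload: bytes) -> dict:
    """Consumer side: expand back to a dict just before the Elasticsearch insert."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))

doc = {"host": "eu-west-1a", "level": "INFO", "msg": "request served", "took_ms": 12}
wire = to_wire(doc)
assert from_wire(wire) == doc  # lossless roundtrip
```

Note that Kafka clients can typically do this transparently; if I recall correctly, kafka-python's `KafkaProducer` accepts a `compression_type="gzip"` argument, so the explicit wrapping above may be unnecessary in practice.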


(Mark Walkom) #2

My question back to you would be, are you having issues now, and what does your setup look like?


(Jörg Prante) #3

You are invited to revive the discontinued Thrift transport plugin https://github.com/elastic/elasticsearch-transport-thrift and demonstrate how it gains more performance over JSON.

Some notes:

  • JSON is only used for the HTTP API
  • ES uses a fast built-in binary protocol between nodes, which is not JSON but already binary encoded and compressed
  • ES/Lucene uses a lot of compression in the storage backend to speed up I/O, which dominates overall performance; there is no JSON at this system level
  • ES offers alternatives, CBOR https://tools.ietf.org/html/rfc7049 and SMILE http://wiki.fasterxml.com/SmileFormat, e.g. to store cluster state
  • the possible performance gain from switching serialization formats must be balanced against the readability of the transmitted data for client apps and users, who are used to a data representation language like JSON
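To make the CBOR/SMILE bullet concrete: switching the body encoding of an HTTP request is only a matter of changing the encoder and the Content-Type header, since Elasticsearch selects its parser from that header. A minimal Python sketch (the endpoint, index name, and helper function are hypothetical; the request is built but never sent):

```python
import json
import urllib.request

def build_index_request(url: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Hypothetical helper: build an index request whose body encoding is
    declared by Content-Type; a CBOR or SMILE body would reuse this unchanged."""
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )

doc = {"user": "alex", "msg": "hello"}
req = build_index_request(
    "http://localhost:9200/logs/_doc",   # hypothetical local node and index
    json.dumps(doc).encode("utf-8"),
    "application/json",                  # swap for "application/cbor" + a CBOR body
)
assert req.get_header("Content-type") == "application/json"
```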

(Martin Grigorov) #4

Hi,

On Tue, Nov 3, 2015 at 9:56 AM, Jörg Prante wrote:

> ES uses a fast builtin binary protocol between nodes which is not JSON but already binary encoded and compressed

+1!

I think the best solution would be to use content negotiation. The client may send an "Accept" request header with the list of supported/preferred response content types. If any of them is supported (even by a plugin), serve the best match; otherwise fall back to JSON.
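The proposed negotiation can be sketched in a few lines. This is a minimal stand-in (the supported-type set is an assumption, and quality values like "q=0.9" and wildcards are deliberately ignored): pick the first type in the Accept header that the server supports, otherwise fall back to JSON.

```python
# Hypothetical server-side set; a plugin could extend it at runtime.
SUPPORTED = {"application/json", "application/cbor", "application/smile"}

def negotiate(accept_header: str, supported=SUPPORTED) -> str:
    """Return the first acceptable media type the server supports, else JSON."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in supported:
            return media_type
    return "application/json"  # fallback

assert negotiate("application/cbor, application/json") == "application/cbor"
assert negotiate("application/x-thrift") == "application/json"
```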


(Alex) #5

Thanks for all the responses.

@warkolm I am not having issues now. I'm trying to select a set of tools to use for our new project.

@jprante I particularly like the idea of content negotiation with Accept headers. This approach will allow me to start out with JSON and migrate to a more compact format if the need arises.

@mgrigorov If the need arises for something more compact than JSON, I will consider reviving the discontinued Thrift plugin. Was it abandoned because there was no measurable performance improvement?


(Jillesvangurp) #6

There's a binary version of JSON, supported by MongoDB, called BSON. There's an add-on project for Jackson that adds BSON support and lets you easily parse and serialize BSON: https://github.com/michel-kraemer/bson4jackson. I've actually integrated it recently into my jsonj library (which uses Jackson). But I don't recommend using it unless you really need it (e.g. to integrate with MongoDB).

There are a few counter-intuitive things about BSON. If size is your main concern: don't use it; you'll typically use up more space. If parsing overhead is your main concern, the biggest penalty is mapping to in-memory object structures, and this doesn't change a lot between BSON and JSON. Jackson includes a streaming JSON parser which is pretty much as fast as it gets (for JSON). As far as I can measure, BSON support in Jackson is more verbose and not a whole lot faster, which is why I recommend not using it.

As far as I understand the Elasticsearch architecture, it uses a streaming JSON parser. I don't think performance would differ hugely if it used BSON (it might actually degrade a little). Much of the benefit of dedicated binary protocols comes from their reduced size, and you can probably get similar improvements by simply enabling gzip compression (which ES supports). Binary protocols, on the other hand, are difficult to deal with when it comes to evolving the API, adding new features, and using the API over HTTP.

Elasticsearch avoids a lot of parsing and serialization overhead. For example, it doesn't reconstruct in-memory trees for your documents in most cases, and the way it fetches and stores document JSON is already pretty efficient. Also, given that it is a JSON document store, it would be kind of weird not to have a JSON-based API.

It is true that parser overhead is typically lower for binary protocols. However, parsing overhead is unlikely to be a big factor in overall ES performance for most setups. This is why, for example, Logstash recommends the HTTP protocol as of 2.0: the advantage of using an embedded node and its internal binary protocol is just not enough to justify the complexity of that solution.
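The gzip point above is easy to demonstrate: on repetitive log-style JSON, where the same field names recur in every document, compression recovers most of the textual overhead that binary formats aim to eliminate. A stdlib-only sketch (the document shape is invented for illustration):

```python
import gzip
import json

# A batch of repetitive log-style documents: field names repeat 1000 times.
docs = [{"timestamp": i, "level": "INFO", "host": "web-01",
         "message": "request handled"} for i in range(1000)]

raw = json.dumps(docs).encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
# On a repetitive batch like this, gzip typically shrinks the payload
# by well over 80%, without giving up a human-readable wire format.
assert len(compressed) < len(raw) // 5
```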


(Alex) #7

@jillesvangurp
Thanks for all the replies.
The project is new and I'm not concerned about performance yet.
I did appreciate the answer I got; I now understand better what is going on "underneath the covers".


(Taowen) #8

Thrift is not necessarily faster, and JSON is not that slow either: https://www.codeproject.com/Articles/1165627/Jsoniter-JSON-is-faster-than-thrift-avro
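A quick way to sanity-check claims like this against your own payloads is a micro-benchmark harness; this sketch uses the standard-library json module as a stand-in (jsoniter, Thrift, etc. would slot into the same shape), with an invented document:

```python
import json
import timeit

doc = {"user": "alex", "tags": ["a", "b", "c"], "count": 42}

# Time 100k encode and encode+decode roundtrips; absolute numbers are
# machine-dependent, so only compare codecs measured on the same box.
encode_s = timeit.timeit(lambda: json.dumps(doc), number=100_000)
roundtrip_s = timeit.timeit(lambda: json.loads(json.dumps(doc)), number=100_000)
print(f"encode: {encode_s:.3f}s  roundtrip: {roundtrip_s:.3f}s per 100k ops")
```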

