Is JSON too verbose for high performance?


(Alex) #1

JSON is quite verbose. Google introduced Protocol Buffers many years ago, and one of the key features was to reduce the number of bytes that need to go "over the wire". Newer attempts to optimize for big data use formats like Parquet (parquet.apache.org), and there are even Thrift bindings for Parquet now. It seems a step backwards to go back to JSON. Am I missing something basic here?

I suppose that I could push (using a Kafka publisher) to a server (that subscribes to the relevant Kafka topic) using a compressed format, and then do the expansion to JSON just before inserting into Elasticsearch. Can you comment on the best way to take data from servers distributed worldwide and get it into Elasticsearch?
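The compress-before-publish idea above can be sketched without any Kafka machinery. Here is a minimal Python sketch (the document shape is invented, and the surrounding producer/consumer calls are left out) of the gzip-then-expand roundtrip that would wrap the publish on one side and the Elasticsearch insert on the other:

```python
import gzip
import json

def to_wire(doc: dict) -> bytes:
    """Producer side: serialize to compact JSON, then gzip before publishing."""
    raw = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw)

def from_wire(payload: bytes) -> dict:
    """Consumer side: expand back to a dict just before the Elasticsearch insert."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))

doc = {"host": "eu-west-1a", "level": "INFO", "msg": "request served", "took_ms": 12}
wire = to_wire(doc)
assert from_wire(wire) == doc  # lossless roundtrip
```

Note that Kafka clients can typically do this transparently; if I recall correctly, kafka-python's `KafkaProducer` accepts a `compression_type="gzip"` argument, so the explicit wrapping above may be unnecessary in practice.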


(Mark Walkom) #2

My question back to you would be, are you having issues now, and what does your setup look like?


(Jörg Prante) #3

You are invited to revive the discontinued Thrift transport plugin https://github.com/elastic/elasticsearch-transport-thrift and demonstrate how it gains more performance over JSON.

Some notes:

  • JSON is only used for the HTTP API
  • ES uses a fast built-in binary protocol between nodes, which is not JSON but already binary encoded and compressed
  • ES/Lucene uses a lot of compression in the storage backend to speed up I/O, which dominates overall performance; there is no JSON at this system level
  • ES offers alternatives, CBOR https://tools.ietf.org/html/rfc7049 and SMILE http://wiki.fasterxml.com/SmileFormat, e.g. to store cluster state
  • the possible performance gain from switching serialization formats must be balanced against the readability of the transmitted data for client apps and users, who are used to a data representation language like JSON
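To make the CBOR/SMILE bullet concrete: switching the body encoding of an HTTP request is only a matter of changing the encoder and the Content-Type header, since Elasticsearch selects its parser from that header. A minimal Python sketch (the endpoint, index name, and helper function are hypothetical; the request is built but never sent):

```python
import json
import urllib.request

def build_index_request(url: str, body: bytes, content_type: str) -> urllib.request.Request:
    """Hypothetical helper: build an index request whose body encoding is
    declared by Content-Type; a CBOR or SMILE body would reuse this unchanged."""
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )

doc = {"user": "alex", "msg": "hello"}
req = build_index_request(
    "http://localhost:9200/logs/_doc",   # hypothetical local node and index
    json.dumps(doc).encode("utf-8"),
    "application/json",                  # swap for "application/cbor" + a CBOR body
)
assert req.get_header("Content-type") == "application/json"
```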

(Martin Grigorov) #4

Hi,

On Tue, Nov 3, 2015 at 9:56 AM, Jörg Prante wrote:

> ES uses a fast builtin binary protocol between nodes which is not JSON but already binary encoded and compressed

+1!

I think the best solution would be to use content negotiation. The client may send an "Accept" request header with the list of supported/preferred response content types. If any of them is supported (even by a plugin), serve the best match; otherwise fall back to JSON.
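The proposed negotiation can be sketched in a few lines. This is a minimal stand-in (the supported-type set is an assumption, and quality values like "q=0.9" and wildcards are deliberately ignored): pick the first type in the Accept header that the server supports, otherwise fall back to JSON.

```python
# Hypothetical server-side set; a plugin could extend it at runtime.
SUPPORTED = {"application/json", "application/cbor", "application/smile"}

def negotiate(accept_header: str, supported=SUPPORTED) -> str:
    """Return the first acceptable media type the server supports, else JSON."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in supported:
            return media_type
    return "application/json"  # fallback

assert negotiate("application/cbor, application/json") == "application/cbor"
assert negotiate("application/x-thrift") == "application/json"
```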


(Alex) #5

Thanks for all the responses.

@warkolm I am not having issues now. I'm trying to select a set of tools to use for our new project.

@jprante I particularly like the idea of content negotiation with Accept headers. This approach will allow me to start out with JSON and migrate to a more compact format if the need arises.

@mgrigorov If the need arises for something more compact than JSON, I will consider reviving the discontinued Thrift plugin. Was it abandoned because there was no measurable performance improvement?


(Jillesvangurp) #6

There's a binary version of JSON, supported by MongoDB, called BSON. There's an add-on project for Jackson that adds BSON support and lets you easily parse and serialize BSON: https://github.com/michel-kraemer/bson4jackson. I've actually integrated it recently into my jsonj library (which uses Jackson). But I don't recommend using it unless you really need it (e.g. to integrate with MongoDB).

There are a few counter-intuitive things about BSON. If size is your main concern: don't use it; you'll typically use up more space. If parsing overhead is your main concern, the biggest penalty is mapping to in-memory object structures, and this doesn't change a lot between BSON and JSON. Jackson includes a streaming JSON parser which is pretty much as fast as it gets (for JSON). As far as I can measure, BSON support in Jackson is more verbose and not a whole lot faster, which is why I recommend not using it.

As far as I understand the Elasticsearch architecture, it uses a streaming JSON parser. I don't think performance would differ hugely if it used BSON (it might actually degrade a little). Much of the benefit of dedicated binary protocols comes from their reduced size, and you can probably get similar improvements by simply enabling gzip compression (which ES supports). Binary protocols, on the other hand, are difficult to deal with when it comes to evolving the API, adding new features, and using the API over HTTP.

Elasticsearch avoids a lot of parsing and serialization overhead. For example, it doesn't reconstruct in-memory trees for your documents in most cases, and the way it fetches and stores document JSON is already pretty efficient. Also, given that it is a JSON document store, it would be kind of weird not to have a JSON-based API.

It is true that parser overhead is typically lower for binary protocols. However, parsing overhead is unlikely to be a big factor in overall ES performance for most setups. This is why, for example, Logstash recommends the HTTP protocol as of 2.0: the advantage of using an embedded node and its internal binary protocol is just not enough to justify the complexity of that solution.
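The gzip point above is easy to demonstrate: on repetitive log-style JSON, where the same field names recur in every document, compression recovers most of the textual overhead that binary formats aim to eliminate. A stdlib-only sketch (the document shape is invented for illustration):

```python
import gzip
import json

# A batch of repetitive log-style documents: field names repeat 1000 times.
docs = [{"timestamp": i, "level": "INFO", "host": "web-01",
         "message": "request handled"} for i in range(1000)]

raw = json.dumps(docs).encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))
# On a repetitive batch like this, gzip typically shrinks the payload
# by well over 80%, without giving up a human-readable wire format.
assert len(compressed) < len(raw) // 5
```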


(Alex) #7

@jillesvangurp
Thanks for all the replies.
The project is new and I'm not concerned about performance yet.
I did appreciate the answer I got; I now understand better what is going on "underneath the covers".


(Taowen) #8

Thrift is not necessarily faster, and JSON is not that slow either: https://www.codeproject.com/Articles/1165627/Jsoniter-JSON-is-faster-than-thrift-avro
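A quick way to sanity-check claims like this against your own payloads is a micro-benchmark harness; this sketch uses the standard-library json module as a stand-in (jsoniter, Thrift, etc. would slot into the same shape), with an invented document:

```python
import json
import timeit

doc = {"user": "alex", "tags": ["a", "b", "c"], "count": 42}

# Time 100k encode and encode+decode roundtrips; absolute numbers are
# machine-dependent, so only compare codecs measured on the same box.
encode_s = timeit.timeit(lambda: json.dumps(doc), number=100_000)
roundtrip_s = timeit.timeit(lambda: json.loads(json.dumps(doc)), number=100_000)
print(f"encode: {encode_s:.3f}s  roundtrip: {roundtrip_s:.3f}s per 100k ops")
```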

