Bulk API without repeating the index string - performance optimisation

Petr.Simik · October 21, 2020, 11:20am

Hi,
is there a way to optimise the bulk insert?
The valueless {"index":{}} lines increase the size of the bulk request and cost extra resources.
The example below is small, but in our case we want to write a huge number of requests per second, and this adds extra cost that I would like to avoid.

Can I get from this:

POST testindex/_bulk
{"index":{}}
{"cost": "11", "price": "1"}
{"index":{}}
{"cost": "12", "price": "2"}
{"index":{}}
{"cost": "13", "price": "3"}

to something like this?

POST testindex/_bulk
{"cost": "11", "price": "1"}
{"cost": "12", "price": "2"}
{"cost": "13", "price": "3"}

thank you

Christian_Dahlqvist · October 21, 2020, 11:27am

No, you cannot. I seriously doubt this would make any noticeable difference even if it were possible.

Petr.Simik · October 21, 2020, 11:42am

Thank you,
but it makes a serious difference.
At an indexing rate of 1 million events/sec,
adding 12 bytes to every message means an extra payload of 12 * 1 million bytes (about 12 MB) every second, and we could save it by removing this irrelevant string.
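
As a rough check of the arithmetic above, here is a minimal Python sketch. The function name and the constant are made up for illustration. It counts the newline that terminates every line of a bulk body, so it arrives at 13 bytes per action line rather than 12.

```python
# Empty index action line, including the terminating newline
# that every line of a bulk body needs.
ACTION_LINE = b'{"index":{}}\n'


def overhead_bytes_per_second(events_per_second: int,
                              action_line: bytes = ACTION_LINE) -> int:
    """Bytes per second spent on action lines alone (hypothetical helper)."""
    return events_per_second * len(action_line)


rate = 1_000_000  # events per second, the figure quoted in the post
extra = overhead_bytes_per_second(rate)
print(f"{len(ACTION_LINE)} bytes per action line")
print(f"{extra / 1e6:.1f} MB/s of action lines at {rate:,} events/s")
```

This only sizes the uncompressed payload. It says nothing about whether those bytes are the bottleneck.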

Christian_Dahlqvist · October 21, 2020, 12:20pm

I seriously doubt this, given the amount of processing that indexing and mapping require.

The format requires these lines and you cannot remove them, but you could instead add e.g. spaces to the line to make it longer and measure the resulting slowdown in indexing throughput. My guess is that there will be virtually no measurable, repeatable difference.
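
One way to set up that experiment is sketched below in Python. The helper `build_bulk_body` and its `pad` parameter are hypothetical. The padding is placed inside the empty action object, where JSON permits insignificant whitespace, but whether Elasticsearch accepts it unchanged is an assumption that should be verified on a test cluster.

```python
import json


def build_bulk_body(docs, pad=0):
    """Build an NDJSON bulk body with one index action line per document.

    pad: number of spaces inserted inside the empty action object
         (hypothetical knob, used only to inflate the action line).
    """
    action = '{"index":{' + " " * pad + "}}"
    lines = []
    for doc in docs:
        lines.append(action)
        lines.append(json.dumps(doc))
    # The bulk format is newline-delimited and ends with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


docs = [
    {"cost": "11", "price": "1"},
    {"cost": "12", "price": "2"},
    {"cost": "13", "price": "3"},
]
plain = build_bulk_body(docs)
padded = build_bulk_body(docs, pad=100)
print(len(plain), len(padded))
```

Sending `plain` and `padded` bodies at the same document rate and comparing indexing throughput would show whether action line size matters in a given setup.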

Petr.Simik · October 21, 2020, 1:29pm

Good idea, we will check this.

Petr_Hrabal · October 21, 2020, 2:37pm

There are multiple ways this affects performance.
Bulk inserts are larger, so
more network bandwidth is consumed,
more RAM is required when unpacking (don't tell me that unpacking 1 million JSONs/sec does not cost a lot).
Also, you can change the parser when you have the index name and operation in the URL: you can then just process the data elements in the inner loop, possibly with the index pattern in a local cache. My experience is from different languages, but you can get a major performance improvement, very often a factor of 5 or more.

Christian_Dahlqvist · October 21, 2020, 3:43pm

I would expect the size of the header line to be small compared to the size of the document, so I would expect it to make little difference. There are a lot of other factors that will have a much larger impact on indexing throughput, so I suspect you are optimising the wrong thing. I would prefer seeing evidence that it matters and makes a difference rather than trying to reason about it, as a large number of factors contribute.

When it comes to indexing throughput, it is quite often the speed of the storage that is the limiting factor, so the number of compute cycles spent might not matter that much or even be noticeable.

Mark_Harwood · October 21, 2020, 4:06pm

Yes, this does all seem verbose, but the pragmatic reason for the design is that coordinating nodes do not need to get bogged down doing the most expensive job of parsing the JSON content of docs.

The format of instruction, followed by doc content, followed by instruction, followed by doc content allows an important optimisation - the coordinating node routing docs to data nodes just parses every other line.
This results in a nice speed-up - the same way the post office can route mail more efficiently if it doesn't have to open the contents of envelopes. It just reads the routing info on the outside.
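
The idea can be illustrated with a small Python sketch. This is not Elasticsearch's actual implementation. The function name `split_bulk` is made up, and the sketch assumes every action is followed by a document line, which holds for index and create but not for delete.

```python
import json


def split_bulk(body: bytes):
    """Yield (action, raw_doc) pairs, parsing only the action lines."""
    lines = body.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    for i in range(0, len(lines) - 1, 2):
        action = json.loads(lines[i])  # small routing line, cheap to parse
        raw_doc = lines[i + 1]         # left unparsed, forwarded as opaque bytes
        yield action, raw_doc


body = (
    b'{"index":{}}\n{"cost": "11", "price": "1"}\n'
    b'{"index":{}}\n{"cost": "12", "price": "2"}\n'
    b'{"index":{}}\n{"cost": "13", "price": "3"}\n'
)
pairs = list(split_bulk(body))
for action, raw_doc in pairs:
    print(action, len(raw_doc), "bytes forwarded unparsed")
```

The document bytes never go through `json.loads` here, which is the envelope analogy in code form.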

Petr_Hrabal · October 21, 2020, 4:37pm

Well, we have a large ingest of documents with an average size of about 110 bytes, so a header of 16 bytes is significant.
Moving the index name to the URL brought a significant improvement in network utilization.
Still having to prepend a nearly empty object '{"index": {}}\n' seems a bit useless...
We are loading into about 10 Elastic load balancers, but there is only one producer. We are doing up to 0.5M events/sec per producer, and network utilization is a significant issue. Compression is of course an option, but it also has its own cost.
Yes, it's not a catastrophic issue and you can work around it, but if you can move the index name to the URL, why couldn't you move the operation type there as well?
Actually, all values in this object are optional, but not the object itself...
So why not go the whole way?
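
For anyone who wants to put numbers on this, here is a minimal Python sketch that measures the share of a bulk body taken up by action lines, before and after gzip. The documents are synthetic with made-up fields of roughly the size mentioned above, so the printed figures are only illustrative and real data should be substituted. The body without action lines is not a valid bulk request and serves only as a size comparison.

```python
import gzip
import json

ACTION = b'{"index":{}}\n'


def bulk_body(n, with_actions=True):
    """Build n synthetic documents, optionally preceded by action lines."""
    out = bytearray()
    for i in range(n):
        # Made-up fields; "note" pads the document to roughly 110 bytes.
        doc = {"id": i, "cost": str(i % 100), "price": str(i % 7), "note": "x" * 60}
        if with_actions:
            out += ACTION
        out += json.dumps(doc).encode("utf-8") + b"\n"
    return bytes(out)


n = 10_000
with_lines = bulk_body(n)
without_lines = bulk_body(n, with_actions=False)  # size comparison only

raw_share = 1 - len(without_lines) / len(with_lines)
gz_with = len(gzip.compress(with_lines))
gz_without = len(gzip.compress(without_lines))
gz_share = 1 - gz_without / gz_with

print(f"raw:  {len(with_lines)} bytes, action lines are {raw_share:.1%}")
print(f"gzip: {gz_with} bytes, action lines add {gz_share:.1%}")
```

Compression shrinks the payload at the cost of CPU on both ends, so for a CPU-bound producer the trade-off has to be measured rather than assumed.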

Petr_Hrabal · October 21, 2020, 4:42pm

Well, the fun part is that if you move the operation to the header, and you already have the index name there, you can actually skip parsing them on the coordinating node, possibly with some parameter to indicate to the parser that there are no instruction JSONs.

Edit: yes, I know you can't use it every time, but having it as an option for fast loads sounds pretty useful.

Petr_Hrabal · October 21, 2020, 5:19pm

I forgot to respond to your last comment. We are significantly CPU bound. Yes, there are dips that might be connected to IO, but most of the time we are really bound by CPU (probably caused by the large number of small messages).
