I wonder why bulk indexing is faster than single indexing.
I'm curious about this from Elasticsearch's own point of view, setting aside connection setup and teardown and network communication overhead.
With the default refresh interval, if there are 1,000 single-document index requests per second versus 1 bulk request (1,000 documents) per second, is the bulk request faster from Elasticsearch's point of view? I'm curious about the details.
Unless the bulk request is targeting a very large number of shards, a bulk request will generally result in a number of documents being written to a shard's transaction log at once. Every write is, as far as I know, synced to disk before a response is sent, so using bulk requests can reduce the number of IOPS and the amount of disk I/O. There may be other benefits of handling documents in bulk, but this one can have a big impact, as disk I/O is often a limiting factor.
Writes are fsynced when a flush occurs.
I mean, in the above example, why is bulk better for performance when the same number of documents is processed within one refresh interval?
I don't think performance will differ between indexing 1,000 documents one at a time within 1 second and indexing them as 1 bulk request (1,000 documents) within 1 second.
It also requires far fewer HTTP requests and responses.
No, that is not correct. Documents are written and fsynced to the transaction log irrespective of when the flush to create a new segment and make the documents searchable happens. I would therefore expect a not insignificant difference in disk I/O sync calls.
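To make the fsync argument concrete, here is a toy model in Python. It is not Elasticsearch's translog implementation; it only assumes a single log file that is fsynced once per request, and the helper name `write_with_fsync` is made up for illustration. It counts how many sync calls the same 1,000 documents cost when sent one per request versus in one bulk request.

```python
import os
import tempfile


def write_with_fsync(path, requests):
    """Append each request's documents to a log file, fsyncing once per request.

    `requests` is a list of requests; each request is a list of byte strings.
    Returns the number of fsync calls made. (Hypothetical helper, toy model only.)
    """
    fsync_calls = 0
    with open(path, "ab") as log:
        for docs in requests:
            for doc in docs:
                log.write(doc + b"\n")
            log.flush()             # push Python's buffer to the OS
            os.fsync(log.fileno())  # ask the OS to persist the data to disk
            fsync_calls += 1
    return fsync_calls


docs = [f'{{"id": {i}}}'.encode() for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    # 1,000 requests of one document each
    single_syncs = write_with_fsync(os.path.join(tmp, "single.log"), [[d] for d in docs])
    # 1 request carrying all 1,000 documents
    bulk_syncs = write_with_fsync(os.path.join(tmp, "bulk.log"), [docs])

print(single_syncs, bulk_syncs)  # 1000 1
```

The bytes written are the same in both cases; only the number of sync calls differs, which is the part that tends to be expensive on real disks.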
Is it correct that there is no difference in performance from the perspective of an Elasticsearch server between indexing a single document 1000 times at the pace of the refresh interval and processing a bulk request of 1000 documents at once?
The transaction log you are talking about is the translog, right?
So, to recap, what you're saying is that bulk is better than single-document indexing because the translog incurs less I/O, since the translog is fsynced once per request, right?
Yes, that is one aspect. There is also a difference between handling a single HTTP request for a bulk request of N documents vs N separate HTTP requests. There may be other factors also contributing.
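As a rough illustration of the HTTP side, the sketch below builds the newline-delimited JSON body that the `_bulk` endpoint expects: one action line followed by one source line per document, with a trailing newline. The index name `my-index`, the sample documents, and the helper `build_bulk_body` are assumptions for the example, and no request is actually sent.

```python
import json


def build_bulk_body(index_name, docs):
    """Build an NDJSON body for the _bulk API (hypothetical helper)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(doc))                                # source line
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


docs = [{"id": i, "msg": f"event {i}"} for i in range(1000)]
body = build_bulk_body("my-index", docs)

# One HTTP request carries all 1,000 documents, instead of 1,000 separate
# requests that each need their own parsing, routing, and response.
print(body.count("\n"))  # 2000
```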
If the index.translog.durability parameter is set to async, then this will have no effect, right?
That setting will remove the overhead for fsyncing individual requests at the cost of reduced durability and resilience. Other factors, e.g. overhead related to handling multiple HTTP requests, would not go away though, so I would still expect a performance difference.
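For reference, a minimal sketch of the settings update being discussed, written as a Python snippet that only prints the request it would send. The index name `my-index` is an assumption. With `async`, the translog is fsynced in the background on an interval rather than on every request, so recently acknowledged writes can be lost if a node crashes.

```python
import json

index_name = "my-index"  # hypothetical index name
path = f"/{index_name}/_settings"

# The default durability is "request" (fsync before acknowledging each request).
settings = {"index": {"translog": {"durability": "async"}}}

print("PUT", path)
print(json.dumps(settings, indent=2))
```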