Indexing Performance vs Document Size

I see a big difference in indexing throughput between small and large documents. Is this expected, and if so, why, given the test conditions below?

  • Small documents are 1 KB, large documents are 10 KB to 30 KB
  • Observed throughput is 3 MB/s for small documents vs 20 MB/s for large ones (I cannot exceed ~4,000 documents/sec)
  • Bulk size is 300 documents; there is no performance improvement beyond this
  • Refresh is disabled (-1)
  • Not analyzing any field
  • Index buffer and translog are sized appropriately
  • Disk storage, no replication, 1 shard
  • No big difference with auto-id

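For reference, the conditions above (1 shard, no replication, refresh disabled) correspond to an index-creation body like the following sketch; the variable name and the ES 2.x setting keys are the only assumptions here:

```python
# Index settings matching the test conditions above (ES 2.x syntax).
settings = {
    "settings": {
        "number_of_shards": 1,       # 1 shard
        "number_of_replicas": 0,     # no replication
        "refresh_interval": "-1",    # refresh disabled
    }
}
```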
BTW, a refresh still happens when the index buffer is half full (the index buffer must use a ping-pong implementation).

I would like to understand the per-document processing overhead (including per-field overhead) and where the bottlenecks are.

Which ES version? What machine? What operating system? What network interface capacity? Which Java VM version? How many clients are indexing?

Each field results in a Lucene field that you can search on. Enabling _source and _all also contributes to the indexing work. If you do not specify your field mappings carefully, you put more load on document indexing than necessary.
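As an illustration of a lean mapping (the type name and field names here are hypothetical), an ES 2.x mapping that disables _all and _source and maps every field explicitly as not-analyzed might look like:

```python
# Hypothetical ES 2.x mapping: _all and _source disabled, all fields
# explicitly mapped so that no string field goes through an analyzer.
mapping = {
    "mappings": {
        "doc": {
            "_all": {"enabled": False},
            "_source": {"enabled": False},
            "properties": {
                "user_id": {"type": "string", "index": "not_analyzed"},
                "payload": {"type": "string", "index": "not_analyzed"},
                "created": {"type": "date"},
            },
        }
    }
}
```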

  • ES: 2.3.2, Lucene 5.5.0
  • OS: Ubuntu 14.04.1
  • Java: 8u73
    java version "1.8.0_73"
    Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
    Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)
  • _all is disabled
  • _source disabled
  • Fields are explicitly mapped
  • Same configuration and hardware; the only change is document size (1 KB vs 30 KB)
  • CPU is close to 100%
  • The difference in throughput relative to document size is surprising (3 MB/s vs 20 MB/s)

Maybe the client is the bottleneck? How do you generate the bulk input? What client language/tool? Is the client running on a separate machine?
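For comparison, a common way to generate the bulk input is to build the NDJSON body client-side and send one request per batch. A minimal Python sketch (index and type names are placeholders) that batches 300 documents per request, matching the bulk size above:

```python
import json

def build_bulk_body(docs, index="test", doc_type="doc"):
    """Build an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

def batches(docs, size=300):
    """Yield one bulk body per batch of `size` documents."""
    for i in range(0, len(docs), size):
        yield build_bulk_body(docs[i:i + size])
```

If the body is built with naive string concatenation or the client is single-threaded on the same machine, the generator itself can cap throughput well below what the server can absorb.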