Disk seeks and stored fields

Jeff Dean from Google gisted some latency numbers, which make for
interesting reading.

This is why you shouldn't normally store individual fields separately
(because each field requires a disk seek):

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns

By Jeff Dean (http://research.google.com/people/jeff/)
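To put the disk seek figure in perspective, here is a small sketch that expresses a few of the numbers above as "how many of these fit into one disk seek". The dictionary and the function name ops_per_disk_seek are made up for illustration; the values are copied from the list above.

```python
# Latencies in nanoseconds, copied from the list quoted in this thread.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
}


def ops_per_disk_seek(operation: str) -> float:
    """Return how many times `operation` fits into the time of one disk seek."""
    return LATENCY_NS["Disk seek"] / LATENCY_NS[operation]


if __name__ == "__main__":
    for name in LATENCY_NS:
        print(f"{name}: {ops_per_disk_seek(name):,.0f} per disk seek")
```

By these figures, one disk seek costs as much as 100,000 main memory references, or 20 round trips within the same datacenter.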

Do you mean each field in a separate index?

On 31.05.2012 at 15:20, Clinton Gormley wrote:

[...]

On Thu, 2012-05-31 at 21:38 +0200, Daniel Schnell wrote:

Do you mean each field in a separate index?

I mean:

  • by default, ES stores your JSON doc in the _source field, which is
    set to "stored"

  • by default, the fields in your JSON doc are set to NOT be "stored"
    (i.e. stored as a separate field)

  • so when ES returns your doc (search or get), it just loads the _source
    field and returns that, i.e. a single disk seek
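As a rough illustration of the two setups, the mappings below are written as Python dicts. This is only a sketch: the field names title and body are hypothetical, and it assumes the current mapping syntax ("store": true), which differs from the "store": "yes" form used by ES versions from the time of this thread.

```python
import json

# Default behaviour: _source is kept, individual fields are NOT stored.
# Field names are hypothetical.
default_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "body": {"type": "text"},
        }
    }
}

# The "optimization" discussed here: store one field separately as well.
stored_field_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text", "store": True},
            "body": {"type": "text"},
        }
    }
}


def separately_stored(mapping: dict) -> list:
    """List the fields that a mapping marks as stored on their own."""
    props = mapping["mappings"]["properties"]
    return [name for name, cfg in props.items() if cfg.get("store") is True]


if __name__ == "__main__":
    print(json.dumps(stored_field_mapping, indent=2))
    print(separately_stored(default_mapping))
    print(separately_stored(stored_field_mapping))
```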

Some people think that storing individual fields will be faster than
loading the whole JSON doc from the _source field. What they don't
realise is that each stored field requires a disk seek (10ms per seek!),
and that the sum of those seeks far outweighs the cost of just reading
and sending the _source field.

In other words, it is almost always a false optimization.
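A back-of-the-envelope sketch of that comparison, using the disk figures from the latency list. The 10 KB document size and the five requested fields are assumptions picked for illustration, and the model ignores the filesystem cache, so treat the output as a rough estimate rather than a measurement.

```python
# Figures from the latency list quoted in this thread (nanoseconds).
DISK_SEEK_NS = 10_000_000
DISK_READ_1MB_NS = 20_000_000
BYTES_PER_MB = 1024 * 1024


def source_cost_ns(doc_bytes: int) -> float:
    """One seek, then a sequential read of the whole _source field."""
    return DISK_SEEK_NS + DISK_READ_1MB_NS * doc_bytes / BYTES_PER_MB


def stored_fields_cost_ns(num_fields: int) -> float:
    """One seek per stored field, as described above; read time is ignored."""
    return num_fields * DISK_SEEK_NS


if __name__ == "__main__":
    doc_bytes = 10 * 1024  # assumed 10 KB JSON document
    num_fields = 5         # assumed number of stored fields requested
    print(f"_source:       {source_cost_ns(doc_bytes) / 1e6:.2f} ms")
    print(f"stored fields: {stored_fields_cost_ns(num_fields) / 1e6:.2f} ms")
```

Under these assumptions the _source route comes to roughly 10.2 ms against 50 ms for five separately stored fields; the seek dominates, and the extra bytes read barely register.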

clint
