Disk seeks and stored fields


(Clinton Gormley) #1

Jeff Dean from Google gisted some latency numbers, which make for
interesting reading.

This is why you shouldn't normally store individual fields separately
(because each field requires a disk seek):

L1 cache reference                            0.5 ns
Branch mispredict                               5 ns
L2 cache reference                              7 ns
Mutex lock/unlock                              25 ns
Main memory reference                         100 ns
Compress 1K bytes with Zippy                3,000 ns
Send 2K bytes over 1 Gbps network          20,000 ns
Read 1 MB sequentially from memory        250,000 ns
Round trip within same datacenter         500,000 ns
Disk seek                              10,000,000 ns
Read 1 MB sequentially from disk       20,000,000 ns
Send packet CA->Netherlands->CA       150,000,000 ns

By Jeff Dean (http://research.google.com/people/jeff/)
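To put the table in perspective, here is a small Python sketch that compares any two rows. The dictionary just copies the numbers above; the name LATENCY_NS and the helper ratio() are made up for illustration.

```python
# Latency figures copied from the table above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 25,
    "Main memory reference": 100,
    "Compress 1K bytes with Zippy": 3_000,
    "Send 2K bytes over 1 Gbps network": 20_000,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}


def ratio(slower: str, faster: str) -> float:
    """Return how many times longer `slower` takes than `faster` (hypothetical helper)."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


if __name__ == "__main__":
    # One disk seek costs as much as 100,000 main memory references.
    print(ratio("Disk seek", "Main memory reference"))  # 100000.0
    # A disk seek is 20 times slower than a round trip inside the datacenter.
    print(ratio("Disk seek", "Round trip within same datacenter"))  # 20.0
```

These are rough order-of-magnitude figures for hardware of that era, so treat the ratios as indicative only.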


(Daniel Schnell) #2

Do you mean each field in a separate index?



(Clinton Gormley) #3

On Thu, 2012-05-31 at 21:38 +0200, Daniel Schnell wrote:

Do you mean each field in a separate index?

I mean:

  • by default, ES stores your JSON doc in the _source field, which is
    set to "stored"

  • by default, the fields in your JSON doc are set to NOT be "stored"
    (i.e. stored as a separate field)

  • so when ES returns your doc (search or get), it just loads the
    _source field and returns that, i.e. a single disk seek

Some people think that storing individual fields will be faster than
loading the whole JSON doc from the _source field. What they don't
realise is that each stored field requires a disk seek (10 ms per
seek!), and that the sum of those seeks far outweighs the cost of just
sending the _source field.

In other words, it is almost always a false optimization.
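A back-of-the-envelope version of that argument, as a rough Python sketch. It assumes the latency figures quoted in this thread (10 ms per disk seek, 20 ms per MB read sequentially from disk), one seek per stored field as described above, and nothing being served from cache. The function names and the example document size are hypothetical.

```python
# Figures taken from the latency numbers quoted in this thread.
DISK_SEEK_MS = 10.0         # one disk seek
SEQ_READ_MS_PER_MB = 20.0   # reading 1 MB sequentially from disk


def stored_fields_cost_ms(num_fields: int) -> float:
    """Estimated cost of loading N separately stored fields: one seek each."""
    return num_fields * DISK_SEEK_MS


def source_cost_ms(doc_size_kb: float) -> float:
    """Estimated cost of loading _source: one seek plus a sequential read."""
    return DISK_SEEK_MS + (doc_size_kb / 1024.0) * SEQ_READ_MS_PER_MB


if __name__ == "__main__":
    # Assumed example: a 10 KB JSON doc, of which the client wants 5 fields.
    print(f"5 stored fields: {stored_fields_cost_ms(5):.1f} ms")  # 50.0 ms
    print(f"whole _source:   {source_cost_ms(10):.1f} ms")        # 10.2 ms
```

Under this simple model the sequential read of a typical doc is tiny next to a single seek, so two or more stored fields already cost more than loading the whole _source.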

clint


