Disk seeks and stored fields

Clinton_Gormley · May 31, 2012, 1:20pm

Jeff Dean from Google gisted some latency numbers, which make for
interesting reading.

This is why you shouldn't normally store individual fields separately
(because each field requires a disk seek):

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from disk     20,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns

By Jeff Dean (http://research.google.com/people/jeff/):

Daniel_Schnell · May 31, 2012, 7:38pm

Do you mean each field in a separate index?

Clinton_Gormley · June 1, 2012, 11:40am

On Thu, 2012-05-31 at 21:38 +0200, Daniel Schnell wrote:

Do you mean each field in a separate index?

I mean:

- By default, ES stores your JSON doc in the _source field, which is
  set to "stored".
- By default, the fields in your JSON doc are NOT set to be "stored"
  (i.e. stored as separate fields).
- So when ES returns your doc (search or get), it just loads the
  _source field and returns that, i.e. a single disk seek.
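
To make the two setups concrete, here is a rough sketch of what the
mappings might look like, written as Python dicts. The field names
("title", "body") are hypothetical, and the exact mapping syntax varies
between Elasticsearch versions (older releases used the "string" type
rather than "text"), so treat this as an illustration rather than a
copy-paste mapping.

```python
import json

# Default behaviour described above: no field is individually stored,
# so the document comes back from the single stored _source field.
# "title" and "body" are hypothetical field names.
default_mapping = {
    "properties": {
        "title": {"type": "text"},
        "body": {"type": "text"},
    }
}

# The setup this thread argues against: every field is also stored
# separately via the "store" mapping parameter.
per_field_mapping = {
    "properties": {
        "title": {"type": "text", "store": True},
        "body": {"type": "text", "store": True},
    }
}


def stored_field_names(mapping):
    """Return the names of fields explicitly marked as stored."""
    return sorted(
        name
        for name, config in mapping["properties"].items()
        if config.get("store")
    )


if __name__ == "__main__":
    print(json.dumps(per_field_mapping, indent=2))
    print(stored_field_names(default_mapping))
    print(stored_field_names(per_field_mapping))
```

The helper only inspects the dicts; it does not talk to a cluster.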

Some people think that storing individual fields will be faster than
loading the whole JSON doc from the _source field. What they don't
realise is that each stored field requires a disk seek (10 ms per
seek!), and that the sum of those seeks far outweighs the cost of just
sending the _source field.

In other words, it is almost always a false optimization.
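
As a back-of-the-envelope check, the arithmetic can be sketched in a few
lines of Python using the disk figures from the latency list at the top
of this thread. This is a simplified model, not a benchmark: it assumes
one seek per stored field as stated above, treats 1 MB as 1,000,000
bytes, and ignores the filesystem cache and SSDs. The function names
and the document and field sizes are made up for illustration.

```python
# Disk figures from the latency list in this thread (nanoseconds).
DISK_SEEK_NS = 10_000_000
SEQ_DISK_READ_NS_PER_MB = 20_000_000
BYTES_PER_MB = 1_000_000  # assumption: decimal megabyte


def read_ns(num_bytes):
    """Time to read num_bytes sequentially from disk."""
    return SEQ_DISK_READ_NS_PER_MB * num_bytes / BYTES_PER_MB


def source_cost_ns(doc_bytes):
    """One seek, then one sequential read of the whole _source."""
    return DISK_SEEK_NS + read_ns(doc_bytes)


def stored_fields_cost_ns(num_fields, field_bytes):
    """One seek per stored field (the assumption in this thread),
    plus a short sequential read for each field."""
    return num_fields * (DISK_SEEK_NS + read_ns(field_bytes))


if __name__ == "__main__":
    # Hypothetical sizes: a 10 KB document vs. three 100-byte fields.
    whole_doc = source_cost_ns(10_000)
    three_fields = stored_fields_cost_ns(3, 100)
    print(f"_source:       {whole_doc / 1e6:.3f} ms")
    print(f"stored fields: {three_fields / 1e6:.3f} ms")
```

Under these assumptions, loading a 10 KB _source costs about 10.2 ms,
while fetching three tiny stored fields costs about 30 ms, because the
seeks dominate and the bytes read barely matter.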

clint

