Jeff Dean from Google gisted some latency numbers, which make for
interesting reading.
This is why you shouldn't normally store individual fields separately
(because each field requires a disk seek):
L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns
Compress 1K bytes with Zippy             3,000 ns
Send 2K bytes over 1 Gbps network       20,000 ns
Read 1 MB sequentially from memory     250,000 ns
Round trip within same datacenter      500,000 ns
Disk seek                           10,000,000 ns
Read 1 MB sequentially from disk    20,000,000 ns
Send packet CA->Netherlands->CA    150,000,000 ns
Numbers by Jeff Dean (http://research.google.com/people/jeff/).
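
To put those magnitudes side by side, here's a quick Python sketch. The values are exactly the ones quoted above; the comparisons it prints are just illustrations.

# Jeff Dean's latency numbers, as quoted above, in nanoseconds.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Branch mispredict": 5,
    "L2 cache reference": 7,
    "Mutex lock/unlock": 25,
    "Main memory reference": 100,
    "Compress 1K bytes with Zippy": 3_000,
    "Send 2K bytes over 1 Gbps network": 20_000,
    "Read 1 MB sequentially from memory": 250_000,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Read 1 MB sequentially from disk": 20_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}

seek = LATENCY_NS["Disk seek"]
print(f"One disk seek = {seek / 1_000_000:.0f} ms")
print(f"One disk seek = {seek / LATENCY_NS['Main memory reference']:,.0f} "
      f"main memory references")

Running it prints 10 ms per seek, ie one seek costs as much as 100,000 main memory references.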
On Thu, 2012-05-31 at 21:38 +0200, Daniel Schnell wrote:
Do you mean each field in a separate index?
I mean:

- by default, ES stores your JSON doc in the _source field, which is
  set to "stored"
- by default, the individual fields in your JSON doc are NOT set to
  "stored", ie they are not written as separate stored fields
- so when ES returns your doc (search or get), it just loads the
  _source field and returns that, ie a single disk seek (see the rough
  mapping sketch below)
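
As a rough illustration of those defaults, here is what an explicit mapping might look like, written out as a Python dict. The type and field names are made up, and the exact mapping syntax differs between ES versions, so treat it as a sketch rather than a recipe.

# A rough sketch of the defaults described above, in the shape of the
# mapping you might send when creating an index. Field names are made up
# and the exact syntax varies between ES versions.
mapping = {
    "doc": {
        # _source holds the complete original JSON and is itself stored:
        # fetching the whole document costs a single disk seek.
        "_source": {"enabled": True},
        "properties": {
            # Indexed for search, but NOT stored as a separate field
            # (the default), so it is only returned via _source.
            "title": {"type": "string"},
            # Explicitly stored as its own field: retrieving it as a
            # stored field costs an extra disk seek.
            "body": {"type": "string", "store": "yes"},
        },
    },
}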
Some people think that storing individual fields will be faster than
loading the whole JSON doc from the _source field. What they don't
realise is that each stored field requires its own disk seek (10ms per
seek!), and that the sum of those seeks far outweighs the cost of just
sending the _source field.
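
Putting rough numbers on that, using the latencies quoted above and the one-seek-per-stored-field model (the field count and document size here are made-up examples):

# Illustrative arithmetic only: latencies as quoted above, field count
# and document size are made-up examples.
SEEK_NS = 10_000_000           # one disk seek: 10 ms
READ_1MB_DISK_NS = 20_000_000  # read 1 MB sequentially from disk: 20 ms

num_stored_fields = 10         # hypothetical: 10 separately stored fields
doc_size_mb = 0.01             # hypothetical: a ~10 KB _source document

stored_fields_ms = num_stored_fields * SEEK_NS / 1_000_000
source_ms = (SEEK_NS + doc_size_mb * READ_1MB_DISK_NS) / 1_000_000

print(f"{num_stored_fields} separately stored fields: ~{stored_fields_ms:.0f} ms")
print(f"one _source field: ~{source_ms:.1f} ms")

That works out to roughly 100 ms for ten separately stored fields versus about 10 ms for one _source fetch.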
In other words, it is almost always a false optimization.
clint