Impact of stored fields on performance

I recently added a field of type "binary" to all documents, mapped with
"store": "true". The field contents are large, and as a result the on-disk
index size grew roughly 3x, from 2.5GB/shard to ~8GB/shard.
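
For reference, the mapping change was along these lines (the index, type,
and field names here are placeholders, not the real ones):

  curl -XPUT 'localhost:9200/myindex/_mapping/mydoc' -d '{
    "mydoc": {
      "properties": {
        "payload": { "type": "binary", "store": true }
      }
    }
  }'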

After this change I've seen a big jump in query latency. Searches which
previously took 40-60ms now take 800ms and longer. This is the case even
for queries which don't return the binary field.
I tried optimizing the index down to max_num_segments=1, but query latency
remains high.
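
The optimize call was the standard one, something like (placeholder index
name again):

  curl -XPOST 'localhost:9200/myindex/_optimize?max_num_segments=1'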

Is this expected? Obviously queries returning the new field will take a
hit (since the stored field data has to be read from disk), but I would've
expected other queries to be largely unaffected.

Is the problem that larger file sizes make memory-mapping and the FS cache
less effective? Or are stored fields still being read from disk even when
not included in the "fields" parameter of the search request?
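
A typical query looks roughly like this; note the binary field is not in
the "fields" list (field names are again placeholders):

  curl -XGET 'localhost:9200/myindex/_search' -d '{
    "fields": ["title", "created_at"],
    "query": { "match": { "title": "example" } }
  }'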

Hi Ashish,

How many documents do your queries typically retrieve? (the value of the
size parameter)

--
Adrien Grand

The query size parameter is 200.
Actual hit totals vary widely, generally around 1000-10000. A minority are
much lower. About 10% of queries end up with just 1 or 0 hits.

OK, so quite large pages. The next question is how much memory you have
on each node, how much of it is given to Elasticsearch (ES_HEAP_SIZE), and
how large the data/ directory is.

For example, if you used to have ${ES_HEAP_SIZE} + ${size of data} < ${total
memory of the machine}, your whole index could fit in the filesystem cache,
which is very fast (and would explain why you got such good response times
of 40-60ms in spite of a size of 200). But if that sum now exceeds the total
memory, the disk often needs to perform actual seeks (around 5 to 10ms per
seek on magnetic storage), which can badly degrade latency.
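
A quick way to eyeball this on each node (the paths below assume a
Debian-style package install, so adjust for your setup):

  grep ES_HEAP_SIZE /etc/default/elasticsearch   # heap given to Elasticsearch
  du -sh /var/lib/elasticsearch                  # size of the data directory
  free -g                                        # total memory on the machine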

--
Adrien Grand

That sounds plausible. We are using spindle disks. I have ~36GB free for
the filesystem cache, and the previous data size (without the added field)
was 60-65GB per node, so it's likely that >50% of reads were previously
served out of the FS cache, and even more if queries are unevenly
distributed. Data size is now 200GB/node, so only ~18% of reads could hit
the cache and the rest would incur seek times.
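
Back-of-the-envelope, from the numbers above:

  echo "scale=2; 36/65" | bc    # before: 36GB cache / 65GB data  = .55
  echo "scale=2; 36/200" | bc   # now:    36GB cache / 200GB data = .18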

Hmm... given this, is there a way to mitigate the effect without moving
everything to SSD? Only a minority of queries return the stored field, and
it is not indexed. Ideally it would be stored in files separate from (but
colocated with) the indexed fields; that way most queries would be
unaffected, and only those returning the value would incur the seek cost.

I imagine indexes with _source enabled would see similar effects.

Is a parent-child relationship a good way to achieve the scenario above?
The parent could hold the indexed fields and the child the stored fields.
I'm not sure whether this just introduces new problems.

Hi Ashish,

I think that you don't even need parent/child relations for this. If you
identify a few large stored fields that you rarely need, you could store
them in a different index with the same _id and only GET them on demand.
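
A minimal sketch of what I mean (index, type, and field names made up):

  # index the searchable fields as usual
  curl -XPUT 'localhost:9200/myindex/mydoc/42' -d '{ "title": "example doc" }'

  # put the large binary value in a side index under the same _id
  curl -XPUT 'localhost:9200/myindex_blobs/mydoc/42' -d '{ "payload": "cGF5bG9hZA==" }'

  # searches only ever touch myindex; fetch the blob by _id when needed
  curl -XGET 'localhost:9200/myindex_blobs/mydoc/42'

That way only the requests that actually need the binary value pay the
extra disk access.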

--
Adrien Grand
