Using serialized doc_value instead of _source to improve read latency

Hi,

We have a performance problem: for each hit, Elasticsearch parses the
entire _source and then generates new JSON containing only the requested
_source fields. To overcome this, we would like to use a mapping transform
script that serializes the requested fields (which are known in advance)
into a doc_value. Does that make sense?

The actual problem with the transform script is a SecurityException that
prevents us from using any JSON serialization mechanism. A binary
serialization would also be OK.

Itai
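
To make the cost described in this post concrete, here is a minimal Python sketch of the two per-hit paths: projecting the requested fields out of a full _source, versus returning a string that was serialized once at index time. It is a local simulation only, not Elasticsearch's implementation, and the field names (title, price, body_text) and helper names are hypothetical.

```python
import json

# Hypothetical document: a few display fields plus a large index-only field.
DOC = {
    "title": "Blue widget",
    "price": 9.99,
    "body_text": "lorem ipsum " * 1000,  # only needed for full-text search
}
REQUESTED = ("title", "price")  # known in advance, as in the post


def project_per_hit(source_json, requested):
    """Parse the whole source, then generate new JSON with only the requested fields."""
    parsed = json.loads(source_json)
    return json.dumps({k: parsed[k] for k in requested if k in parsed})


def serialize_once(doc, requested):
    """Do the same projection a single time, at index time."""
    return json.dumps({k: doc[k] for k in requested if k in doc})


source_json = json.dumps(DOC)                 # what would be kept as _source
precomputed = serialize_once(DOC, REQUESTED)  # what the extra field would hold

# Both paths yield the same payload; only the per-hit work differs.
assert project_per_hit(source_json, REQUESTED) == precomputed
```

With the pre-serialized variant, the per-hit work reduces to handing back a string, which is the effect the post is after.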

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b897aba2-c250-4474-a03f-1d2a993baef9%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This is how _source works. doc_values don't make sense in this regard -
what you are looking for is stored fields, with the transform script
writing to them. Loading stored fields (even one field per hit) may be
slower than loading and parsing _source, though.

I'd just put this logic in the indexer, though. It will definitely help
with other things as well, such as nasty huge mappings.

Alternatively, find a way to avoid IO completely. How about using ES for
search and something like Riak for loading the actual data, if IO costs are
so noticeable?

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Lucene.NET committer and PMC member
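
A minimal sketch of what "put this logic in the indexer" could look like, assuming the indexer is a Python process that builds bulk requests. The field list, the target field name display_json, and the index and type names are all hypothetical; only the bulk body layout (an action line followed by a source line) follows the Elasticsearch bulk format.

```python
import json

DISPLAY_FIELDS = ("title", "price")   # hypothetical: fields returned to clients
TARGET_FIELD = "display_json"         # hypothetical name for the collapsed field


def with_display_field(doc):
    """Return a copy of doc carrying its display fields pre-serialized as one string."""
    display = {k: doc[k] for k in DISPLAY_FIELDS if k in doc}
    enriched = dict(doc)
    enriched[TARGET_FIELD] = json.dumps(display, sort_keys=True)
    return enriched


def bulk_lines(docs, index, doc_type):
    """Build a newline-delimited bulk body: an action line, then the document."""
    lines = []
    for doc_id, doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(with_display_field(doc)))
    return "\n".join(lines) + "\n"


body = bulk_lines(
    [("1", {"title": "Blue widget", "price": 9.99, "body_text": "long text"})],
    index="products",
    doc_type="product",
)
print(body)
```

Doing the serialization in the indexer keeps scripting out of the cluster entirely, so the sandbox restrictions on transform scripts never come into play.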

Itamar,

  1. The _source field includes many fields that are only indexed, and
    many fields that are only needed in the search result. _source includes
    them both. Projecting the result fields out of _source is too CPU
    intensive to do at search time for each hit, especially if the size
    is big.
  2. I agree that adding another NoSQL store could solve this problem, but
    it is currently out of scope, as it would require syncing data with
    another data store.
  3. Wouldn't a big stored field bloat the Lucene index size? Even if
    not, aren't not_analyzed fields destined to be (or already)
    doc_values?

Also - does "fielddata": { "loading": "eager" } make sense with
doc_values in this use case? Would that combination be supported in the
future?

What if all those fields are collapsed into one, like you suggest, but that
one field is projected out of _source (think non-indexed JSON in a string
field)? Do you see a noticeable performance gain then?

What if that field is set to be stored (and loaded using fields, not via
_source)? What is the performance gain then?

Fielddata, and the doc_values optimization on top of it, will not help you
here. Those data structures aren't used for sending data out, only for
aggregations and sorting. Also, using fielddata will require indexing
those fields; it is apparent that you are not looking to do that.

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Lucene.NET committer and PMC member
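
The two experiments proposed in this reply can be sketched as request bodies. This assumes ES 1.x-style mapping syntax ("index": "no", "store": true) and the 1.x "fields" search parameter; the type name, field name display_json, and the query are hypothetical.

```python
import json

# Assumed ES 1.x-style mapping: the collapsed field is kept out of the
# inverted index ("index": "no") and written as a stored field ("store": true).
mapping = {
    "product": {  # hypothetical type name
        "properties": {
            "display_json": {"type": "string", "index": "no", "store": True},
        }
    }
}

query = {"match": {"body_text": "widget"}}  # hypothetical query

# Variant A: project the single collapsed field out of _source.
via_source = {"query": query, "_source": ["display_json"]}

# Variant B: request the stored field by name instead of asking for _source.
via_stored = {"query": query, "fields": ["display_json"]}

for name, body in (("via _source", via_source), ("via stored field", via_stored)):
    print(name, json.dumps(body, sort_keys=True))
```

Variant A only works if the indexer puts display_json into the document itself; variant B only needs the field to be stored.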

A quick check shows there is no significant performance gain between a
doc_value and a stored field that is not a doc value. I suppose warm-up
and file system caching issues are at play. I do not have that field in
the source, since the ETL process at this point does not generate it. The
ETL could be fixed so that it generates the required field. However, even
then I would still prefer a doc_values field over _source, since I do not
need _source at all. You are right to assume that reading the entire
source, parsing it, and returning only one field would be fast (I suspect
the CPU time goes to the JSON generator rather than the parser, but
confirming that requires more work).

Have you profiled it and seen that reading the source is actually the slow
part? hot_threads can lie here, so I'd go with a profiler or just SIGQUIT
or something.

I've got some reasonably big documents and generally don't see that as a
problem even under decent load.

I could see an argument for a second source field with the long stuff
removed if you see the JSON decode or the disk read of the source being
really slow - but transform doesn't do that.

Nik

Hi Nik,

When _source is true, the time it takes for the search to complete in
Elasticsearch is very short. When _source is a list of fields, it is
significantly slower.

Itai
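
One way to put numbers on this observation is to compare the "took" value reported in the search response across the two request variants. The sketch below is a hedged illustration: search stands for any callable that sends a body and returns the parsed response as a dict (a placeholder for whatever client is in use), and the field names are hypothetical. Warm-up runs are included because file system caching was mentioned earlier in the thread as a possible distortion.

```python
import statistics


def median_took(search, body, runs=20, warmup=5):
    """Run the same request repeatedly and return the median reported `took` (ms)."""
    for _ in range(warmup):  # let caches settle before measuring
        search(body)
    return statistics.median(search(body)["took"] for _ in range(runs))


def compare(search, bodies, **kwargs):
    """Map each named request variant to its median `took`."""
    return {name: median_took(search, body, **kwargs) for name, body in bodies.items()}


bodies = {
    "full_source": {"query": {"match_all": {}}, "_source": True},
    "filtered_source": {"query": {"match_all": {}}, "_source": ["title", "price"]},
}
```

Calling compare(search, bodies) with a real client callable returns one median per variant. Timing on the client side is a reasonable cross-check if "took" turns out not to cover the work being investigated.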

To focus the question better: how do I whitelist a specific class in the
Groovy script inside a transform?

The answer is these changes in elasticsearch.yml:

script.groovy.sandbox.class_whitelist: com.fasterxml.jackson.databind.ObjectMapper
script.groovy.sandbox.package_whitelist: com.fasterxml.jackson.databind

For some reason these classes are not shaded, even though the pom.xml does
shade them.
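
For context, a mapping that uses such a transform script might look roughly like the sketch below, written as a Python dict for consistency with a Python indexer. This is an assumption-laden illustration, not the mapping from this thread: the type name, field names, and the Groovy script body are hypothetical, and it has not been run against a live cluster. It assumes the ES 1.x mapping "transform" element with "script", "params", and "lang" keys, and Jackson's ObjectMapper.writeValueAsString.

```python
import json

# Hypothetical Groovy transform script. It copies the display fields into a map
# and serializes the map with the whitelisted Jackson ObjectMapper class.
TRANSFORM_SCRIPT = """\
def display = [:]
for (name in fields) {
    if (ctx._source.containsKey(name)) {
        display[name] = ctx._source[name]
    }
}
ctx._source['display_json'] = new com.fasterxml.jackson.databind.ObjectMapper().writeValueAsString(display)
"""

mapping = {
    "product": {  # hypothetical type name
        "transform": {
            "script": TRANSFORM_SCRIPT,
            "params": {"fields": ["title", "price"]},  # hypothetical field list
            "lang": "groovy",
        },
        "properties": {
            # Stored and not indexed, so it can be requested by name at search time.
            "display_json": {"type": "string", "index": "no", "store": True},
        },
    }
}

print(json.dumps(mapping, indent=2))
```

The target field is marked as stored here because, as noted earlier in the thread, transform does not produce a second source field, so the serialized value would have to be read back through a stored field or doc_values.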
