Per document _source exclusion after indexing but before storage


(btiernay) #1

Hi all,

I have significant skew in document sizes in a particular index. Some
docs are around 1 KB and some can be as large as ~3 GB. However, the number
of very large documents is very small. As a result, any query that matches
the large documents requires streaming the entire document from disk. This
results in extremely long search response latencies. The situation is
compounded because these large documents usually get hit together and
typically end up on the same page.

Ideally, I would like to apply _source excludes on a per-document basis
when indexing, since specifying this in the mapping is too general.
Additionally, I would like the exclusion to take place *after* indexing but
before the source is stored. Is there any way to achieve this effect,
perhaps using an update request?

Secondly, when using a mapping with _source excludes
(http://www.elasticsearch.org/guide/reference/mapping/source-field/),
does this exclusion happen before or after indexing? Is this configurable?

Thank you in advance,

Bob

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(btiernay) #2

Looks like the answer to my second question is *after* indexing, which is a
good thing :) Perhaps this should be added to the docs, as it seems unclear.
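
To make the "after indexing" behaviour concrete, here is a minimal Python sketch of the idea. It is only a model of the effect and not Elasticsearch's actual code: the indexed field paths come from the full document, while the stored copy has the excluded paths removed. The document, the field names, and the helpers `flatten` and `filter_source` are all hypothetical, and the pattern matching is a simplification (nested dicts only, no arrays).

```python
from fnmatch import fnmatchcase


def flatten(doc, prefix=""):
    """Yield the dotted path of every leaf field in a nested dict."""
    for key, value in doc.items():
        path = prefix + key
        if isinstance(value, dict):
            yield from flatten(value, path + ".")
        else:
            yield path


def filter_source(doc, excludes, prefix=""):
    """Return a copy of doc without fields whose dotted path matches a pattern."""
    kept = {}
    for key, value in doc.items():
        path = prefix + key
        if any(fnmatchcase(path, pattern) for pattern in excludes):
            continue  # dropped from the stored copy only
        if isinstance(value, dict):
            value = filter_source(value, excludes, path + ".")
        kept[key] = value
    return kept


# Hypothetical document and exclude pattern.
doc = {"id": "donor-1", "mutations": {"count": 2, "detail": "large payload"}}
excludes = ["mutations.*"]

indexed_paths = sorted(flatten(doc))          # built from the full document
stored_source = filter_source(doc, excludes)  # what would be kept as _source

print(indexed_paths)
print(stored_source)
```

In this model the excluded fields still appear in `indexed_paths`, so they remain searchable, but they are absent from `stored_source` and would not be returned with a hit.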

(In reply to post #1 by btiernay, Sunday, 15 September 2013 18:33:16 UTC-4.)


(Zachary Tong) #3

Hmm, there is an ignore_above setting which allows you to avoid analyzing
large fields... I wonder if a similar concept could be added for
including/excluding source? Looking at the code, it shouldn't be difficult
to filter based on the length of the field. However, I'm not sure what
will happen if some docs are missing sources while others aren't.

I'm not aware of a way to ignore it on a per-doc basis either. Right now,
I believe the only available options are entire-field inclusions/exclusions.
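
For reference, a rough sketch of what an ignore_above mapping could look like, written as a Python dict that serializes to the JSON mapping body. The type name `donor` and the field name `notes` are made up for illustration, and the `string` field type is assumed from the Elasticsearch versions current at the time of this thread (newer versions attach ignore_above to `keyword` fields). The helper `is_indexed` is hypothetical and only approximates the rule that values longer than the limit are skipped.

```python
import json

# Hypothetical type and field names; "string" is the field type assumed for
# the Elasticsearch versions discussed in this thread.
mapping = {
    "donor": {
        "properties": {
            "notes": {
                "type": "string",
                "index": "not_analyzed",
                "ignore_above": 256,
            }
        }
    }
}


def is_indexed(value, ignore_above):
    """Approximate the rule: values longer than the limit are not indexed."""
    return len(value) <= ignore_above


print(json.dumps(mapping, indent=2))
print(is_indexed("a" * 257, 256))
```

Note that this only affects what gets indexed for that field. It does not change what is kept in _source, which is why it does not help with the retrieval cost discussed here.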

-Zach

(In reply to post #2 by btiernay, Sunday, September 15, 2013 7:58:13 PM UTC-4.)


(Jörg Prante) #4

Why is cutting the source length on the client side not appropriate? Moving
~3 GB over the API to the ES JVM heap and having ES process it just for a
cut, or for an analyzer/tokenizer which does the same, puts quite heavy
pressure on the server side. Even Google stops the crawler at around 2 MB
of a source document, afaik.
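
A minimal sketch of what such a client-side cut might look like in Python, assuming documents are plain dicts and that dropping whole top-level fields is acceptable. The helper names, the document, and the 1024-byte limit are hypothetical. Note that anything removed this way never reaches the server, so it is neither indexed nor stored.

```python
import json


def json_size(value):
    """Size in bytes of the UTF-8 JSON encoding of value."""
    return len(json.dumps(value).encode("utf-8"))


def prune_large_fields(doc, max_field_bytes):
    """Return a copy of doc without top-level fields larger than the limit."""
    return {k: v for k, v in doc.items() if json_size(v) <= max_field_bytes}


# Hypothetical document: one small field and one oversized field.
doc = {"id": "donor-1", "mutations": ["x" * 100] * 50}
trimmed = prune_large_fields(doc, max_field_bytes=1024)

print(sorted(trimmed))
```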

Jörg

On Mon, Sep 16, 2013 at 12:33 AM, btiernay <rtiernay@gmail.com> wrote:

Additionally, I would like the exclusion to take place *after* indexing
but before being stored.



(btiernay) #5

Hi Jörg,

This is a cancer genomics application where the data is highly structured /
nested (tree) representing pre-joined data. In this particular case, the
document represents a donor which has been heavily mutated. Truncating the
document would be throwing out valuable research data.

Keep in mind that we actually do want to index these large documents. Thus
we wouldn't want to cut at analysis / tokenization. I have confirmed
that search time is not the issue, just document retrieval.

For now we might be able to get away with using _source excludes, since for
this particular index we don't seem to be using the lower levels of the
tree, which can thus be safely pruned from source. However, having greater
flexibility when storing source (e.g. per-document control) may be a
generally useful feature.
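
As a rough illustration of the mapping-level pruning described above, the following Python snippet builds a mapping body with _source excludes. The type name `donor` and the path `mutations.consequences.*` are placeholders, since the real tree layout is not shown in this thread, and the type-keyed layout is assumed from the documentation of that era (newer versions nest this under `mappings`).

```python
import json

# Hypothetical type name and field path; the real tree layout is not shown
# in this thread.
mapping = {
    "donor": {
        "_source": {
            "excludes": ["mutations.consequences.*"]
        }
    }
}

# JSON body that would be sent with a put-mapping request.
body = json.dumps(mapping)
print(body)
```

Because the exclude list lives in the mapping, it applies to every document of the type, which is exactly the limitation that per-document control would lift.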

Cheers,

Bob

(In reply to post #4 by Jörg Prante, Monday, 16 September 2013 03:03:43 UTC-4.)

