Delayed materialization of the full result set and CRC/Hash-style diff check

I struggled to come up with a subject line for this one, so please bear
with me.

Historically, when searching, we have returned all result IDs from our
Lucene-based search infrastructure. Yes, NOT a good idea; we have copped
some interesting performance characteristics and had to do some heavy
internal Lucene customization/forking to make it work, but let me explain
the 'rationale'.

If we have a feature to perform a search, we generally only need to display:

  • total hits

  • Results 1-X based on a page size (see the sketch after this list)
  • Perhaps hold a few pages of results, and then show a "More" button or
    something (hopefully no-one is silly enough to want to go to result
    # million..)
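
For reference, the paged case is roughly the following against the HTTP
search API. This is only a sketch: the index name "items", the sort field
"updated_at", the page size, and the use of Python requests are all
assumptions, and hits.total is read as a plain number, as on the older ES
versions we're running.

    import requests

    ES = "http://localhost:9200"   # assumed local node
    PAGE_SIZE = 50                 # hypothetical page size

    def fetch_page(query, page):
        """Fetch one page of sorted hits plus the total hit count."""
        body = {
            "query": query,
            "sort": [{"updated_at": "desc"}],   # hypothetical sort field
            "from": page * PAGE_SIZE,
            "size": PAGE_SIZE,
        }
        resp = requests.post(ES + "/items/_search", json=body).json()
        total = resp["hits"]["total"]           # total hits for the header
        ids = [hit["_id"] for hit in resp["hits"]["hits"]]
        return total, ids

    total, first_page = fetch_page({"match_all": {}}, 0)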

That's fine, the usual use case. Then, when a user wishes to export these
results to some external mechanism (classic: Excel), we do need to
materialize the whole result set. Now, the _scan API is there, but it
doesn't support sorting, and we need to maintain the sort for the export to
Excel. The scrolling mechanism allows us to do that, but I'm wondering
about the consistency level between scroll iterations.

The right way to do this would be to only materialize the whole result set
when you need to do the Excel export. HOWEVER, this introduces latency
between the original search and the one that is done for the export,
resulting in differing results between the two if there's a high update
rate (our problem).

What I would like to be able to do is compute a hash of the 'top results'
based on their IDs, and then, when exporting, compute the hash again; if
the hash is different, you know you have a difference between the original
and the export. I'd like to be able to show our customers a little note:
"Warning: These results may be different from when you first searched".
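
Client-side, the check itself is tiny. A sketch, assuming we just stream
the IDs in result order through a digest (SHA-1 here, but any stable hash
would do; the literal IDs are made up):

    import hashlib

    def result_fingerprint(ids):
        """Hash the sequence of result IDs, in result order."""
        digest = hashlib.sha1()
        for doc_id in ids:
            digest.update(doc_id.encode("utf-8"))
            digest.update(b"\x00")  # separator so ["ab", "c"] != ["a", "bc"]
        return digest.hexdigest()

    # at search time, over the top results' IDs
    original = result_fingerprint(["42", "17", "99"])

    # at export time, recompute over the freshly fetched IDs and compare
    if result_fingerprint(["42", "17", "100"]) != original:
        print("Warning: These results may be different from when you first searched")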

Perhaps I need to convince the Product Owners to just suck this one up and
assume that for any export the results may have differed. I guess if we can
determine that they haven't, the customer can have more confidence. In our
case this can be important for their decision making (won't bore you with
the details).

I was thinking that maybe the mvel script stuff could be used to compute
some sort of hash and return it with the results? Is that practical? Is it
crazy? Is there another, smarter way of handling this?

thanks,

Paul

First, regarding scrolling: when you scroll, a point-in-time view (as of
the first scroll request) is "created", so as you keep scrolling, you
scroll against the same view of the data. If things change, they won't be
reflected.

If I got the hashing idea right, then I think it will be very expensive.
You can get the time of the last refresh that happened on each shard, but
that's only information about that shard's refresh; it might not have
affected the data being searched.

You can use scrolling to try and solve it. You can scroll your search
request, and if an export is needed, just continue to scroll the data set
for the export. Note, though, that a non-scan scroll can get expensive,
especially if the result set is big.
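
Roughly like this over the HTTP API. A sketch only: the index name "items",
the sort field, the 5m keep-alive, and the page size are placeholders, and
the exact shape of the scroll continuation call has changed across
versions.

    import requests

    ES = "http://localhost:9200"   # assumed local node

    def export_ids(query, sort):
        """Open a sorted (non-scan) scroll and drain it, yielding IDs in order."""
        body = {"query": query, "sort": sort, "size": 500}
        resp = requests.post(ES + "/items/_search?scroll=5m", json=body).json()
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit["_id"]
            # the keep-alive must be passed again on every continuation
            resp = requests.get(
                ES + "/_search/scroll",
                params={"scroll": "5m", "scroll_id": resp["_scroll_id"]},
            ).json()

    for doc_id in export_ids({"match_all": {}}, [{"updated_at": "desc"}]):
        pass  # stream each row into the Excel export here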

On Thursday, 28 July 2011, Shay Banon kimchy@gmail.com wrote:

First, regarding scrolling. When you scroll, a point in time (the first
scroll) view is "created", so as you scroll, you keep scrolling against the
same view of the data. If things change, then they won't be reflected.

So copy-on-write style semantics. Ok that's nice.

If I got the hashing idea right, then I think it will be very expensive.
You can get the time of the last refresh that happened on each shard, but
that's only information about that shard's refresh; it might not have
affected the data being searched.

Time of the refresh won't be enough, because the data is multi-tenant, so
any change from any user will affect that timestamp but not necessarily
affect the data we searched.

Can you explain why you think the hashing would be expensive? I'm just
curious to understand why; I'll trust your judgement obviously, but I like
understanding the internals.

You can use scrolling to try and solve it. You can scroll your search
request, and if an export is needed, just continue to scroll the data set
for the export. Note, though, that a non-scan scroll can get expensive,
especially if the result set is big.

IIRC the scroll API requires a timeout setting so the ES server can tear
down resources, and that timeout is difficult for us to pick. It might
work, but long-lived scrolls might also get expensive.

The results indeed can be big. Alas, we're talking millions of items at
times.

If the results are spread across a good number of shards, can the sorted
results be merge-sorted efficiently in an iterating, streaming style?

I guess we should discuss how ES deals with a very large number of results
sorted by a field (not by score). How does the node handling the request
merge the shard-level results?

Appreciate your time on this; it's quite critical for our migration of our
index infrastructure, which has to cope with this (we end up with a very
large pre-initialized PriorityQueue of size <#results>, lots of memory...
but only one shard on one host).

Paul Smith

On Thu, Jul 28, 2011 at 10:47 AM, Paul Smith tallpsmith@gmail.com wrote:

On Thursday, 28 July 2011, Shay Banon kimchy@gmail.com wrote:

First, regarding scrolling. When you scroll, a point in time (the
first scroll) view is "created", so as you scroll, you keep scrolling
against the same view of the data. If things change, then they won't be
reflected.

So copy-on-write style semantics. Ok that's nice.

If I got the hashing idea right, then I think it will be very expensive.
You can get the time of the last refresh that happened on each shard, but
that's only information about that shard's refresh; it might not have
affected the data being searched.

Time of the refresh won't be enough, because the data is multi-tenant, so
any change from any user will affect that timestamp but not necessarily
affect the data we searched.

Can you explain why you think the hashing would be expensive? I'm just
curious to understand why; I'll trust your judgement obviously, but I like
understanding the internals.

What are you going to hash on? Any value you choose to hash on will need
to be loaded into memory to be used in a script for it to perform well
(more memory requirements), and MD5 hashing, for example, on 5 million
values can be expensive.

You can use scrolling to try and solve it. You can scroll your search
request, and if an export is needed, just continue to scroll the data set
for the export. Note, though, that a non-scan scroll can get expensive,
especially if the result set is big.

IIRC the scroll API requires a timeout setting so the ES server can tear
down resources, and that timeout is difficult for us to pick. It might
work, but long-lived scrolls might also get expensive.

Yes, it has to have a timeout. What is kept around are references to the
index files, so if things change and those index files get merged away,
they won't be deleted until the last scroll releases its handle on them. It
means more memory and more size on disk.

The results indeed can be big. Alas, we're talking millions of items at
times.

If the results are spread across a good number of shards, can the sorted
results be merge-sorted efficiently in an iterating, streaming style?

I guess we should discuss how ES deals with a very large number of results
sorted by a field (not by score). How does the node handling the request
merge the shard-level results?

Appreciate your time on this; it's quite critical for our migration of our
index infrastructure, which has to cope with this (we end up with a very
large pre-initialized PriorityQueue of size <#results>, lots of memory...
but only one shard on one host).

It depends which search type you use, query_and_fetch or query_then_fetch.
For both, each shard builds a priority queue sized "from + size" from the
search request in order to properly return sorted results (even for score).

For query_then_fetch, the ids and sorting values of the top "from + size"
hits are returned from each shard and re-sorted on the node doing the
scatter/gather. Then, the relevant ids are fetched from the relevant
shards.

For query_and_fetch, only "size" hits are returned from each shard
(possibly adjusted by "from"), along with the data itself. This is faster
and puts fewer constraints on the node doing the re-sorting, but it can
lead to results not being totally ordered.
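
To restate the query_then_fetch merge step in code (nothing ES-specific
here; the shard lists and sort values are invented, and ascending sort is
assumed):

    from heapq import merge
    from itertools import islice

    def merge_shard_tops(per_shard, frm, size):
        """per_shard: one list per shard of (sort_value, doc_id) pairs, each
        already sorted and at most (frm + size) long. Returns the globally
        sorted window [frm, frm + size)."""
        window = islice(merge(*per_shard), frm, frm + size)
        return [doc_id for _sort_value, doc_id in window]

    shard_a = [(1, "a1"), (4, "a2"), (9, "a3")]
    shard_b = [(2, "b1"), (3, "b2"), (8, "b3")]
    print(merge_shard_tops([shard_a, shard_b], 0, 4))  # ['a1', 'b1', 'b2', 'a2']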

What are you going to hash on? Any value you choose to hash on will need
to be loaded into memory to be used in a script for it to perform well
(more memory requirements), and MD5 hashing, for example, on 5 million
values can be expensive.

Just the ID. The computed hash of the results is the hash of all IDs in
that result, in order. Any change in the result means a change in the
sequence of IDs, and hence a different hash.

It depends which search type you use, query_and_fetch or query_then_fetch.
For both, each shard builds a priority queue sized "from + size" from the
search request in order to properly return sorted results (even for score).

For query_then_fetch, the ids and sorting values of the top "from + size"
hits are returned from each shard and re-sorted on the node doing the
scatter/gather. Then, the relevant ids are fetched from the relevant
shards.

For query_and_fetch, only "size" hits are returned from each shard
(possibly adjusted by "from"), along with the data itself. This is faster
and puts fewer constraints on the node doing the re-sorting, but it can
lead to results not being totally ordered.

Sounds like we'd have to use query_then_fetch, since ordering is
important. We would do an unsorted query first to get the size of the
result set (with 0 results returned, which is fastest) and then reissue a
query_then_fetch to get the right number of results.
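
Roughly (same assumptions as before: Python requests against the HTTP API,
a made-up index name and sort field, and hits.total read as a plain
number), with the obvious caveat that asking for millions of hits in one
request is exactly the memory problem above, so in practice step 2 would be
the sorted scroll rather than a single request:

    import requests

    ES = "http://localhost:9200"   # assumed local node
    query = {"match_all": {}}      # stand-in for the real query

    # step 1: unsorted, size 0 -- just the total hit count
    count = requests.post(
        ES + "/items/_search", json={"query": query, "size": 0}
    ).json()["hits"]["total"]

    # step 2: re-issue sorted with query_then_fetch, asking for that many hits
    body = {"query": query, "sort": [{"updated_at": "desc"}], "size": count}
    resp = requests.post(
        ES + "/items/_search?search_type=query_then_fetch", json=body
    ).json()
    ids = [hit["_id"] for hit in resp["hits"]["hits"]]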

Does anyone else do large sorts and full result set materialization? We
can't be the only ones with a requirement to return full result sets in
sorted order. Any other suggestions, anyone?

Paul