Index Verification

Hi, I've been circling around looking at Elasticsearch to replace some
home-grown infrastructure around Lucene that we built back in 2005. Way
back then we had the requirement that index updates had to be visible
in a large index (many GBs) within 5 seconds of the change, so
effectively we came up with our own version of NRT. We've been
waiting for the Lucene community to come up with their NRT because,
frankly, ours is way too hacky.. :) I see Elasticsearch has the same
capability, perhaps based on the post-2.9 Lucene code, or not.

One of the features we need is a way to verify the state of an index.
Since the 'truth' for all our content stems from our database, we have
historically indexed a 'lastupdated' timestamp. We have triggers on
our DB tables so that the last modification in the DB is automatically
tracked, and any 'missed update' to the index can easily be detected
by comparing an item's lastupdated stamp from the DB with the value
stored in the index.

We index hundreds of millions of items, so by design we use a merge-
stream style approach so that the verification proceeds quickly. We
pull the ID of each item and its lastupdated value from the DB in a
result set stream, sorted lexicographically by ID (more on that
later), then join this stream with another built by walking the ID
field's term docs/enum (Lucene-based API) and pulling out the
lastupdated stored field value. We then join the streams together
looking for holes (items in the DB not in the index, items still in
the index that have been deleted in the DB, or timestamp mismatches).
This is why the IDs are sorted lexicographically: that's the order in
which the term values are natively stored when we walk them.
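
To make that concrete, here's a rough sketch of the merge join against the
Lucene 2.x/3.x TermEnum/TermDocs API (the field names, the DbRow type and
the report* methods are illustrative stand-ins, not our actual code):

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

/** Sketch of the DB-vs-index merge join described above; names are illustrative. */
public class IndexVerifier {

    /** One row from the DB stream: primary key plus last-modified stamp, sorted by id. */
    public static class DbRow {
        public final String id;
        public final long lastUpdated;
        public DbRow(String id, long lastUpdated) {
            this.id = id;
            this.lastUpdated = lastUpdated;
        }
    }

    public void verify(IndexReader reader, Iterator<DbRow> dbStream) throws IOException {
        // Walk the "id" field's terms in their natural (lexicographic) order.
        TermEnum terms = reader.terms(new Term("id", ""));
        TermDocs termDocs = reader.termDocs();
        // Only load the lastupdated stored field, not the whole document.
        FieldSelector onlyLastUpdated = new MapFieldSelector(new String[] { "lastupdated" });

        DbRow dbRow = dbStream.hasNext() ? dbStream.next() : null;
        try {
            do {
                Term t = terms.term();
                if (t == null || !"id".equals(t.field())) {
                    break; // ran past the id field
                }
                String indexId = t.text();

                // DB ids smaller than the current index id were never indexed.
                while (dbRow != null && dbRow.id.compareTo(indexId) < 0) {
                    reportMissingFromIndex(dbRow);
                    dbRow = dbStream.hasNext() ? dbStream.next() : null;
                }

                termDocs.seek(terms);
                while (termDocs.next()) {
                    // Assumes lastupdated is stored as a parseable long.
                    long indexStamp = Long.parseLong(
                            reader.document(termDocs.doc(), onlyLastUpdated).get("lastupdated"));
                    if (dbRow == null || dbRow.id.compareTo(indexId) > 0) {
                        reportDeletedFromDb(indexId);   // in the index, gone from the DB
                    } else if (dbRow.lastUpdated != indexStamp) {
                        reportStaleDocument(indexId, dbRow.lastUpdated, indexStamp);
                    }
                }
                if (dbRow != null && dbRow.id.equals(indexId)) {
                    dbRow = dbStream.hasNext() ? dbStream.next() : null;
                }
            } while (terms.next());

            // Anything left on the DB side was never indexed either.
            while (dbRow != null) {
                reportMissingFromIndex(dbRow);
                dbRow = dbStream.hasNext() ? dbStream.next() : null;
            }
        } finally {
            termDocs.close();
            terms.close();
        }
    }

    private void reportMissingFromIndex(DbRow row) { /* log / queue a re-index */ }
    private void reportDeletedFromDb(String id) { /* log / queue a delete */ }
    private void reportStaleDocument(String id, long dbStamp, long indexStamp) { /* log / re-index */ }
}
```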

We deliberately don't use Lucene searches here because we need all
results for the comparison, and Lucene is not great at returning very
large result sets (the internal PriorityQueue causes memory issues
because it has to be allocated at the size of the result ahead of
time). Retrieving each item's lastupdated stamp from the stored
fields during hit collection causes issues too; it's actually more
efficient to scan the id field's term docs and pull it out that way.

So with that in mind, I went to see how Elasticsearch stores its
indexes, to work out how we'd go about doing the same thing. The API
is really neat, so I'd love to be able to do this directly via the
API rather than going low-level and inspecting the index on disk
outside of Elasticsearch, so I'm looking for ideas on how to do this
efficiently.

With the sharding, and with the way a lot of in-memory magic is done,
I'm not exactly sure of the best way to demonstrate a proof-of-concept
index verification process. That may also be because Elasticsearch is
using Lucene 3.0.2, which I'm not familiar with (ours is based on the
older 2.4 code).

Any ideas or pointers to the API on how I could go about this would
be appreciated.

cheers,

Paul Smith

Hi,

I think that you should be OK with doing a search and then scrolling it,
fetching back just the id and the lastupdated fields (if the id is the same
as the "doc" id, then you get it back for free; for the lastupdated value,
you can either store it or get it loaded into memory a la the field cache).
You can use query+fetch, and since each hit is small (you don't fetch the
_source back), you can get thousands of results for each scrolling action.
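
Something along these lines, roughly. This is illustrative only: the index
and field names are made up, and the exact request parameters (how stored
fields are requested, how the scroll id is passed back in) depend on the
elasticsearch version you run against, so check the docs for yours:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of scrolling over every document via the REST API, pulling
// back only the id and lastupdated stored fields.
public class ScrollSketch {

    private static final Pattern SCROLL_ID =
            Pattern.compile("\"_scroll_id\"\\s*:\\s*\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        // 1. Open the scroll: match everything, modest per-page size, no _source.
        String searchBody = "{\"size\":1000,\"fields\":[\"id\",\"lastupdated\"],"
                + "\"query\":{\"match_all\":{}}}";
        String page = post("http://localhost:9200/items/_search?scroll=5m", searchBody);

        // 2. Keep pulling pages until one comes back with no hits.
        while (true) {
            Matcher m = SCROLL_ID.matcher(page);
            if (!m.find()) break;
            String scrollId = m.group(1);

            // ... feed this page's (id, lastupdated) pairs into the verification join ...

            page = post("http://localhost:9200/_search/scroll?scroll=5m", scrollId);
            if (page.contains("\"hits\":[]")) break; // crude "no more hits" check
        }
    }

    private static String post(String url, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        StringBuilder response = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            for (String line; (line = in.readLine()) != null; ) {
                response.append(line);
            }
        }
        return response.toString();
    }
}
```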

The question here is whether you should use elasticsearch's scrolling
support or do it yourself. As you pointed out, the problem is that the
priority queue built for each search scroll might be too big. In your case
it might have been, because (I assume) your index was not sharded, but here
you might have enough memory on each node to do that.

If not, it's quite easy to implement your own scrolling. You can execute a
"match_all" search with a range filter on the ids (1-10000, 10000-20000, and
so on), cap the size returned at the worst-case batch size (in the example I
gave, 10000), and do the verification per batch against the DB. Note that
the verification of each batch (search, fetch from DB, verify) can easily be
parallelized.
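
As a sketch of what I mean (the two fetch methods are placeholders you would
wire up to your own elasticsearch client and JDBC code; the exact range
filter syntax depends on the version):

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the "do the scrolling yourself" variant: partition the numeric id
// space into fixed-size ranges and verify each range as an independent task.
public class RangeBatchVerifier {

    private static final long BATCH_SIZE = 10000;

    public static void verify(long maxId, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (long from = 0; from < maxId; from += BATCH_SIZE) {
            final long lo = from;
            final long hi = from + BATCH_SIZE;
            pool.submit(new Runnable() {
                public void run() {
                    // Each batch is small enough that both sides fit in memory.
                    Map<String, Long> indexed = fetchIndexBatch(lo, hi); // id -> lastupdated, from the index
                    Map<String, Long> stored = fetchDbBatch(lo, hi);     // id -> lastupdated, from the DB
                    compare(indexed, stored);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder: a match_all search with a range filter on id covering [lo, hi),
    // size capped at BATCH_SIZE, returning only the id and lastupdated fields.
    static Map<String, Long> fetchIndexBatch(long lo, long hi) {
        return Collections.emptyMap();
    }

    // Placeholder: e.g. SELECT id, lastupdated FROM items WHERE id >= ? AND id < ?
    static Map<String, Long> fetchDbBatch(long lo, long hi) {
        return Collections.emptyMap();
    }

    static void compare(Map<String, Long> indexed, Map<String, Long> stored) {
        for (Map.Entry<String, Long> row : stored.entrySet()) {
            Long indexStamp = indexed.get(row.getKey());
            if (indexStamp == null) {
                System.out.println("missing from index: " + row.getKey());
            } else if (!indexStamp.equals(row.getValue())) {
                System.out.println("stale in index: " + row.getKey());
            }
        }
        for (String id : indexed.keySet()) {
            if (!stored.containsKey(id)) {
                System.out.println("deleted in DB but still indexed: " + id);
            }
        }
    }
}
```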

My hunch is that you will find that this is a good solution (not sure what
you are after in terms of a time limit on the verification process, you will
need to test it out, but it does not sound like it needs to be extra fast).
It's certainly much better than trying to work directly against the indices
elasticsearch stores locally, since those do tend to move around ;).

-shay.banon

On Sep 7, 5:45 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Hi,

I think that you should be OK with doing a search and then scrolling it,
fetching back just the id and the lastupdated fields (if the id is the same
as the "doc" id, then you get it back for free; for the lastupdated value,
you can either store it or get it loaded into memory a la the field cache).
You can use query+fetch, and since each hit is small (you don't fetch the
_source back), you can get thousands of results for each scrolling action.

In our case, the id field is our table primary key, so it'll be
another field to retrieve, but that's no biggie really.

The question here is whether you should use elasticsearch's scrolling
support or do it yourself. As you pointed out, the problem is that the
priority queue built for each search scroll might be too big. In your case
it might have been, because (I assume) your index was not sharded, but here
you might have enough memory on each node to do that.

Yes, no sharding for us. While we have our own home-grown NRT, it's
'sharded' by having a separate index per logical grouping ('project'
in our case). A single index could still contain many millions of
records, though. We were originally thinking of combining these
project indexes to take advantage of term efficiencies (particularly
given the number of projects is now getting pretty large, the
FieldCache memory for sorting is duplicated for many repeating terms),
so it's nice to see that Elasticsearch does behind-the-scenes sharding
of a logical index anyway.

If not, it's quite easy to implement your own scrolling. You can execute a
"match_all" search with a range filter on the ids (1-10000, 10000-20000, and
so on), cap the size returned at the worst-case batch size (in the example I
gave, 10000), and do the verification per batch against the DB. Note that
the verification of each batch (search, fetch from DB, verify) can easily be
parallelized.

Yes, this partitioning would work, and if we combined our project
indexes into one logical Elasticsearch index (with behind-the-scenes
sharding) the distribution of ids would be fairly smooth. Only where
there's a decent percentage of row deletes (not common in our case)
might the id space end up with holes, so some partitions would be
smaller than others, but that's probably not a big deal really, more
academic. The parallelization is nice here.

My hunch is that you will find that this is a good solution (not sure what
you are after in terms of a time limit on the verification process, you will
need to test it out, but it does not sound like it needs to be extra fast).
It's certainly much better than trying to work directly against the indices
elasticsearch stores locally, since those do tend to move around ;).

The trick here will be balancing the id partition size so that the
overall number of round trips back to the Elasticsearch server isn't
so frequent that the overhead slows things down, versus the total load
(CPU and memory) on the server. I've got something to mock up and do
a proof of concept with anyway.

The only complication here is the sorting by id value. The issue with
sorting on a unique value like an id is that the FieldCache values
consume lots of memory (in fact that's why our hacky NRT was hard:
with frequent updates, the IndexReader reopen was a killer otherwise).

Many thanks Shay,

Paul

On Wed, Sep 8, 2010 at 7:22 AM, tallpsmith tallpsmith@gmail.com wrote:

On Sep 7, 5:45 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Hi,

I think that you should be OK with doing a search and then scrolling it,
fetching back just the id and the lastupdated fields (if the id is the same
as the "doc" id, then you get it back for free; for the lastupdated value,
you can either store it or get it loaded into memory a la the field cache).
You can use query+fetch, and since each hit is small (you don't fetch the
_source back), you can get thousands of results for each scrolling action.

In our case, the id field is our table primary key, so it'll be
another field to retrieve, but that's no biggie really.

The question here is whether you should use elasticsearch's scrolling
support or do it yourself. As you pointed out, the problem is that the
priority queue built for each search scroll might be too big. In your case
it might have been, because (I assume) your index was not sharded, but here
you might have enough memory on each node to do that.

Yes, no sharding for us. While we have our own home-grown NRT, it's
'sharded' by having a separate index per logical grouping ('project'
in our case). A single index could still contain many millions of
records, though. We were originally thinking of combining these
project indexes to take advantage of term efficiencies (particularly
given the number of projects is now getting pretty large, the
FieldCache memory for sorting is duplicated for many repeating terms),
so it's nice to see that Elasticsearch does behind-the-scenes sharding
of a logical index anyway.

Yep, elasticsearch shards an index, and you can create several indices
dynamically. If you add more machines, shards get balanced across the
cluster.

If not, it's quite easy to implement your own scrolling. You can execute a
"match_all" search with a range filter on the ids (1-10000, 10000-20000, and
so on), cap the size returned at the worst-case batch size (in the example I
gave, 10000), and do the verification per batch against the DB. Note that
the verification of each batch (search, fetch from DB, verify) can easily be
parallelized.

Yes, this partitioning would work, and if we combined our project
indexes into one logical Elasticsearch index (with behind-the-scenes
sharding) the distribution of ids would be fairly smooth. Only where
there's a decent percentage of row deletes (not common in our case)
might the id space end up with holes, so some partitions would be
smaller than others, but that's probably not a big deal really, more
academic. The parallelization is nice here.

The distribution of your secondary ids is not that relevant (I think). When
indexing, the document gets hashed (based on its id) to a shard. When
searching, a request is sent to all shards of an index, and each shard
returns what it needs to return.
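
Conceptually something like this (not the actual routing function
elasticsearch uses, just the idea that shard placement is a deterministic
function of the document id):

```java
// Illustration only: the real routing hash differs, but the point is that the
// target shard is derived purely from the document id, so gaps in your
// secondary id space have no effect on how documents spread across shards.
public class RoutingSketch {
    static int shardFor(String docId, int numberOfShards) {
        return (docId.hashCode() & 0x7fffffff) % numberOfShards;
    }
}
```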

My hunch is that you will find that this is a good solution (not sure what
you are after in terms of a time limit on the verification process, you will
need to test it out, but it does not sound like it needs to be extra fast).
It's certainly much better than trying to work directly against the indices
elasticsearch stores locally, since those do tend to move around ;).

The trick here will be balancing the id partition size so that the
overall number of round trips back to the Elasticsearch server isn't
so frequent that the overhead slows things down, versus the total load
(CPU and memory) on the server. I've got something to mock up and do
a proof of concept with anyway.

Yea, you will need to check it out. But even with 100 results (and not
10000) it should be relatively fast.

The only complication here is the sorting by id value. The issue with
sorting on a unique value like an id is that the FieldCache values
consume lots of memory (in fact that's why our hacky NRT was hard:
with frequent updates, the IndexReader reopen was a killer otherwise).

There is a big change in how Lucene works: index segments that have not
changed are not reopened, and so the FieldCache is not rebuilt for them on
reopen.
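
For example, the reopen pattern in Lucene 2.9+/3.x looks roughly like this;
unchanged segment readers are shared between the old and new reader, so
their per-segment FieldCache entries (used for sorting) are reused and only
new segments pay the warm-up cost:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class ReaderRefresher {
    // Returns a reader that sees the latest changes. If nothing changed,
    // reopen() returns the same instance; otherwise only the new or
    // modified segments are actually opened.
    static IndexReader refresh(IndexReader current) throws IOException {
        IndexReader reopened = current.reopen();
        if (reopened != current) {
            current.close(); // close the old reader once the new one takes over
        }
        return reopened;
    }
}
```

(In real code you would hold off closing the old reader until any in-flight
searches against it have finished.)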

Many thanks Shay,

Paul