Hi, I've been circling around looking at ElasticSearch to replace some
home-grown infrastructure around Lucene we built back in 2005. Way
back then we had the requirement that index updates had to be visible
in a large index (many GB) within 5 seconds of the change, so
effectively we came up with our own version of NRT. We've been
waiting for the Lucene community to come up with their NRT, because
frankly, ours is way too hacky. I see ElasticSearch has the same
capability, perhaps based on the post-Lucene 2.9 code, or not.
One of the features we need is a way to verify the state of an index.
Since all our content's 'truth' stems from our database, we have
historically indexed a 'lastupdated' timestamp. We have triggers on
our DB tables so that the last modification in the DB is automatically
tracked, and any 'missed update' to the index can easily be determined
by comparing an item's lastupdated stamp from the DB with the
value stored in the index.
We index hundreds of millions of items, so by design we use a merge-
stream style approach for the verification to proceed quickly. We
pull the ID of each item and its lastupdated value from the DB in a
result set stream, sorted by ID lexicographically (more on that
later), then pair this stream with a second one, obtained by walking
the ID field's term docs/enum (Lucene-based API) and pulling out the
lastupdated stored field value. We then join the streams together
looking for holes (items in the DB not in the index, items still in
the index that have been deleted in the DB, or timestamp mismatches).
This is the reason the ID is sorted lexicographically: that's
the way we walk the term values as they are natively stored.
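
To make the join concrete, here is a minimal sketch of the idea in
plain Java. It is not our production code: the class, the Entry type
and the problem labels are made up for illustration, and the two
iterators stand in for the DB result set and the term walk. It assumes
both streams are sorted in String.compareTo order with one entry per
ID; whether your database's collation matches that order is something
you would need to check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a merge-style verification over two sorted streams.
class IndexVerifier {

    /** One (id, lastupdated) pair, from either the DB or the index. */
    static final class Entry {
        final String id;
        final long lastUpdated;

        Entry(String id, long lastUpdated) {
            this.id = id;
            this.lastUpdated = lastUpdated;
        }
    }

    /** Returns the next entry, or null when the stream is exhausted. */
    private static Entry advance(Iterator<Entry> it) {
        return it.hasNext() ? it.next() : null;
    }

    /**
     * Walks both streams once, in lockstep. Both must be sorted by id in
     * String.compareTo order (an assumption of this sketch).
     */
    static List<String> verify(Iterator<Entry> db, Iterator<Entry> index) {
        List<String> problems = new ArrayList<>();
        Entry d = advance(db);
        Entry i = advance(index);
        while (d != null || i != null) {
            // An exhausted stream behaves as if its next id sorts after everything.
            int cmp = (d == null) ? 1 : (i == null) ? -1 : d.id.compareTo(i.id);
            if (cmp < 0) {
                // In the DB but not in the index: a missed add.
                problems.add("MISSING_FROM_INDEX " + d.id);
                d = advance(db);
            } else if (cmp > 0) {
                // In the index but not in the DB: a missed delete.
                problems.add("STALE_IN_INDEX " + i.id);
                i = advance(index);
            } else {
                // Same id on both sides: compare the timestamps.
                if (d.lastUpdated != i.lastUpdated) {
                    problems.add("TIMESTAMP_MISMATCH " + d.id);
                }
                d = advance(db);
                i = advance(index);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Note the lexicographic order: "1" < "10" < "2" < "9".
        List<Entry> db = Arrays.asList(
                new Entry("1", 100L), new Entry("10", 200L), new Entry("9", 300L));
        List<Entry> index = Arrays.asList(
                new Entry("10", 150L), new Entry("2", 50L), new Entry("9", 300L));
        for (String problem : verify(db.iterator(), index.iterator())) {
            System.out.println(problem);
        }
    }
}
```

Because each side is consumed strictly in order, memory use stays
constant no matter how many items are compared, which is the whole
point of doing it this way rather than collecting search results.
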
We deliberately don't use Lucene searches here because we need all
results for comparison, and Lucene is not great at doing searches and
returning large result sets (the internal PriorityQueue causes memory
issues because it has to allocate the size of the result ahead of
time). The retrieval of each item's lastupdated stamp from the stored
field area causes issues during HitCollection too anyway; it's
actually more efficient to scan the ID field term docs and pull it out
that way.
So with that in mind, I went to see how ElasticSearch stores its
indexes to see how we'd go about being able to do the same thing. The
API is really neat, so I'd love to be able to do this directly via the
API and not go down low and raw, inspecting the index on disk outside
of ElasticSearch, if I can. So I'm looking for ideas on how to do this
efficiently.
With the sharding, and with the way a lot of in-memory magic is done,
I'm not exactly sure of the best way of demonstrating a proof of
concept index verification process. This also may be because
ElasticSearch is using Lucene 3.0.2, with which I'm not familiar (ours
is based on older 2.4 code).
Any ideas, or pointers to the API, on how I could go about this would
be appreciated.
Cheers,
Paul Smith