I thought I would share with the ES community our plans to build and
open-source an "Index Verification" application. This message will probably
fall into the 'tl;dr' but hopefully not!
Many of you will have experienced some sort of data loss or corruption, where
the index state is not consistent with some external point of truth (say,
the database containing the stuff you're indexing). Not necessarily because
of ElasticSearch (although perhaps... ), but because eventually, sh1t
happens. A bug in your application, a random weird EC2 issue, an OOM; geez,
there's just a bucket load of conditions that can cause it.
At any point in time, how do you know if your index is correct? If your
index is large, reindexing it all may just take far too long. If you knew
which items were wrong, it might be quicker to recover by reindexing
just those bits. This is predicated on the idea that checking your index
state is significantly faster than reindexing, but from experience, I think
that's true. In an ideal world there are never any errors, but I don't
think we live in an ideal world!
There are some applications where the correctness of the index is
imperative: decisions are made based on the information returned
(not necessarily just about 'finding' stuff, but by what the results as a
whole tell you). And in a disaster-recovery scenario, I'd think we'd want to
be pretty sure the index is good to go.
I originally posted a question about how one could solve this with ES; the
original thread is linked at the end.
We have an Index Verifier application built in-house that worked with the
custom index framework we built many years ago. Every day it
checked 145 million records for correctness. But it's now time to move to ES,
and we still need an Index Verifier. I think the entire ES community
could benefit from it too, and could collaborate on and improve it as a
shared project.
The basic principle is to use the ID plus a 'timestamp' (an
epoch time stored as a long) as a version signature for each record. Every
time a row in your data source (DB) is modified, the timestamp is updated
automatically, say by a DB trigger. We call this 'lastupdated'. When a
record is modified, the index is updated and also stores this lastupdated
timestamp signature. You have extremely high confidence that your index
information is correct for that item if the timestamps match between your
source and the index.
ElasticSearch's built-in support for the ID and 'version' concept is
fantastic. By using the 'external' version type
(.setVersion(timestamp).setVersionType(VersionType.EXTERNAL) in Java
API speak) you lock this information in with the record in ES.
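To make the external-version semantics concrete, here's a toy in-memory model of the behaviour the verifier relies on: a write carrying an older lastupdated timestamp is rejected, so the index always ends up holding the newest signature. This is a sketch of the concept only; the class and method names are illustrative, not the real ES API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of EXTERNAL versioning: a write is applied only when the
// supplied version (our lastupdated timestamp) is greater than the
// version already stored for that ID.
public class ExternalVersionModel {
    private final Map<String, Long> versions = new HashMap<>();

    /** Returns true if the write was accepted. */
    public boolean index(String id, long externalVersion) {
        Long current = versions.get(id);
        if (current != null && externalVersion <= current) {
            return false; // stale write rejected; newer signature kept
        }
        versions.put(id, externalVersion);
        return true;
    }

    public Long versionOf(String id) {
        return versions.get(id);
    }

    public static void main(String[] args) {
        ExternalVersionModel idx = new ExternalVersionModel();
        System.out.println(idx.index("42", 1000L)); // true: first write
        System.out.println(idx.index("42", 900L));  // false: older timestamp
        System.out.println(idx.index("42", 1500L)); // true: newer timestamp
        System.out.println(idx.versionOf("42"));    // 1500
    }
}
```

This is also why out-of-order updates from a retrying indexer are harmless: the timestamp signature, not arrival order, decides what sticks.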
Index Verifier then takes a sorted stream of ID:Timestamp tuples from your
source of truth and compares it against the equivalent stream from ES.
Walking the streams, it looks for gaps: IDs missing from the index, IDs that
have been deleted in the source stream but were correspondingly not deleted
in the index, and matching IDs with mismatched timestamps. Since you
probably have an index on your ID column in your DB, retrieving that stream
in sorted order from the DB is trivial and fast.
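The walk itself is a classic merge-join over the two sorted streams. A minimal sketch (names are illustrative; the real tool streams rather than holding maps in memory):

```java
import java.util.*;

// Compare two streams of ID:timestamp tuples, both sorted by ID -
// one from the source of truth, one from the index - and report
// IDs missing from the index, orphans left behind after a source
// delete, and matching IDs whose timestamp signatures differ.
public class StreamWalk {
    public static List<String> compare(SortedMap<String, Long> source,
                                       SortedMap<String, Long> index) {
        List<String> problems = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> s = source.entrySet().iterator();
        Iterator<Map.Entry<String, Long>> i = index.entrySet().iterator();
        Map.Entry<String, Long> se = s.hasNext() ? s.next() : null;
        Map.Entry<String, Long> ie = i.hasNext() ? i.next() : null;
        while (se != null || ie != null) {
            int cmp = se == null ? 1 : ie == null ? -1
                    : se.getKey().compareTo(ie.getKey());
            if (cmp < 0) {            // in source, missing from index
                problems.add("MISSING " + se.getKey());
                se = s.hasNext() ? s.next() : null;
            } else if (cmp > 0) {     // deleted in source, still indexed
                problems.add("ORPHAN " + ie.getKey());
                ie = i.hasNext() ? i.next() : null;
            } else {
                if (!se.getValue().equals(ie.getValue())) {
                    problems.add("STALE " + se.getKey());
                }
                se = s.hasNext() ? s.next() : null;
                ie = i.hasNext() ? i.next() : null;
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        SortedMap<String, Long> src =
            new TreeMap<>(Map.of("a", 1L, "b", 2L, "d", 4L));
        SortedMap<String, Long> idx =
            new TreeMap<>(Map.of("a", 1L, "b", 9L, "c", 3L));
        System.out.println(compare(src, idx));
        // [STALE b, ORPHAN c, MISSING d]
    }
}
```

Because both inputs are sorted, the comparison is a single linear pass with constant memory per stream, which is what makes verifying very large indexes tractable.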
Getting the sorted stream from ElasticSearch is the kicker. Retrieving
AND sorting a large number of results is not efficient in ES (or any
Lucene-based framework), because Lucene relies on a PriorityQueue for
sorting, and a large result set is very inefficient here (LOTS of memory).
Enter ElasticSearch's _scan API.
My initial testing on an index with 10 million records, on a simple 2-node
cluster with vanilla config, used the _scan API to retrieve batches of
100,000 records from ES (just the ID and timestamp), writing them to a local
file and then doing an external merge sort. It took ~62 seconds on my MBP,
roughly 150,000 records/second on average; the merge sort took only an extra
9 seconds. It took a bit over 7 minutes to index this relatively simple
index (no replicas), and as an index grows in complexity, usually so does
the indexing time, so you can hopefully see the Index Verification process
is a nice quick way to check the state.
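The external merge sort step looks roughly like this: each unsorted _scan batch is sorted in memory and spilled to a run file, then the runs are merged with a small PriorityQueue, so memory stays bounded no matter how big the index is. A hedged sketch (file format, names, and batch sizes here are all illustrative):

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// External merge sort over "id<TAB>timestamp" lines: sort each batch in
// memory, spill it to a temp file, then k-way merge the run files using
// a PriorityQueue holding one line per run.
public class ExternalSort {
    static Path spillSortedRun(List<String> batch) throws IOException {
        Collections.sort(batch);                  // sort one batch in memory
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, batch);
        return run;
    }

    static List<String> mergeRuns(List<Path> runs) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        // queue entries: {current line, reader index as string}
        PriorityQueue<String[]> pq =
            new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            readers.add(r);
            String line = r.readLine();
            if (line != null)
                pq.add(new String[]{line, String.valueOf(readers.size() - 1)});
        }
        List<String> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            String[] head = pq.poll();
            out.add(head[0]);
            String next = readers.get(Integer.parseInt(head[1])).readLine();
            if (next != null) pq.add(new String[]{next, head[1]});
        }
        for (BufferedReader r : readers) r.close();
        return out;
    }

    public static void main(String[] args) throws IOException {
        List<Path> runs = new ArrayList<>();
        runs.add(spillSortedRun(new ArrayList<>(List.of("id3\t30", "id1\t10"))));
        runs.add(spillSortedRun(new ArrayList<>(List.of("id4\t40", "id2\t20"))));
        System.out.println(mergeRuns(runs)); // ids emerge in sorted order
    }
}
```

Note the PriorityQueue here only ever holds one entry per run file, which is exactly why this approach avoids the large-result-sort memory problem described above.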
This is what we call a "Full" Index Verification. There is a way to perform
a "Partial" verification by only looking at changes from the source side
since a certain timestamp. By adding an on-delete trigger on the source (DB)
side that records an ID:Timestamp tuple in a companion table whenever a row
is deleted, some crafty SQL can produce a sorted stream of ID:timestamp
tuples covering every change since a given point in time, including
deletions. You can retrieve the same from ES by using the _scan API again
with a filter based on the timestamp. You then walk the now much smaller
sorted streams, matching them up and looking for IDs not in the index,
deletes that never made it, or mismatched timestamp signatures. This is why
having a
timestamp is a nice 'version' signature, because it has a temporal property
useful in this scenario. If you do one Full verification, and then a
Partial run that overlaps the time window when you last did the full, you
can have confidence of the index state of recent changes.
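Building the source-side stream for a partial run can be sketched as follows, assuming a trigger-maintained companion table of deletions. Everything here (names, the in-memory maps standing in for the two tables, the DELETED marker) is illustrative of the crafty SQL, not a real schema:

```java
import java.util.*;

// Source-side stream for a "Partial" run: rows modified since a cutoff
// come from the main table; IDs removed since the cutoff come from a
// companion table of deletions. Both are filtered by lastupdated >= cutoff,
// merged, and sorted by ID so the usual merge walk can consume them.
public class PartialStream {
    record Tuple(String id, long lastUpdated, boolean deleted) {}

    static List<Tuple> since(long cutoff,
                             Map<String, Long> liveRows,
                             Map<String, Long> deletedRows) {
        List<Tuple> out = new ArrayList<>();
        liveRows.forEach((id, ts) -> {
            if (ts >= cutoff) out.add(new Tuple(id, ts, false));
        });
        deletedRows.forEach((id, ts) -> {
            if (ts >= cutoff) out.add(new Tuple(id, ts, true));
        });
        out.sort(Comparator.comparing(Tuple::id)); // same order as the index stream
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> live = Map.of("a", 50L, "b", 200L);    // "a" predates cutoff
        Map<String, Long> deleted = Map.of("c", 150L);           // deleted after cutoff
        for (Tuple t : since(100L, live, deleted)) {
            System.out.println(t.id() + " " + t.lastUpdated()
                    + (t.deleted() ? " DELETED" : ""));
        }
        // b 200
        // c 150 DELETED
    }
}
```

The temporal property of the timestamp is doing the work here: a single cutoff value selects updates and deletes alike, which is what makes the partial run so much smaller than a full one.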
Just like some filesystems 'scrub' frequently looking for errors rather than
doing one large pass, this partial mode could be run regularly during the
day to scrub out any off data.
There are opportunities to support a range of input source types (DB, HBase,
MongoDB, etc.) via a simple plugin system. We're specifically starting off
with the DB in mind, since that's our use case and is probably pretty
common, but a community effort around this could produce something pretty
cool. Honestly, there's no reason the 'check' target has to be
ElasticSearch; it could be useful for Solr too.
We'll be putting it up on GitHub under an ASL 2 license, and we'll let you
know more via this channel once it's arrived. I'd like to hear whether
people would be interested in using this application, and whether anyone is
interested in collaborating on it.
 Index Verification - original thread