[Proposal] Index Verification

I thought I would share with the ES community our plans to build and
open-source an "Index Verification" application. This message will probably
fall into the 'tl;dr' category, but hopefully not!

Many of you will have experienced some sort of 'data loss'/corruption, where
the index state is not consistent with some external point-of-truth (say,
your database containing the stuff you're indexing). Not necessarily because
of ElasticSearch (although perhaps... :) ), but because eventually, sh1t
happens. A bug in your application, a random weird EC2 issue, an OOM; geez,
there's just a bucket load of conditions that can happen.

At any point in time, how do you know if your index is correct? If your
index is large, reindexing it all may just take far too long. If you knew
which items were wrong, it may be quicker to recover by reindexing
just those bits. This is predicated on the idea that checking your index
state is significantly faster than reindexing, but from experience, I think
that's true. In an ideal world there are never any errors, but I don't
think we live in an ideal world! :)

There are some applications where the correctness of the index is
imperative: decisions are made based on the information returned (not
necessarily just about 'finding' stuff, but about what the results as a
whole tell you). In the case of a Disaster Recovery, I'd think we'd want to
be pretty sure one's index is good to go.

I originally posted a question about how one could solve this with ES; you
can see my original post at [1], linked at the end.

We have an Index Verifier application built in-house that worked with our
initial custom index framework, built many years ago. Every day it
checked 145 million records for correctness. But it's now time to move to ES,
and we still need an Index Verifier; I'd think the entire ES community
could benefit from it too, collaborating on and improving it as a shared
project.

The basic principle is to use the ID plus a 'timestamp' (an epoch time
stored as a long) as a version signature for each record. Every time a row in
your data source (DB) is modified, the timestamp is updated automatically,
say by a DB trigger; we call this 'lastupdated'. When a record is
modified, the index is updated and also stores this lastupdated timestamp
signature. You have extremely high confidence that your index information
is correct for that item if the timestamps match between your source and the
index version.

ElasticSearch's built-in support for the ID and 'version' concept is
fantastic. By using the 'external' version type
( ...setVersion(timestamp).setVersionType(VersionType.EXTERNAL)... in Java
API speak) you lock in this information with the record in ES.
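
To make that concrete, here's a minimal sketch of indexing a record with its
lastupdated signature via the Java API (the index/type names and the helper
method are just placeholders I've made up for illustration):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.VersionType;

    public class IndexWithSignature {
        // Indexes one record, locking its 'lastupdated' epoch millis in as the ES version.
        public static void indexRecord(Client client, String id, String json, long lastUpdated) {
            client.prepareIndex("myindex", "mytype", id)
                  .setSource(json)                      // document body built from the source of truth
                  .setVersion(lastUpdated)              // the lastupdated timestamp as the version
                  .setVersionType(VersionType.EXTERNAL) // 'external' versioning, as described above
                  .execute()
                  .actionGet();
        }
    }

(With external versioning, ES should reject an attempt to index an older
version over a newer one, which suits this signature scheme nicely.)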

Index Verifier then takes a sorted stream of ID:Timestamp tuples from your
source of truth and compares it against the equivalent stream from ES. Walking
the streams, it looks for gaps: IDs missing from the index, IDs that have been
deleted in the source stream but were not correspondingly deleted in the
index, and matching IDs with a mismatch in timestamps. Since you probably
have an index on your ID column in your DB, retrieving that stream in sorted
order from the DB is trivial and fast.
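
The walk itself is just a merge of two sorted streams. Here's a sketch of the
core loop in plain Java (names like Entry and walk are mine, not from any
library):

    import java.util.Iterator;

    class Entry {
        final String id;
        final long timestamp;
        Entry(String id, long timestamp) { this.id = id; this.timestamp = timestamp; }
    }

    class StreamWalker {
        // Walks two ID-sorted streams and reports the three kinds of mismatch.
        static void walk(Iterator<Entry> source, Iterator<Entry> index) {
            Entry s = source.hasNext() ? source.next() : null;
            Entry i = index.hasNext() ? index.next() : null;
            while (s != null || i != null) {
                int cmp = (s == null) ? 1 : (i == null) ? -1 : s.id.compareTo(i.id);
                if (cmp < 0) {
                    System.out.println("missing from index: " + s.id);
                    s = source.hasNext() ? source.next() : null;
                } else if (cmp > 0) {
                    System.out.println("deleted in source but still in index: " + i.id);
                    i = index.hasNext() ? index.next() : null;
                } else {
                    if (s.timestamp != i.timestamp) {
                        System.out.println("stale: " + s.id
                                + " source=" + s.timestamp + " index=" + i.timestamp);
                    }
                    s = source.hasNext() ? source.next() : null;
                    i = index.hasNext() ? index.next() : null;
                }
            }
        }
    }

Because both streams are sorted on ID, the whole comparison is a single O(n)
pass with constant memory, no matter how big the index is.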

Getting the sorted stream from ElasticSearch is the kicker. Retrieving
AND sorting a large number of results is not efficient in ES (or any
Lucene-based framework) because Lucene relies on a PriorityQueue for
sorting, and a large result set is very inefficient here (LOTS of memory).
Enter ElasticSearch's _scan API.
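
For reference, the _scan usage looks roughly like this in the Java API of
that era (a hedged sketch; exact method names vary a little between ES
versions, and note the batch size is per shard):

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.SearchHit;

    class ScanDump {
        // Streams every (_id, version) pair out of the index via _scan / scroll.
        static void dump(Client client, String index) {
            SearchResponse resp = client.prepareSearch(index)
                    .setSearchType(SearchType.SCAN)
                    .setScroll(TimeValue.timeValueMinutes(2))
                    .setQuery(QueryBuilders.matchAllQuery())
                    .setVersion(true)   // return the (external) version with each hit
                    .setSize(1000)      // hits per shard per scroll round-trip
                    .execute().actionGet();

            while (true) {
                resp = client.prepareSearchScroll(resp.getScrollId())
                        .setScroll(TimeValue.timeValueMinutes(2))
                        .execute().actionGet();
                if (resp.getHits().getHits().length == 0) break;  // scan exhausted
                for (SearchHit hit : resp.getHits()) {
                    // in the real tool these go to a local file for the external merge sort
                    System.out.println(hit.getId() + ":" + hit.getVersion());
                }
            }
        }
    }

The output is unsorted (that's part of what makes _scan cheap), which is why
the external merge sort mentioned in the next paragraph follows it.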

My initial testing: an index with 10 million records on a simple 2-node
cluster with vanilla config, using the _scan API to retrieve batches of
100,000 records from ES (just the ID and Timestamp), writing them to a
local file, then doing an external merge sort. That took ~62 seconds on my
MBP, roughly 150,000 records/second on average, and the merge sort only
added an extra 9 seconds. It took a bit over 7 minutes to index this
relatively simple index (no replicas), and as an index grows in complexity,
so usually does the indexing time, so hopefully you can see the Index
Verification process is a nice, quick way to check the state.

This is what we call a "Full" Index Verification. There is also a way to
perform a "Partial" verification by only looking at changes on the source
side since a certain timestamp. By adding an on-delete trigger on the
source (DB) side that records the deleted row's ID:Timestamp tuple in a
companion table, you can retrieve with some crafty SQL a sorted stream of
ID:Timestamp tuples from your DB covering everything that changed since a
particular point in time, including row deletions. You can retrieve the
same from ES by using the _scan API again with a filter based on the
timestamp. You then walk the now much smaller sorted streams, matching them
up and looking for IDs not in the index, deletes that never made it, or
mismatched timestamp signatures. This is why a timestamp makes a nice
'version' signature: it has a temporal property that is useful in this
scenario. If you do one Full verification, and then a Partial run that
overlaps the time window of that Full run, you can have confidence in the
index state of recent changes.
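
On the ES side, a Partial run is then just the same scan with a timestamp
filter up front. Something like the following, assuming the document also
carries its 'lastupdated' as a field (the field, index and method names here
are my own placeholders):

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.FilterBuilders;
    import org.elasticsearch.index.query.QueryBuilders;

    class PartialScan {
        // Starts a _scan over only the documents touched since 'since' (epoch millis).
        static SearchResponse scanSince(Client client, String index, long since) {
            return client.prepareSearch(index)
                    .setSearchType(SearchType.SCAN)
                    .setScroll(TimeValue.timeValueMinutes(2))
                    .setQuery(QueryBuilders.filteredQuery(
                            QueryBuilders.matchAllQuery(),
                            FilterBuilders.rangeFilter("lastupdated").gte(since)))
                    .setVersion(true)
                    .setSize(1000)
                    .execute().actionGet();
        }
    }

The rest of the run (scrolling, dumping ID:Timestamp tuples, sorting, walking
the streams) is identical to the Full verification, just over a much smaller
set.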

Just like many filesystems, this partial mode could be run frequently during
the day to 'scrub' for any bad data, rather than doing one large check, much
as some filesystems scrub for errors on an ongoing basis.

There are opportunities to provide a range of input source types (DB, HBase,
MongoDB etc) via a simple plugin system. We're specifically starting off
with the DB in mind, since that's our use case and is probably a pretty
common one, but a community effort around this could produce something pretty
cool. Honestly, there's no reason the 'check' target has to be ElasticSearch
either; it could be useful for Solr too.
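
To give a feel for the plugin idea, the contract could be as small as one
interface that hands back a sorted stream of ID:Timestamp tuples (entirely
hypothetical naming, reusing the Entry tuple from the walker sketch above):

    import java.util.Iterator;

    // Hypothetical plugin contract: anything that can produce an ID-sorted
    // stream of (id, lastupdated) tuples can act as a source or a check target.
    interface VerificationSource {
        // Full run: every record. Partial run: only records changed since 'since'.
        Iterator<Entry> sortedIdTimestampStream(long since);
    }

A JDBC implementation, an HBase implementation and an ES implementation of
that one method would be all the verifier needs to walk any pair of them
against each other.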

We'll be putting it up on GitHub under an ASL 2 license; we'll let you know
more via this channel once it's arrived. I would like to hear whether there
are people interested in using this application, and whether anyone is
interested in collaborating on it.

cheers,

Paul Smith

[1] Index Verification - original thread
http://elasticsearch-users.115913.n3.nabble.com/Index-Verification-td1430219.html

I will definitely have a look at it, thanks for sharing it, and I will try
to think of ways to simplify the process on the ES side...

I just love how the ecosystem around elasticsearch is growing, and I
doubly appreciate the effort people put into open sourcing things. It's a lot
of effort to do...


Hi,

I'll admit to skimming through the second half of the message, but...
doesn't Zoie (ASL 2, I believe) have this or something like this
already? I'm mentioning this simply because if it does, it would
make sense to look at it first and borrow from it.

Otis

Sematext is hiring Search Engineers -- Jobs - Sematext


Otis,

Hmmm, I did a scan of the Zoie docs and for the life of me I can't see it,
but that doesn't mean it's not there.. ?

I would clearly rather NOT build this if I can borrow something that's already
there and being used. Can anyone else find it in the Zoie stuff? If
this sort of thing does exist somewhere else, I would really rather join that
than start another one.


As far as I know, Zoie does not provide it.
