[Announce] Scrutineer for ElasticSearch - Detecting inconsistencies in your data

Paul_Smith · November 14, 2011, 1:50am

Just wanted to let everyone know that we (Aconex) have open-sourced an
ASL2-licensed utility that looks for inconsistencies between the
information stored in ElasticSearch and a JDBC database (with extension
points for other sources).

You can find the project here: https://github.com/Aconex/scrutineer

First off: this tool was NOT developed because ElasticSearch is buggy,
NOOOO, this is because inconsistencies can happen. If you rely on the data
stored in ElasticSearch being accurate (that is, your client application is
sending it the right info) then Scrutineer can help. It is particularly
useful for applications where Near-Real-Time (NRT) indexing is part of the
use case.

Scrutineer compares the Version property stored in each ElasticSearch
record with the matching one from your source of truth (say, a DB) and
reports inconsistent state and missing records. It relies on you indexing
this Version property using the VersionType.EXTERNAL flag.
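
To make the comparison concrete, here is a minimal sketch of the idea, not Scrutineer's actual code. It assumes both sides can be read as (id, version) pairs sorted by id; the function name and the three discrepancy labels are made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Record = Tuple[str, int]  # (record id, version); ids must sort the same way on both sides
Discrepancy = Tuple[str, str, Optional[int], Optional[int]]


def compare_streams(source: Iterable[Record], index: Iterable[Record]) -> Iterator[Discrepancy]:
    """Walk two id-sorted streams in step and yield
    (kind, id, source_version, index_version) for every discrepancy."""
    src_iter, idx_iter = iter(source), iter(index)
    src, idx = next(src_iter, None), next(idx_iter, None)
    while src is not None or idx is not None:
        if idx is None or (src is not None and src[0] < idx[0]):
            # Present in the source of truth but absent from the index.
            yield ("missing_in_index", src[0], src[1], None)
            src = next(src_iter, None)
        elif src is None or idx[0] < src[0]:
            # Present in the index but no longer in the source of truth.
            yield ("not_in_source", idx[0], None, idx[1])
            idx = next(idx_iter, None)
        else:
            # Same id on both sides: only the versions can disagree.
            if src[1] != idx[1]:
                yield ("version_mismatch", src[0], src[1], idx[1])
            src = next(src_iter, None)
            idx = next(idx_iter, None)


# Hypothetical data: what the DB query and the index scan might return.
db_rows = [("1", 3), ("2", 1), ("4", 7)]
es_rows = [("1", 3), ("2", 2), ("3", 5)]
report = list(compare_streams(db_rows, es_rows))
```

Because both streams are consumed in a single pass, memory use stays flat no matter how many records are compared, provided each side can deliver its rows already sorted.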

Scrutineer can be used in many cases where a full reindex would be very
costly, such as:

Detecting and reindexing the 1-50% of your index that is inconsistent may
be quicker than performing a full reindex. Once the error rate goes past
50% it may just be quicker to reindex everything. Your mileage may vary.

Reindex preparation - You could prepare a new copy of your index on your
cluster (or a different one) using a snapshot of your production DB (to
minimise adverse load/disruptions to production customers), then use
Scrutineer to quickly find and index what's changed since that DB snapshot,
then switch your index alias to the new one.

"Filesystem" scrubber - Run regularly in production to find any errors and
hand them off to another tool to fix before they affect your customers.

Disaster Recovery - If you copy a gateway snapshot to a DR location
periodically to keep it closely in sync with your DB, Scrutineer can be
used at Disaster time (heaven forbid) to make sure your index is in sync
with your recovered DB.

Database rollback - If you ever had to take your database 'back in time'
due to a catastrophic failure, you could recover your ElasticSearch index
state quickly by 'rolling back' changes: deleting records that are no
longer in the DB and freshening stale entries back to their earlier states.
This may be much faster than a full reindex, depending on how far back you
have to go.
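
The trade-off in the cases above could be sketched roughly as follows. This is an illustration only: the function, the discrepancy labels and the result layout are hypothetical, and the 0.5 default simply encodes the 50% rule of thumb from the list, which you would tune for your own setup.

```python
from typing import Dict, List, Tuple


def plan_repair(discrepancies: List[Tuple[str, str]],
                record_count: int,
                threshold: float = 0.5) -> Dict[str, object]:
    """Decide between a targeted fix and a full reindex.

    discrepancies: (kind, id) pairs, where kind is one of the hypothetical
    labels 'missing_in_index', 'version_mismatch' or 'not_in_source'.
    record_count: total number of records that were verified.
    """
    # Past the threshold (or with nothing verified) a full rebuild is assumed cheaper.
    if record_count <= 0 or len(discrepancies) / record_count > threshold:
        return {"strategy": "full_reindex", "reindex": [], "delete": []}

    # Stale or missing entries are re-sent from the source of truth.
    reindex = [doc_id for kind, doc_id in discrepancies
               if kind in ("missing_in_index", "version_mismatch")]
    # Entries the source no longer has are deleted (the 'rolling back' case).
    delete = [doc_id for kind, doc_id in discrepancies if kind == "not_in_source"]
    return {"strategy": "targeted", "reindex": reindex, "delete": delete}


plan = plan_repair([("version_mismatch", "2"), ("not_in_source", "3")], record_count=10)
```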

Scrutineer is pretty fast: for reference, it took 3.5 seconds to verify
275k records from a 2-node ES cluster (with vanilla config) against a
database table.

There's a tarball download here:
https://github.com/Aconex/scrutineer/downloads

If you have any questions, please let us know.

cheers,

Paul Smith

dadoonet · November 14, 2011, 6:54am

Hi Paul,

That's a really nice tool you've given us!
I will play with it this week.

Is there any way to bypass the version check?
My use case doesn't need a version check.

If not, I will have a look and try to submit a pull request.

Cheers,
David
@dadoonet

Paul_Smith · November 14, 2011, 6:57am

Just set your version property in ES to 0 then, and return a literal 0
from the db query for the 2nd column. That should work.
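
For the database half of that workaround, a query along these lines would do. This is only a sketch: the `documents` table is hypothetical and an in-memory SQLite database stands in for whatever JDBC source you actually use. The point is that the second column is always the literal 0, so it matches a version of 0 stored on the ES side.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents (id, title) VALUES (?, ?)",
    [(2, "second"), (1, "first")],
)

# First column: the record id. Second column: a constant 0 instead of a real version.
rows = conn.execute(
    "SELECT id, 0 AS version FROM documents ORDER BY id"
).fetchall()
conn.close()
```

With every version equal on both sides, only missing or extra records can show up as differences.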

kimchy · November 15, 2011, 7:06am

Great job Paul! Thanks for sharing this with the community. The effort of
making your own in-house project open source comes with a cost; thanks for
going through it.
