Validate data before indexing it

Hello,

I wonder if there is a built-in way to validate data before indexing
it. I see two kinds of validation:

  • Structural validation of fields, based on a regular expression for
    example. Perhaps something can be configured in the mapping...
  • Integrity validation of a document, for example preventing a document
    from being indexed if one of its field values already exists in the index.

If there is no built-in support at the moment, is there a way to extend
Elasticsearch to add such processing before indexing, while still using
the standard REST calls?

Thanks very much for your help!


You can validate the data on the client side, in your model, before
serializing it to JSON, or check it after a complete bulk index run.
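
For instance, a minimal model-level check in plain Java (the field name and the rule are just assumptions for the example):

import java.util.regex.Pattern;

// Minimal sketch: validate the model on the client before it is ever
// serialized to JSON and sent to Elasticsearch.
public class MyDoc {

    private static final Pattern NAME_PATTERN = Pattern.compile("^[a-zA-Z]+$");

    private final String name;

    public MyDoc(String name) {
        this.name = name;
    }

    // Throws if the document must not be indexed.
    public void validate() {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("name must not be null or empty");
        }
        if (!NAME_PATTERN.matcher(name).matches()) {
            throw new IllegalArgumentException("name does not match " + NAME_PATTERN.pattern());
        }
    }

    public String getName() {
        return name;
    }
}

If validate() throws, the document is never serialized or sent to the cluster.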

There are reasons why Elasticsearch is schema-less. Being schema-less
means allowing any number of different fields (keys) and any content in
the fields (values), without any logical constraints.

In a distributed system, per-field commits, per-field transactions, or
integrity checking can get very expensive. Because the index is inverted,
and nodes can come and go, there is a significant penalty if you want
document transaction safety and document integrity checks.

I validate data in ES with the help of a large scan/scroll over the docs
after bulk indexing, checking whether the expected IDs exist or not. This
is different from integrity constraint checking techniques like the
rule-based methods known from RDBMSs.
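
A rough sketch of that check against the 1.x Java client (the index name "myindex" and the expectedIds set collected during the bulk run are assumptions for the example):

import java.util.HashSet;
import java.util.Set;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

// Sketch: after a bulk run, scan/scroll over the index, collect the IDs that
// actually made it in, and report the expected IDs that are missing.
public class BulkIndexCheck {

    public static Set<String> findMissingIds(Client client, Set<String> expectedIds) {
        Set<String> missing = new HashSet<>(expectedIds);
        SearchResponse resp = client.prepareSearch("myindex")
                .setSearchType(SearchType.SCAN)          // no scoring, just walk the docs
                .setScroll(new TimeValue(60000))
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(500)                            // hits per shard and scroll round trip
                .execute().actionGet();
        while (true) {
            resp = client.prepareSearchScroll(resp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute().actionGet();
            if (resp.getHits().getHits().length == 0) {
                break;                                   // scroll exhausted
            }
            for (SearchHit hit : resp.getHits()) {
                missing.remove(hit.getId());             // this ID exists in the index
            }
        }
        return missing;                                  // expected but not found
    }
}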

Jörg

Thanks very much, Jörg, for your answer! I see the approach...

I understand that having integrity checks in a schema-less engine like
Elasticsearch isn't possible. However, would it be possible to have
checks at the field structure level before triggering indexing in the ES
engine? Perhaps by specifying something like this in the mapping:

{
  "mappings": {
    "mydoc": {
      "properties": {
        (...)
        "name": {
          "type": "string", "store": "yes", "index": "analyzed",
          "checks": "not_null,not_empty,regexp=[a-zA-Z]$"
        },
        (...)
      }
    }
  }
}

Thanks very much for your help!
Thierry

I think that if this must be server-side, it could be done in a REST filter
plugin, not in the mappings. Matching values against patterns does not
relate to field mappings.
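
Roughly what I have in mind, against the 1.x plugin API (class and method names from memory, so treat this as an untested sketch; the field name, pattern and path are just placeholders, and registration of the filter via RestController#registerFilter is omitted):

import java.util.Map;
import java.util.regex.Pattern;

import org.elasticsearch.common.xcontent.XContentHelper;
import org.elasticsearch.rest.BytesRestResponse;
import org.elasticsearch.rest.RestChannel;
import org.elasticsearch.rest.RestFilter;
import org.elasticsearch.rest.RestFilterChain;
import org.elasticsearch.rest.RestRequest;
import org.elasticsearch.rest.RestStatus;

// Untested sketch: a REST filter that rejects document writes whose "name"
// field fails a pattern, before the request ever reaches the index code.
public class ValidationRestFilter extends RestFilter {

    private static final Pattern NAME_PATTERN = Pattern.compile("^[a-zA-Z]+$");

    @Override
    public void process(RestRequest request, RestChannel channel, RestFilterChain filterChain) throws Exception {
        boolean isDocWrite = (request.method() == RestRequest.Method.PUT
                || request.method() == RestRequest.Method.POST)
                && request.path().startsWith("/myindex/mydoc")
                && request.hasContent();
        if (isDocWrite) {
            Map<String, Object> source = XContentHelper.convertToMap(request.content(), false).v2();
            Object name = source.get("name");
            if (name == null || !NAME_PATTERN.matcher(name.toString()).matches()) {
                channel.sendResponse(new BytesRestResponse(RestStatus.BAD_REQUEST,
                        "field [name] failed validation"));
                return;                                  // do not let the request through
            }
        }
        filterChain.continueProcessing(request, channel);
    }
}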

Still, the best place to filter out unwanted values is the client side,
either at JSON construction time in the official clients or in the
XContentBuilder calls in Java. The reason is to avoid putting extra load
on the data nodes that is not related to indexing or search.
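
For instance, a small sketch with XContentBuilder (field name and rule invented for the example):

import java.io.IOException;
import java.util.regex.Pattern;

import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

// Sketch: apply the checks while building the JSON with XContentBuilder,
// so an invalid document never leaves the client.
public class DocBuilder {

    private static final Pattern NAME_PATTERN = Pattern.compile("^[a-zA-Z]+$");

    public static XContentBuilder buildMyDoc(String name) throws IOException {
        if (name == null || !NAME_PATTERN.matcher(name).matches()) {
            throw new IllegalArgumentException("invalid name: " + name);
        }
        return XContentFactory.jsonBuilder()
                .startObject()
                .field("name", name)
                .endObject();
    }
}

The returned builder can then be passed to prepareIndex(...).setSource(...) as usual.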

Jörg

Hi Jörg,

Following our conversation, I implemented a data validation filter. I
describe my approach in a blog post: "Implementing data validation in
ElasticSearch" on Sandbox for the Web stack.

I'd be pleased to have your feedback and comments on this! Thanks very much!

Thierry
