FYI, having left Elastic I’m not working on the PR any more. It was a potentially big change and there was reluctance because it might represent the start of a very slippery slope of adding all kinds of validation (arrays, enums, acceptable numeric value ranges, etc.).
In many businesses, Elasticsearch gets fired at with all kinds of data streams that evolve constantly but still need to be captured. To cope with that, leniency is valued more than strictness.
The lesson I take from proposing this change is that there’s a policy line drawn around how much validation Elasticsearch mappings should attempt declaratively vs. what is left to application-specific code.
While I agree that mappings shouldn’t validate everything, I think the vast majority of content stored in Elasticsearch consists of single-valued fields, and there should be a simple option to validate that on the way in. (It would have to be opt-in, because the default has always supported multiple values and too many things would break if Elasticsearch changed that.)
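To make that concrete, here is a rough sketch of what the opt-in could look like. The `single_value` flag is invented for illustration (nothing like it exists in Elasticsearch today); the rest is an ordinary mapping:

```python
# Hypothetical mapping: "single_value" is an invented, opt-in flag that
# would make Elasticsearch reject arrays for this field at index time.
mapping = {
    "properties": {
        "status": {
            "type": "keyword",
            "single_value": True,  # invented parameter, illustration only
        },
        # Default behaviour stays unchanged: scalar or array both accepted.
        "tags": {"type": "keyword"},
    }
}
```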
Yes, to me there is a big gap between validating the data content and validating the data type itself.
If I may compare with any programming language: an array is a data type and must be explicitly declared as such. All programmers know that. And this distinction should be part of the data schema (here, the mapping).
Data content validation is another story.
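To illustrate the gap: nothing in an Elasticsearch mapping declares whether a field holds one value or many, so the same `keyword` mapping accepts both shapes. A minimal sketch, assuming a local cluster and the 8.x Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# One mapping; there is no way to declare "tag is (not) an array".
es.indices.create(
    index="demo",
    mappings={"properties": {"tag": {"type": "keyword"}}},
)

# Both documents are accepted under the same mapping.
es.index(index="demo", id="1", document={"tag": "a"})
es.index(index="demo", id="2", document={"tag": ["a", "b"]})
```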
Mixing arrays and scalar values is fine in most Elasticsearch usage (search, aggregations, ...), but it fails when using Apache Spark, which is a bug for us because we use the official Elastic Spark "driver" (Spark expects a fixed schema, so a field that arrives sometimes as a string and sometimes as an array of strings => kaboom).
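For what it’s worth, the workaround we know of sits on the Spark side rather than in the mapping. A sketch, assuming PySpark with the elasticsearch-hadoop connector on the classpath, and reusing the `demo`/`tag` names from the sketch above: the `es.read.field.as.array.include` setting tells the connector to always treat the listed fields as arrays, so the schema no longer depends on which value Spark happens to see first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Force "tag" to be read as array<string> even when a document holds a
# single scalar value, avoiding the string-vs-array schema clash.
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "localhost:9200")
    .option("es.read.field.as.array.include", "tag")
    .load("demo")
)
df.printSchema()
```

That masks the symptom for fields you know about, but it does nothing for fields you did not anticipate, which is why validation at index time would still help.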
I forget exactly where, but there are already parts of Elasticsearch (data streams? transforms?) where content is grouped or routed by a chosen field that absolutely has to be single-valued and not an array.