Mapping: how to disallow arrays of strings?

We want to apply a truly strict mapping, but arrays are still accepted and they mess up our data pipeline (we later use Apache Spark to read the data).

We (at our team level) have no control over the incoming data.

How can we specify a truly strict mapping that rejects such documents?

PUT /poc-mapping
{
  "settings": {
    "number_of_shards": 1,
    "index" : {
      "mapping" : {
        "ignore_malformed" : false
      }
    }
  }, 
  "mappings": {
    "dynamic":"strict",
    "properties": {
      "request" : {
        "properties": {
          "uri" : {
            "type" : "keyword"
          }
        }
      }
    }
  }
}

# This should not work
POST /poc-mapping/_doc/1
{
  "request" : {
    "uri" : ["nop"]
  }
}

When you use strict:

If new fields are detected, an exception is thrown and the document is rejected. New fields must be explicitly added to the mapping.

I believe it will only reject a document when a field is not mapped, in which case the POST below will fail (the field request.url is not mapped).

POST /poc-mapping/_doc/1
{
  "request" : {
    "url" : ["nop"]
  }
}

Yes, but the field request.uri is mapped, and I want it as a real string, not an array of strings.

I get the feeling that this is not possible :confused:

It looks like an array of concrete values (string / number...) is treated like a single concrete value.
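
To illustrate (a quick check against the index above): the "bad" document from the example is indexed without any error, and fetching it returns the array untouched in _source.

# The array was accepted; _source still contains ["nop"]
GET /poc-mapping/_doc/1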

I'm not aware of any mapping feature to disallow arrays. Also, how would you want to treat that scenario — reject the document?

My first idea for a workaround would be an index.default_pipeline that checks for multiple values in a field and then either fails the document or removes some elements. Would that work?
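
A minimal sketch of that idea, using a conditional fail processor (the pipeline id reject-array-uri and the error message are made up here for illustration; it assumes request.uri is the only field to guard):

# Sketch: fail any document where request.uri arrives as an array
PUT _ingest/pipeline/reject-array-uri
{
  "description": "Reject documents where request.uri is an array",
  "processors": [
    {
      "fail": {
        "if": "ctx.request?.uri instanceof List",
        "message": "request.uri must be a single value, not an array"
      }
    }
  ]
}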

So sad we can't guarantee data integrity by using elasticsearch.

As a workaround, we have started to develop an ingestion service in front of elasticsearch, but I would prefer a solution 100% within elasticsearch mappings (easier to replicate, etc.).

Especially with Apache Spark, data integrity is very important; I am really surprised this is not implemented.

I guess the data integrity view depends on your expectations; there are no constraints like NOT NULL either but I'm not sure those are really what you want to use a search engine for.

BTW we almost got this feature: New field mapping flag - allow_multiple_values by markharwood · Pull Request #80289 · elastic/elasticsearch · GitHub
And there's also a bigger meta-issue around the topic: [Meta] Better handling of single-valued fields · Issue #80825 · elastic/elasticsearch · GitHub

In the end, I'd look into a default ingest pipeline. Feels reasonable to me and also as close to the destination as possible.
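
For completeness, a sketch of wiring it up (assuming a pipeline like the reject-array-uri one sketched above): make it the index default so every indexing request runs through it, and the array document from the first post gets rejected at ingest time.

# Attach the pipeline as the index default
PUT /poc-mapping/_settings
{
  "index.default_pipeline": "reject-array-uri"
}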

Such a big PR!! Thank you, I didn't find this issue.

FYI, having left Elastic I'm not working on the PR any more. It was a potentially big change, and there was reluctance because it might represent the start of a very slippery slope of adding all kinds of validation (arrays, enums, acceptable numeric value ranges, etc.).
In many businesses, elasticsearch gets fired at with all kinds of data streams that evolve constantly but still need to be captured. To cope with that, leniency is valued more than strictness.
The lesson I take from proposing this change is that there's a policy line drawn between elasticsearch mappings performing validation declaratively and application-specific code doing it.

While I agree that mappings shouldn't validate everything, I think the vast majority of content stored in elasticsearch consists of single-valued fields, and there should be a simple option to validate that on the way in. (It would have to be opt-in, because the default has always supported multiple values and too many things would break if elasticsearch changed that.)

Yes, to me there is a big gap between validation of the data content and of the data type itself.

If I can compare with any programming language: an array is a data type and must be explicitly declared as such. All programmers know that. And this difference must be part of the data schema (here, the mapping).

Then data content validation is another story.

Mixing arrays and concrete values is fine in most elasticsearch usage (search, aggregations...), but it fails when using Apache Spark, which is a bug for us because we use the official Elastic Spark "driver" (Spark expects a string and then receives an array of strings => kaboom).

I forget exactly where, but there are already parts of elasticsearch (data streams? transforms?) where content is grouped or routed by a chosen field which absolutely has to be single-valued and not an array.
