Unclear usage of es.read.field.as.array.exclude


(Efst) #1

Hello, I am trying to use the setting es.read.field.as.array.exclude.

From its name, I understand that it must produce the opposite result of es.read.field.as.array.include.
es.read.field.as.array.include transforms the type of a field to an array, so it seems reasonable that the exclude would transform the type of a field from an array to its element type.

The documentation states that:

es.read.field.as.array.exclude
Fields/properties that should be considered as arrays/lists

I think that this might just be a copy-paste error.

I ran an experiment and created an index with the following mapping:

{
    "test_index": {
        "mappings": {
            "test_type": {
                "dynamic": "strict",
                "_all": {
                    "enabled": false
                },
                "properties": {
                    "outer": {
                        "type": "nested",
                        "properties": {
                            "inner1": {
                                "type": "boolean"
                            },
                            "inner2": {
                                "type": "keyword"
                            }
                        }
                    }
                }
            }
        }
    }
}

Nested fields are transformed by default to arrays but in my case the field is just a struct.

Loading the data with the option
set(ConfigurationOptions.ES_READ_FIELD_AS_ARRAY_EXCLUDE, "outer")

gives me the following schema:

root
 |-- outer: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- inner1: boolean (nullable = true)
 |    |    |-- inner2: string (nullable = true)

As you can see, the field outer has the type array despite the fact that it was part of the exclusions.

Am I missing something?
How does ConfigurationOptions.ES_READ_FIELD_AS_ARRAY_EXCLUDE work?

I'm using Spark 2.3.2, Elasticsearch 6.5.0, and elasticsearch-hadoop 6.5.0.


(James Baiera) #2

@markoutso this is a great question. Hopefully I can shed some light on what's going on here.

Elasticsearch has no concept of differentiating fields based on how many values they contain. Every field is treated as an array of values at the lowest levels in the search engine, with singular values becoming singleton arrays. Even so, the data that end users (the ES-Hadoop library included) interact with is JSON. A keyword field can have documents in the same index each with values of different formats, like "test", ["test"], and ["test", "test2"]. Until ES-Hadoop actually reaches and reads a document in which the field contains an array value, there is no way to know that the field should be represented as an array in the schema. If the schema declares a single value and an array shows up later, the resulting error is a confusing class cast exception from Spark's Catalyst optimizer. This is why ES-Hadoop relies on users to say when a field should be read as an array.
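
As a rough illustration of the point above (plain Python, not ES-Hadoop code; the documents, the field name `tag`, and the `values_of` helper are made up for this sketch), the same keyword field can arrive in three JSON shapes, and only reading the documents reveals that a list type is needed:

```python
import json

# Three hypothetical documents from the same index.
# Assume "tag" is mapped as keyword in every one of them.
raw_docs = [
    '{"tag": "test"}',
    '{"tag": ["test"]}',
    '{"tag": ["test", "test2"]}',
]


def values_of(doc, field):
    """Return the field's values as a list, wrapping a single value in a singleton list."""
    value = doc[field]
    return value if isinstance(value, list) else [value]


docs = [json.loads(raw) for raw in raw_docs]

# The JSON shapes differ even though the mapping type is identical.
shapes = [type(doc["tag"]).__name__ for doc in docs]

# Normalising every value to a list gives one consistent shape.
normalised = [values_of(doc, "tag") for doc in docs]

print(shapes)      # ['str', 'list', 'list']
print(normalised)  # [['test'], ['test'], ['test', 'test2']]
```

A schema inferred from the first document alone would declare a plain string, which is exactly the mismatch that later surfaces as a cast error.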

The ES_READ_FIELD_AS_ARRAY_EXCLUDE configuration specifically is only for fields that can be considered singular OR array values. Field exclusion is more specific than field inclusion: if you add a pattern that matches fields in the include setting, the exclude setting is checked and respected first in order to remove fields that shouldn't really match the pattern.

This all changes when reading nested data. In the example that you have provided, the outer field is marked as a nested field. Nested fields are always treated like an array of objects in ES-Hadoop; that is pretty much the only sensible reason to use them instead of an object field. Because we are able to discern from the mapping that the field should be treated as an array (nested vs object), we don't even check the include or exclude configurations to see if they match. If a nested field were marked as excluded but ended up having more than one nested object (which is the whole point of the field), you'd end up with those same obtuse class cast exceptions in Spark's Catalyst optimizer when it gets an Array[String] instead of the expected String, or vice versa.
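
The decision order described above can be sketched as a small Python model. This is only an illustration of the explanation in this post, not ES-Hadoop's actual implementation: the function name `read_as_array` is hypothetical, and glob matching via `fnmatch` is an assumption standing in for whatever pattern syntax the connector really uses.

```python
from fnmatch import fnmatchcase


def read_as_array(field, mapping_type, include=(), exclude=()):
    """Hypothetical model of the decision described above; not ES-Hadoop's code."""
    # Nested fields are always arrays; include/exclude are never consulted.
    if mapping_type == "nested":
        return True
    # Exclusions are checked first and win over inclusions.
    if any(fnmatchcase(field, pattern) for pattern in exclude):
        return False
    # Otherwise the field is an array only if the user asked for it.
    return any(fnmatchcase(field, pattern) for pattern in include)


# The case from this thread: excluding a nested field has no effect.
print(read_as_array("outer", "nested", exclude=["outer"]))  # True

# Exclude carves an exception out of a broader include pattern.
print(read_as_array("tags.raw", "keyword", include=["tags.*"], exclude=["tags.raw"]))  # False
print(read_as_array("tags.all", "keyword", include=["tags.*"], exclude=["tags.raw"]))  # True

# With no settings, a non-nested field is read as a single value.
print(read_as_array("title", "keyword"))  # False
```

Under this model the exclude setting never turns an array back into a scalar on its own; it only narrows what the include setting selects.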

Hope this helps!


(Efst) #3

@james.baiera Thanks for your very thorough explanation.
Everything you say makes sense to me, and I understand why Elasticsearch behaves the way it does.

If one needs to store just one value, then the proper type to use is object and not nested.
In my case, though, the damage is done, and from what I know (correct me if I am wrong) changing the mapping is a painful process.

The solution was to transform the DataFrame by extracting the value out of the singleton list and changing the row schema, but that also has a cost.
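
The shape of that transformation, shown on a plain Python dict rather than a Spark DataFrame so that it runs without a cluster (the `unwrap_singleton` helper and the sample row are hypothetical; in Spark the same idea would be applied to the array column, for example by selecting its first element):

```python
def unwrap_singleton(row, field):
    """Return a copy of row with row[field] replaced by its only element."""
    values = row[field]
    if values is None:
        # Nothing to unwrap; keep the missing value as it is.
        return dict(row)
    if len(values) != 1:
        # Guard against silently dropping data if the field is not a singleton.
        raise ValueError(f"expected exactly one element in {field!r}, got {len(values)}")
    unwrapped = dict(row)
    unwrapped[field] = values[0]
    return unwrapped


# A row shaped like the schema above: outer is an array holding one struct.
row = {"outer": [{"inner1": True, "inner2": "abc"}]}
flat = unwrap_singleton(row, "outer")

print(flat)  # {'outer': {'inner1': True, 'inner2': 'abc'}}
```

The explicit length check is a design choice for this sketch: it fails loudly if a document ever carries more than one nested object, instead of discarding the extra elements.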

This cost could have been avoided with a simple change in org.elasticsearch.hadoop.serialization.ScrollReader#read. Unfortunately there is no option to override this class.

In any case, I think it would be good to improve the documentation of es.read.field.as.array.exclude :)