[SQL] Surprising results with multiple levels of nesting

Hi

The following setup gives us a single document representing a country. The country has cities (nested), which has districts (nested), which has schools (nested).


DELETE countries

PUT countries
{
  "mappings": {
    "_doc": {
      "properties": {
        "cities": {
          "type": "nested",
          "properties": {
            "districts": {
              "type": "nested",
              "properties": {
                "name": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "schools": {
                  "type": "nested",
                  "properties": {
                    "name": {
                      "type": "text",
                      "fields": {
                        "keyword": {
                          "type": "keyword",
                          "ignore_above": 256
                        }
                      }
                    },
                    "students": {
                      "type": "long"
                    }
                  }
                }
              }
            },
            "name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}



POST countries/_doc
{
  "name": "ireland",
    "cities": [
    {
      "name": "dublin",
      "districts": [
        {
          "name": "dun laoghaire",
          "schools": [
            {
              "name": "tao nan school",
              "students": 101
            },
            {
              "name": "colaiste eoin",
              "students": 202
            }
          ]
        }
      ]
    },
    {
      "name": "cork",
      "districts": [
        {
          "name": "ballincollig",
          "schools": [
            {
              "name": "scoil san treasa",
              "students": 50
            }
          ]
        }
      ]
    }
  ]
}

This query matches a single school name.

POST _xpack/sql?format=txt
{
  "query": "select cities.districts.name from countries where cities.districts.schools.name = 'colaiste eoin'"
}

It returns all district names, which I can reason about as ES returning all district names in the doc that was matched.

cities.districts.name
---------------------
ballincollig         
dun laoghaire        

However, the following query, with the same where clause, returns just a single school.

POST _xpack/sql?format=txt
{
  "query": "select cities.districts.schools.name from countries where cities.districts.schools.name = 'colaiste eoin'"
}

This is inconsistent with the results above where all districts were returned. If the query above returned just the parent district of the school, or if this query returned all schools, I think both types of results could be deemed reasonable. But the semantics seem to change depending on the projection which makes it hard to reason about. (Hopefully this is just another manifestion of 33080)

cities.districts.schools.name
-----------------------------
colaiste eoin                

The following query, which changes the projection to students fails with a null pointer.

POST _xpack/sql?format=txt
{
  "query": "select cities.districts.schools.students from countries where cities.districts.schools.name = 'colaiste eoin'"
}
{
  "error": {
    "root_cause": [
      {
        "type": "null_pointer_exception",
        "reason": null
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "Partial shards failure",
    "phase": "fetch",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "countries",
        "node": "6gnUSCJ7R9i7zq_kJS0dfA",
        "reason": {
          "type": "null_pointer_exception",
          "reason": null
        }
      }
    ]
  },
  "status": 500
}

But using students in the where clause seems fine.

POST _xpack/sql?format=txt
{
  "query": "select cities.districts.schools.name from countries where cities.districts.schools.students = 50"
}
cities.districts.schools.name
-----------------------------
scoil san treasa             

Thank you for heavily testing the nested documents with ES-SQL.
The first issue uncovered I think it's in essence more or less related to the previous issue I opened a GH issue for. What is actually happening is that ES-SQL creates one bool query with two must statements:

        "bool": {
            "must": [
                {
                    "nested": {
                        "query": {
                            "term": {
                                "cities.districts.schools.name.keyword": {
                                    "value": "colaiste eoin"
                                }
                            }
                        },
                        "path": "cities.districts.schools"
                    }
                },
                {
                    "nested": {
                        "query": {
                            "match_all": {}
                        },
                        "path": "cities.districts",
                        "inner_hits": {

The problem with the query above is that the inner_hits comes from the match_all, not from the term query. So the results are not right. A more correct approach would be:

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "cities.districts",
            "query": {
              "nested": {
                "path": "cities.districts.schools",
                "query": {
                  "term": {
                    "cities.districts.schools.name.keyword": {
                      "value": "colaiste eoin"
                    }
                  }
                }
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}

Which is a more complex nested query where the inner_hits is at a different level and the nested query has in it another nested query. The world of possibilities involving nested queries seems to be rather complex.

Regarding the NPE, that one comes from Elasticsearch itself. And for that I've created this issue in Github.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.