[SQL] Queries against nested datatypes may be mis-translated

Hi

Taking the example from Nested datatype and adapting it to avoid using a reserved keyword (group), we have the following setup.

DELETE my_index

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "groupName" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

Issuing the following query returns no results, as expected.

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

However, issuing a syntactically very similar SQL query does return a result.

POST _xpack/sql?format=txt
{
  "query": "select groupName from my_index where user.first = 'Alice' and user.last = 'Smith'"
}

This happens because the above is /translated to a query with outer type bool rather than nested. I'm having a hard time convincing myself that ES is doing the right thing here, and I'd argue that the intent of the above is mistranslated, and that the query above should try to find a nested doc matching both predicates.

Paul

Hi @paulcarey,

This is an interesting scenario, but I don't know if we can do any better in this case. Having a bool with two nested statements in it (like it is behaving now), or a root nested query that has a bool in it with two term statements, how could these be differentiated at SQL query level?
No matter how you do it, the WHERE part has to have the form user.first='Alice' AND user.last='Smith'.

Agreed, and on reflection I think ES is doing the right thing here. I think there are three possible queries:

Retrieve documents where:

  • for a given parent doc, any combination of nested docs matches all of the nested doc predicates
    • currently implemented using and
    • select * from my_index where user.first = 'Alice' and user.last = 'Smith'
  • the parent doc matches any of the nested doc predicates
    • currently implemented using or
    • select * from my_index where user.first = 'Alice' or user.last = 'bar'
  • for a given parent doc, a single nested doc matches all of the nested doc predicates
    • not currently supported
    • as a suggested syntax, it could potentially be supported with a correlated subquery
      • select * from my_index where _id in (select _id from my_index.users where user.first = 'Alice' and user.last = 'Smith')
    • alternatively, it could be supported with a function or syntax, but this likely wouldn't play nicely with aggregations
      • select * from my_index where single_nested_doc(user.first = 'Alice' and user.last = 'Smith')

Incidentally, in exploring this issue I've encountered an issue where changing the projection changes the results.

Original query

POST _xpack/sql?format=txt
{
  "query": "select groupName from my_index where user.first = 'Alice' or user.last = '???'"
}

Original result

 groupName   
 ---------------
 fans           

Modified query with additional field of user.first

  POST _xpack/sql?format=txt
  {
    "query": "select groupName, user.first from my_index where user.first = 'Alice' or user.last = '???'"
  }                  

Result of modified query

     groupName   |  user.first   
  ---------------+---------------

This is surely a bug as the same single doc should be returned in these results.

One more wrinkle / bug.

When selecting all fields, a row is output for each nested doc.

POST _xpack/sql?format=txt
{   
  "query": "select groupName, user.first, user.last from my_index"
}   

groupName   |  user.first   |   user.last   
---------------+---------------+---------------
fans           |Alice          |White              
fans           |John           |Smith              

But if we add a where clause that's satisifed by a combination of the nested docs, then only a single row for a single nested doc is returned.

POST _xpack/sql?format=txt
{   
  "query": "select groupName, user.first, user.last from my_index where user.first = 'Alice' and user.last = 'Smith'"
}   


groupName   |  user.first   |   user.last   
---------------+---------------+---------------
fans           |John           |Smith              

This doesn't seem right, particularly as one of the predicates which must have been satisfied referred to 'Alice'.

@paulcarey would you mind creating an issue in github for an enhancement regarding nested queries?
This probably won't be on our top priority list, but it gives us an idea for future improvements that might be considered.

Thanks.

Sorry for the delay @paulcarey. I looked at these issues and the first one (with different projection the document disappears from the results) it's indeed a bug. The idea is that both nested queries that get created are for the same path and both use inner_hits. If one of the inner_hits doesn't return anything, the overall result is nothing, because they clash. It's a bit more complicated, but I hope now it's just a bit more clear.

The idea is to name each inner_hits statement so that the result can differentiate between them. And at the moment, ES-SQL doesn't do this.

And in the second scenario you mentioned it's basically the same root cause: clash between the two inner_hits and only one wins. I've created https://github.com/elastic/elasticsearch/issues/33079 and https://github.com/elastic/elasticsearch/issues/33080 to cover these issues.

Thank you for reporting them.

Great, many thanks for digging into these.

Regarding 33079, I was wondering if JsonPath or some variant of it had been considered as a way to define complex queries? For example, the Alice White query I mentioned above could be satisfied with:

$.user[?(@.first == 'Alice' && @.last == 'White')]

This can be tested on jsonpath.herokuapp.com.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.