Strange results when querying nested objects


#1

Elasticsearch version: 2.3.3
Plugins installed: no plugin
JVM version: 1.8.0_91
OS version: Linux version 3.19.0-56-generic (Ubuntu 4.8.2-19ubuntu1)

I get strange results when I query nested objects on multiple paths. I want to search for all female with dementia. And there are matched patients among the results. But I also get other diagnoses I'm not looking for, the diagnoses related to these patients.

For example, I also get the following diagnoses despite the fact that I looked only for dementia.

  • Mental disorder, not otherwise specified
  • Essential (primary) hypertension

Why is that?
I want to get only female with dementia and don't want other diagnoses.

Client_Demographic_Details contains one document per patient. Diagnosis contains multiple documents per patient. The ultimate goal is to index my whole data from PostgreSQL DB (72 tables, over 1600 columns in total) into Elasticsearch.

Query:

{'query': {
       'bool': {
           'must': [
               {'nested': {
                   'path': 'Diagnosis',
                   'query': {
                       'bool': {
                           'must': [{'match_phrase': {'Diagnosis.Diagnosis': {'query': "dementia"}}}]
                       }  
                   }
               }},
               {'nested': {
                   'path': 'Client_Demographic_Details',
                   'query': {
                       'bool': {
                           'must': [{'match_phrase': {'Client_Demographic_Details.Gender_Description': {'query': "female"}}}]
                       }  
                   }
               }}
           ]
       }
    }}

Results:

{
  "hits": {
    "hits": [
      {
        "_score": 3.4594634, 
        "_type": "Patient", 
        "_id": "72", 
        "_source": {
          "Client_Demographic_Details": [
            {
              "Gender_Description": "Female", 
              "Patient_ID": 72, 
            }
          ], 
          "Diagnosis": [
            {
              "Diagnosis": "F00.0 -  Dementia in Alzheimer's disease with early onset", 
              "Patient_ID": 72, 
            }, 
            {
              "Patient_ID": 72, 
              "Diagnosis": "F99.X -  Mental disorder, not otherwise specified", 
            }, 
            {
              "Patient_ID": 72, 
              "Diagnosis": "I10.X -  Essential (primary) hypertension", 
            }
          ]
        }, 
        "_index": "denorm1"
      }
    ], 
    "total": 6, 
    "max_score": 3.4594634
  }, 
  "_shards": {
    "successful": 5, 
    "failed": 0, 
    "total": 5
  }, 
  "took": 8, 
  "timed_out": false
}

Mapping:

{
  "denorm1" : {
    "aliases" : { },
    "mappings" : {
      "Patient" : {
        "properties" : {
          "Client_Demographic_Details" : {
            "type" : "nested",
            "properties" : {
              "Patient_ID" : {
                "type" : "long"
              },
              "Gender_Description" : {
                "type" : "string"
              }
            }
          },
          "Diagnosis" : {
            "type" : "nested",
            "properties" : {
              "Patient_ID" : {
                "type" : "long"
              },
              "Diagnosis" : {
                "type" : "string"
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1473974457603",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "Jo9cI4kRQjeWcZ7WMB6ZAw",
        "version" : {
          "created" : "2030399"
        }
      }
    },
    "warmers" : { }
  }
}

(Mark Harwood) #2

If your document represents a single patient why is Client_Demographic_Details an array? Do you deliberately allow for multiple IDs and genders for a patient?


#3

Matched doc on nested will be inside inner hits and rest in source.
This working query from here http://stackoverflow.com/questions/39527012/strange-results-when-querying-nested-objects

{
  "_source": {
    "exclude": [
      "Client_Demographic_Details",
      "Diagnosis"
    ]
  },
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "Diagnosis",
            "query": {
              "bool": {
                "must": [
                  {
                    "match_phrase": {
                      "Diagnosis.Diagnosis": {
                        "query": "dementia"
                      }
                    }
                  }
                ]
              }
            },
            "inner_hits": {}
          }
        },
        {
          "nested": {
            "path": "Client_Demographic_Details",
            "query": {
              "bool": {
                "must": [
                  {
                    "match_phrase": {
                      "Client_Demographic_Details.Gender_Description": {
                        "query": "female"
                      }
                    }
                  }
                ]
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}

#4

No, there is only one patient id per patient.


(Mark Harwood) #5

So no need for arrays or "nested" mappings for that field then.

Technically speaking you only need to make the diagnoses "nested" too if your queries will test more than one property of each object each e.g. (psuedo code) description:dementia AND diagnoser:DrSmith. However your example was only querying the diagnosis description so if that's all you do there's no harm in just having a non-nested property for diagnosis.


#6

I want to run boolean queries to test multiple properties from multiple nested objects. I plan to have 72 nested objects per patient. Each nested object contains data from the related PostgreSQL table.

So, you are saying that I need to have Client_Demographic_Details not nested and Diagnosis as nested object in the same document, aren't you? What benefits does the solution has against "all nested" objects solution that I have?


(Mark Harwood) #7

If there is only ever one Client_Demographic_Details object then that definitely does not need to be "nested".
Even with your diagnoses as non-nested you can happily query for people who have demograpic gender:female AND diagnosis:dementia.
The reason you reach for "nested" is only if you have criteria that needs to test more than one term in individual diagnosis objects. This is the "cross-matching" problem I illustrate in this example [1] using students and their examination results - the query is for people who have an exam result with a particular subject AND grade. Without "nested" the grades and subjects for a person's exams are muddled.
Hope this makes sense.
Because users have a hard time articulating complex nested logic, sometimes a "flat" system is preferable and users may actually want the laxness of cross-matching rather than the stricter controls used for nested.

[1] http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene


#8

Ok @Mark_Harwood, I will make Client_Demographic_Details as root object, try it and let you know.


#9

I tried your solution, it works. I agree it looks nicer from the hyrarchy point of view.


(system) #10