Nested Fields in an Index

xef · December 2, 2024, 10:49am

We have an Index with a Nested field. Most of the documents will have their nested field null but some will have it populated.
What would the most efficient way of querying such index so that if there are no docs with nested fields we do not incur the overhead of having to look for the nested field values when they are all null.
i.e. At the moment the only thing we can think of is somethomg like:
query for the documents whose nested fileds are null OR the nested fields have x,y,zvalues.
Thanks

john-wagster · December 2, 2024, 7:44pm

May be worthwhile to understand your use-case a bit more (potentially with some examples) and what your primary concern is such as query speed. Some things that may complicate or change the answer here is if you delete documents or update documents frequently. And whether you can get away with object types instead of nested types.

If we assume a relatively simple case where we want to optimize for query speed and nested is the correct type for your use-case, then my gut reaction is that you have a few things you can try. If you take a look at the nested field docs, you have a couple (defaulted to false) optional fields include_in_parent and include_in_root both of which automatically copy the data to the parent or root document respectively as well for sake of allowing faster filtering. If the nested field though is large and copying is untenable for any reason my next immediate thought is you could explicitly add a field to the parent document indicating whether you have nested data or not, essentially a boolean field something like has_nested_data but you'd need to be careful about updating it appropriately if you ever go to update your documents. For me an additional boolean field there is extremely fast in query performance significantly more so than an exists query on a nested field.

Let me know if you play around with that or that doesn't quite fit your use case though I'd be keen to hear more and help out.

xef · December 4, 2024, 11:29pm

Hi
Thaks for your reply.
But using include_in_parent and include_in_root will flatten the data structure which defeats the whole purpose of having a nested field as we cannot crreate queries for each individual record within the nested field. Or am I wrong.

We were thonking of indexing the nested fields as actual documents and then create aggregates that combines them but that creates another problem!.
Your idea of a bool field is good but resukts in overheads of creating basically additional filters.
For the time being we are lost as to what would be the best solution.
Thanks for the time you spent in providing us with your reply.

john-wagster · December 5, 2024, 1:02am

include_in_parent and include_in_root will create copies of your data say at the root for instance (and leave your nested data alone so it won't collapse the nested data just create a copy of it at the root that is flattened). If that's sufficient for filtering then it's a nice option because you don't need to manage anything additional. A slightly more nuanced alternatively would be to consider a flat structure for your data instead of or in addition to the nested data rather than relying on include_in_root something that maintains the semantics of the data for the sake of filtering and query. It's hard to give you a relatable example without knowing more about your use-case but here's a basic example:

if you currently have a mapping like this:

  "mappings": {
    "properties": {
      "person": {
        "type": "nested",
        "properties": {
            "first": {
                "type": "keyword"
            },
            "last": {
                "type": "keyword"
            }
        }
      }
    }
  }

which would be some nested field person with a list of first and last names instead you could encode the names as full names and search on and display that data instead:

  "mappings": {
    "properties": {
      "person": {
        "properties": {
            "fullname": {
                "type": "keyword"
            }
        }
      }
    }
  }

A must or must_not with an exists query is about your only other option that I'm aware of if you don't want to have to maintain an extra boolean field or something like include_in_root won't work for you. The downside of the exists is it costs more than the boolean field because you have to go to the nested field document to check if it exists. But that's definitely a good place to start if you haven't tried to form up that query yet. It's the most efficient option without changing the mapping in any way.

Here's what that exists might look like:

given this mapping:

  "mappings": {
    "properties": {
      "my_top_level_nested_field": {
        "type": "nested",
        "properties": {
            "first": {
                "type": "keyword"
            },
            "last": {
                "type": "keyword"
            }
        }
      }
    }
  }

and this query:

    "query": {
        "nested": {
            "path": "my_top_level_nested_field",
            "query": {
                "bool": {
                    "must": [
                        {
                            "exists": {
                                "field": "my_top_level_nested_field"
                            }
                        }
                    ]
                }
            }
        }
    }

if my_top_level_nested_field is null or [] then those hits will be excluded from the query results, which based on your description ("some" documents having the nested field populated) may be what you are looking for to give you back the smaller number of documents that have some nested data.

xef · December 5, 2024, 9:11pm

Hi
You are obviously much more knowledgeable in Elastic than I am.
Lets say we have a top level document representing a class Development.


Class Development
{
  List<Types> PropertyTypes
  String Locality
  Geometry etc
}

Class Types
{ 
  string TypeName
  int Bedrooms
  int Livingrooms
  long Price

}

We need to be able to query the root document for say the locality AND also the individual nested docs. Say if a user wants to see the root document if it is located in "New York" AND has Types with 3 bedrooms that have a Price <300000

Also a doucment can be representd a **single property ** (i.e. not a development) which has bedrooms, living rooms and price etc but NO nested fields

So we need to get those documents at root level PLUS the root documents that have nested fields satisfyong the query.

john-wagster · December 11, 2024, 7:00pm

There's a few ways to think about this and they all have tradeoffs.

Usually when I'm thinking about settings up mappings. I am looking at 1 of 3 approaches. Either setup the mappings to be similar to what you might see in a relational database, set them up flattened or pre-joined (prior to loading the data), or set them up expecting to join them in the application layer. The immediate benefits are that relational-style data can be easier particularly if you are just getting comfortable with distributed databases and works well with smaller datasets. Pre-joined data is usually the most efficient but also the most difficult to change when inevitably your data model needs to change. Data joined at the application layer adds in overhead as joining at that time is often pretty expensive and so is really only recommended when a UX exists that supports / sets expectations of waiting for a result. I can't tell you what the right approach for your use-case is though. You'll have to assess that for yourself. It's based on several factors like what kind of hardware, how responsive is your application, how much data you have, etc.

Having said that here's how you would model the data for those three use-cases and then subsequently how you might query that data with some considerations for exists:

relational-like data model is very similar to your class structure above:

  "mappings": {
    "properties": {
      "geometry": {
          "type": "geo_shape"
      },
     "locality": {
    	  "type": "keyword"
      },
      "property_type": {
        "type": "nested",
        "properties": {
            "name": {
                "type": "keyword"
            },
            "bedrooms": {
                "type": "integer"
            },
            "livingrooms": {
                "type": "integer"
            },
            "price": {
                "type": "long"
            }            
        }
      }
    }
  }

It has the downsides we've discussed so far in that it's slightly more expensive to query exists, but it is more than capable of satisfying your queries intuitively.

Here's an example of that query you mentioned:

    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "locality": "New York"
                    }
                },                
                {
                    "nested": {
                        "path": "property_type",
                        "query": {
                            "bool": {
                                "filter": [
                                    {
                                        "term": {
                                            "property_type.bedrooms": 3
                                        }
                                    },
                                    {
                                        "range": {
                                            "property_type.price": {
                                                "lt": 300000
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }

You get back the nested listing document and the root level development document.

You don't need to do exists in the context of those queries and I wouldn't recommend it as the query engine under the hood is likely better without that clause. However, here's what an exists might looks like if you say want to get back all listing for a specific locality:

    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "locality": "New York"
                    }
                },                
                {
                    "nested": {
                        "path": "property_type",
                        "query": {
                            "bool": {
                                "filter": [
                                    {
                                        "exists": {
                                            "field": "property_type"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }

This to me is more than sufficient and great place to start if you don't have a ton of data. And in the future you can evaluate better or more efficient mappings. Don't let good enough get in the way of getting started.

However, this likely is not the fastest you could go. I mentioned previously that adding a boolean into that mapping would likely prevent exists queries from having to query the nested document. I would expect this to be most beneficial when say you have very few listings total but lots of developments and only want to return for some query on developments that have listings. An extra boolean at the root level would be an excellent optimization here and not super expensive to maintain. Otherwise don't worry too much about exists just make queries for what you need.

If you consider the two other architectural optimizations, pre-join and application-join, though you might consider this mapping instead that flattens or pre-joins data. In this case you would be storing data in the index at a listing level with all of the development information duplicated and for the sake of query performance that can definitely be worthwhile:

  "mappings": {
    "properties": {
      "geometry": {
          "type": "geo_shape"
      },
     "locality": {
  	   "type": "keyword"
       },
      "name": {
          "type": "keyword"
      },
      "bedrooms": {
          "type": "integer"
      },
      "livingrooms": {
          "type": "integer"
      },
      "price": {
          "type": "long"
      }
    }
  }

This is great for query speed for specifically listings OR (listings AND developments). However, if you need to search just developments then this adds a lot of overhead. In that case duplicating the data further into a development-only index makes sense. This maximizes the query performance. The downside is that there's more to maintain here in terms of how the data is loaded and how the mappings evolve. It sort of assumes you know your query use-cases well enough to do go ahead and pre-join in anticipation of those queries.

One great reason to consider a flattened listenings mapping like the one above AND a mapping that only contains developments that have at least one listing is if you have different views you need to drive in an application. Say you have an application entry-point with all of the developments that have listings as the default view. To make that as responsive as possible I would consider just an index of developments that have at least one listing and pre-join the data in your data processing to do so (as in insert / upsert to that index any time you see a listing for that development).

Alternatively you could take and model developments and listings as separate mappings and join them after performing two queries within your application logic. Frankly I don't typically recommend this unless you are building a reporting application with a lot of unknown queries and where slow performance is expected OR where the total data is relatively small in comparison to your application hardware. This definitely gives you the most flexible of the three options though.

xef · December 11, 2024, 10:21pm

Hi. Thanks very much but as mentioned the first option you have listed does not include properties that have toplevel bedrooms, livingrooms price etc without any nested fileds i.e. individual properties.

But thanks any way for all the time you ivested in this. We will just have to try and just take the best option.

By the way, if you can provide us a solution, we are quite happy to pay for it.

john-wagster · December 11, 2024, 11:09pm

Hi. Thanks very much but as mentioned the first option you have listed does not include properties that have toplevel bedrooms, livingrooms price etc without any nested fileds i.e. individual properties.

The first one doesn't need top level properties it should work just fine as a nested query I mentioned subsequently.

We will just have to try and just take the best option.

Completely agree; I would say from this point try some things and do some benchmarking and you may find there's a nice option that works well for you.

By the way, if you can provide us a solution, we are quite happy to pay for it.

While I can't provide a solution without a lot more information about your problem, applications, and roadmap. I also am not really in a position to do so. However, at Elastic we do have consulting services that I'm sure would be happy to dig into your problem and help out: Consulting Services for the Elastic Stack | Elastic Consulting. I've worked with them in past and been really happy. And I know there's several other companies that provide Elasticsearch consulting although I'm not immediately sure which ones are good or not.

xef · December 11, 2024, 11:39pm

Thank you and very apprciative of your inputs.

Topic		Replies	Views
Queries with nested fields and "having" clauses Elasticsearch	4	25	September 19, 2024
Are nested docs efficiently skipped? Elasticsearch	5	462	June 3, 2021
Nested queries (or secondary query) Elasticsearch	1	252	March 7, 2022
Find doucment with blank nested field Elasticsearch	5	1732	July 5, 2017
[Query DSL] Get documents which don't have a sepcific value (in array, object/nested) Elasticsearch	1	320	December 23, 2020

Nested Fields in an Index

Related topics