Nested Fields in an Index

john-wagster · December 11, 2024, 7:00pm

There are a few ways to think about this, and they all have tradeoffs.

Usually when I'm thinking about setting up mappings, I'm looking at one of three approaches: set up the mappings similar to what you might see in a relational database, set them up flattened or pre-joined (prior to loading the data), or set them up expecting to join them in the application layer. Relational-style data can be easier to work with, particularly if you are just getting comfortable with distributed databases, and it works well with smaller datasets. Pre-joined data is usually the most efficient to query but also the most difficult to change when your data model inevitably needs to change. Joining data at the application layer adds overhead, as joining at query time is often pretty expensive, so it is really only recommended when the UX supports or sets the expectation of waiting for a result. I can't tell you what the right approach for your use case is, though. You'll have to assess that for yourself based on several factors, like what kind of hardware you have, how responsive your application needs to be, how much data you have, etc.

Having said that, here's how you would model the data for those three approaches, and then how you might query that data, with some considerations for exists:

The relational-like data model is very similar to your class structure above:

  "mappings": {
    "properties": {
      "geometry": {
          "type": "geo_shape"
      },
     "locality": {
    	  "type": "keyword"
      },
      "property_type": {
        "type": "nested",
        "properties": {
            "name": {
                "type": "keyword"
            },
            "bedrooms": {
                "type": "integer"
            },
            "livingrooms": {
                "type": "integer"
            },
            "price": {
                "type": "long"
            }            
        }
      }
    }
  }

It has the downside we've discussed so far, in that exists queries are slightly more expensive, but it is more than capable of satisfying your queries intuitively.
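To make the shape concrete, here is a sketch of what one development document might look like under the nested mapping. All of the values (names, prices, coordinates) are made up for illustration; only the field names come from the mapping:

```json
{
  "locality": "New York",
  "geometry": { "type": "Point", "coordinates": [-73.98, 40.75] },
  "property_type": [
    { "name": "Type A", "bedrooms": 3, "livingrooms": 1, "price": 280000 },
    { "name": "Type B", "bedrooms": 2, "livingrooms": 1, "price": 210000 }
  ]
}
```

Each entry in the property_type array is indexed as its own nested document, which is why a nested query can require bedrooms and price to match on the same listing.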

Here's an example of that query you mentioned:

    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "locality": "New York"
                    }
                },                
                {
                    "nested": {
                        "path": "property_type",
                        "query": {
                            "bool": {
                                "filter": [
                                    {
                                        "term": {
                                            "property_type.bedrooms": 3
                                        }
                                    },
                                    {
                                        "range": {
                                            "property_type.price": {
                                                "lt": 300000
                                            }
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }

You get back the root-level development document, and its _source includes the nested listing documents.
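If you only want to see which listings actually matched, rather than every listing stored on the development, the nested query accepts an inner_hits option. A minimal sketch, reusing the same filters as above and leaving inner_hits empty so it takes the defaults:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "locality": "New York" } },
        {
          "nested": {
            "path": "property_type",
            "query": {
              "bool": {
                "filter": [
                  { "term": { "property_type.bedrooms": 3 } },
                  { "range": { "property_type.price": { "lt": 300000 } } }
                ]
              }
            },
            "inner_hits": {}
          }
        }
      ]
    }
  }
}
```

Each hit should then carry an inner_hits section with the matching property_type entries, alongside the development's _source.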

You don't need an exists clause in the context of those queries, and I wouldn't recommend one, as the query engine under the hood likely performs better without it. However, here's what an exists query might look like if you, say, want to get back all listings for a specific locality:

    "query": {
        "bool": {
            "filter": [
                {
                    "term": {
                        "locality": "New York"
                    }
                },                
                {
                    "nested": {
                        "path": "property_type",
                        "query": {
                            "bool": {
                                "filter": [
                                    {
                                        "exists": {
                                            "field": "property_type"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }

To me this is more than sufficient and a great place to start if you don't have a ton of data. In the future you can evaluate better or more efficient mappings. Don't let perfect get in the way of getting started.

However, this is likely not the fastest you could go. I mentioned previously that adding a boolean to that mapping would likely save exists queries from having to query the nested documents. I would expect this to be most beneficial when, say, you have very few listings in total but lots of developments, and you only want to return developments that have listings for some query. An extra boolean at the root level would be an excellent optimization here and not super expensive to maintain. Otherwise, don't worry too much about exists; just write queries for what you need.
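As a rough sketch of that optimization, assume you add a root-level field to the mapping, here hypothetically named has_listings with "type": "boolean", and have your indexing process set it whenever a development has at least one property_type entry. The field name is my own placeholder, not something from your model. The exists-style query then becomes two plain term filters with no nested clause:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "locality": "New York" } },
        { "term": { "has_listings": true } }
      ]
    }
  }
}
```

The tradeoff is that whatever writes the documents has to keep that flag in sync when listings are added or removed.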

If you consider the two other architectural options, pre-join and application-join, you might instead consider this mapping, which flattens or pre-joins the data. In this case you would be storing data in the index at the listing level with all of the development information duplicated, and for the sake of query performance that can definitely be worthwhile:

  "mappings": {
    "properties": {
      "geometry": {
          "type": "geo_shape"
      },
     "locality": {
  	   "type": "keyword"
       },
      "name": {
          "type": "keyword"
      },
      "bedrooms": {
          "type": "integer"
      },
      "livingrooms": {
          "type": "integer"
      },
      "price": {
          "type": "long"
      }
    }
  }

This is great for query speed specifically for listings OR (listings AND developments). However, if you need to search just developments, this adds a lot of overhead. In that case, duplicating the data further into a development-only index makes sense. This maximizes query performance. The downside is that there's more to maintain in terms of how the data is loaded and how the mappings evolve. It sort of assumes you know your query use cases well enough to go ahead and pre-join in anticipation of those queries.
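For comparison, here is roughly how the earlier three-bedroom query would look against the flattened listing mapping. Since every document is already a single listing with its development fields copied in, no nested clause is needed:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "locality": "New York" } },
        { "term": { "bedrooms": 3 } },
        { "range": { "price": { "lt": 300000 } } }
      ]
    }
  }
}
```

Each hit here is a listing, so if you need one result per development you would have to group them yourself or in the query.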

One great reason to consider a flattened listings mapping like the one above AND a mapping that only contains developments that have at least one listing is if you have different views you need to drive in an application. Say you have an application entry point with all of the developments that have listings as the default view. To make that as responsive as possible, I would consider an index of just the developments that have at least one listing, and pre-join the data in your data processing to do so (as in insert / upsert to that index any time you see a listing for that development).
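One way to do that insert / upsert is the update API with doc_as_upsert, which creates the document if it is missing and merges the fields into it otherwise. A sketch of the request body, assuming you send it to the _update endpoint of a hypothetical developments-only index and use the development's own ID as the document ID (the index name and the values are placeholders):

```json
{
  "doc": {
    "locality": "New York",
    "geometry": { "type": "Point", "coordinates": [-73.98, 40.75] }
  },
  "doc_as_upsert": true
}
```

Running this every time your pipeline sees a listing means a development shows up in that index as soon as it has its first listing. Removing developments whose last listing goes away is a separate step you would need to handle.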

Alternatively, you could model developments and listings as separate mappings and join them in your application logic after performing two queries. Frankly, I don't typically recommend this unless you are building a reporting application with a lot of unknown queries where slow performance is expected, OR the total data is relatively small in comparison to your application hardware. This is definitely the most flexible of the three options, though.
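As a sketch of what that second query could look like: suppose the first query ran against a developments index filtered on locality, and your application collected the IDs of the hits. The listings index would need a field pointing back at the development, here hypothetically named development_id and mapped as keyword. The IDs shown are placeholders for whatever the first query returned:

```json
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "development_id": ["dev-1", "dev-2"] } },
        { "term": { "bedrooms": 3 } },
        { "range": { "price": { "lt": 300000 } } }
      ]
    }
  }
}
```

Your application then stitches each listing onto its development in memory, which is where the extra latency comes from, especially when the first query returns a lot of IDs.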
