Best practice for modelling geo data - fewer objects with nested fields vs. more objects with duplication

Hello, my dear Elasticians, I miss you dearly. :wave:

On my new adventure, I encountered a data modelling dilemma and thought I'd ask the experts what they think.

We're ingesting large geospatial data sets and trying to identify the ideal way to normalise this data in ES.

Specifically, I have this dilemma.

Assuming our schema allows for input like this, where the entities field can be a variable-length array and we expect requests to vary widely in the number of entities sent...

{
	"location": {
		"type": "Polygon",
		"coordinates": [
			[
				[100.0, 0.0],
				[101.0, 0.0],
				[101.0, 1.0],
				[100.0, 1.0],
				[100.0, 0.0]
			]
		]
	},
	"what": {
		"entities": [{
			// all entities share the same schema
			// some requests will have just a handful of entities
			// but some might have 100s of entities in this array
			"entity_name": "...",
			"many_more_field": "..."
		}]
	}
}

... are we better off:

  1. Storing the entities field as a nested field, so that each geo feature is a single (potentially large) document? (Mapping sketched below.)
  2. Flattening our documents so that each document contains only one entity and duplicates all the shared data, meaning there are multiple geo features that intersect, rather than multiple entities in a single feature? (Example document sketched below.)
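
To make that concrete, here is roughly what I have in mind for each option. Only location, entities and entity_name come from our actual schema; the index names, the timestamp field and the keyword/date types are placeholder assumptions on my part.

Option (1) - one document per geo feature, entities stored as a nested field:

PUT /geo-features
{
	"mappings": {
		"properties": {
			"timestamp": { "type": "date" },
			"location": { "type": "geo_shape" },
			"what": {
				"properties": {
					"entities": {
						"type": "nested",
						"properties": {
							"entity_name": { "type": "keyword" }
						}
					}
				}
			}
		}
	}
}

Option (2) - one document per entity, with the shared data duplicated into each one:

POST /geo-entities/_doc
{
	"timestamp": "2024-01-01T00:00:00Z",
	"location": {
		"type": "Polygon",
		"coordinates": [
			[
				[100.0, 0.0],
				[101.0, 0.0],
				[101.0, 1.0],
				[100.0, 1.0],
				[100.0, 0.0]
			]
		]
	},
	"entity_name": "..."
}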

It's obvious to me that (1) is better from:
a. a storage perspective
b. an effort perspective, as it means I can easily tell that entities originate from the same request, and I only have one geo feature to think about

But let's assume my main concern is optimising for read performance, and that I can rely on the geo data (intersection queries) and the time span to correlate documents - would it make sense to follow approach (2) and duplicate data rather than have a potentially huge nested field?
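
For what it's worth, the kind of read I'd be optimising for under (2) looks roughly like this - filter by a time window, then match everything whose shape intersects the query polygon (again, the index name and the timestamp field are just placeholders I've made up for illustration):

GET /geo-entities/_search
{
	"query": {
		"bool": {
			"filter": [
				{
					"range": {
						"timestamp": {
							"gte": "2024-01-01T00:00:00Z",
							"lte": "2024-01-01T01:00:00Z"
						}
					}
				},
				{
					"geo_shape": {
						"location": {
							"shape": {
								"type": "Polygon",
								"coordinates": [
									[
										[100.0, 0.0],
										[101.0, 0.0],
										[101.0, 1.0],
										[100.0, 1.0],
										[100.0, 0.0]
									]
								]
							},
							"relation": "intersects"
						}
					}
				}
			]
		}
	}
}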

Any advice would be highly appreciated. :slight_smile:

Thanks!
