Best practice for modelling geo data - fewer objects with nested fields vs. more objects with duplication

Hello, my dear Elasticians, I miss you dearly. :wave:

On my new adventure, I encountered a data modelling dilemma and thought I'd ask the experts what they think.

We're ingesting large geospatial data sets and trying to identify the ideal way to normalise this data in ES.

Specifically, I have this dilemma.

Assuming our schema allows for input like this, where the entities field can be a variable-length array and we expect requests to vary widely in the number of entities sent...

{
	"location": {
		"type": "Polygon",
		"coordinates": [
			[
				[100.0, 0.0],
				[101.0, 0.0],
				[101.0, 1.0],
				[100.0, 1.0],
				[100.0, 0.0]
			]
		]
	},
	"what": {
		"entities": [{
			// all entities share the same schema
			// some requests will have just a handful of entities
			// but some might have 100s of entities in this array
			"entity_name": "...",
			"many_more_field": "..."
		}]
	}
}

... are we better off:

  1. Storing the entities field as a nested field, so that each geo feature is a single (potentially large) document? (Mapping sketched below.)
  2. Flattening our documents so that each document contains only one entity and duplicates all the shared data, meaning there are multiple geo features that intersect, rather than multiple entities in a single feature? (Example document sketched below.)
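
To make that concrete, here is roughly what I have in mind for each option. Only location, entities and entity_name come from our actual schema; the index names, the timestamp field and the keyword/date types are placeholder assumptions on my part.

Option (1) - one document per geo feature, entities stored as a nested field:

PUT /geo-features
{
	"mappings": {
		"properties": {
			"timestamp": { "type": "date" },
			"location": { "type": "geo_shape" },
			"what": {
				"properties": {
					"entities": {
						"type": "nested",
						"properties": {
							"entity_name": { "type": "keyword" }
						}
					}
				}
			}
		}
	}
}

Option (2) - one document per entity, with the shared data duplicated into each one:

POST /geo-entities/_doc
{
	"timestamp": "2024-01-01T00:00:00Z",
	"location": {
		"type": "Polygon",
		"coordinates": [
			[
				[100.0, 0.0],
				[101.0, 0.0],
				[101.0, 1.0],
				[100.0, 1.0],
				[100.0, 0.0]
			]
		]
	},
	"entity_name": "..."
}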

It's obvious to me that (1) is better from:
a. a storage perspective
b. an effort perspective, as it means I can easily tell that entities originate from the same request, and I only have one geo feature to think about

But let's assume my main concern is optimising for read performance, and that I can rely on the geo data (intersection queries) and the time span to correlate documents - would it make sense to follow approach (2) and duplicate data rather than have a potentially huge nested field?
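
For what it's worth, the kind of read I'd be optimising for under (2) looks roughly like this - filter by a time window, then match everything whose shape intersects the query polygon (again, the index name and the timestamp field are just placeholders I've made up for illustration):

GET /geo-entities/_search
{
	"query": {
		"bool": {
			"filter": [
				{
					"range": {
						"timestamp": {
							"gte": "2024-01-01T00:00:00Z",
							"lte": "2024-01-01T01:00:00Z"
						}
					}
				},
				{
					"geo_shape": {
						"location": {
							"shape": {
								"type": "Polygon",
								"coordinates": [
									[
										[100.0, 0.0],
										[101.0, 0.0],
										[101.0, 1.0],
										[100.0, 1.0],
										[100.0, 0.0]
									]
								]
							},
							"relation": "intersects"
						}
					}
				}
			]
		}
	}
}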

Any advice would be highly appreciated. :slight_smile:

Thanks!
