Deduplication - Nested, Parent/Child OR None

mosiddi · July 6, 2015, 11:20am

Hi there,
We have a scenario where the documents we index may be present in multiple folders with the same content. At indexing time we can easily detect these duplicates because we have a content hash. The duplication factor we sometimes see is greater than 20. We don't store _source; we store only a few fields.
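
To make the setup concrete, here is a minimal sketch of detecting duplicates by content hash and measuring the duplication factor. The hash function (SHA-1), the helper names, and the sample paths are assumptions for illustration only, not our actual pipeline.

```python
import hashlib
from collections import defaultdict


def content_hash(content: bytes) -> str:
    # SHA-1 is an assumption; any stable content hash would do.
    return hashlib.sha1(content).hexdigest()


def group_by_content(files):
    """Group folder paths by the hash of their content.

    `files` is an iterable of (path, content_bytes) pairs.
    Returns {hash: [path, ...]}.
    """
    groups = defaultdict(list)
    for path, content in files:
        groups[content_hash(content)].append(path)
    return dict(groups)


def duplication_factor(groups) -> float:
    # Average number of copies per unique piece of content.
    total = sum(len(paths) for paths in groups.values())
    return total / len(groups) if groups else 0.0


# Hypothetical sample: the same report lives in two folders.
files = [
    ("/a/report.txt", b"quarterly numbers"),
    ("/b/report.txt", b"quarterly numbers"),
    ("/c/notes.txt", b"meeting notes"),
]
groups = group_by_content(files)
factor = duplication_factor(groups)
```

With "no deduplication", every path becomes its own ES document; with nested or parent/child, each hash group becomes one unit.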

Without duplicates, the index size is reasonable. When we move to nested or parent/child and start storing _source, the index grows to more than 5 times the size and query performance also slows down.
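
For reference, a rough sketch of the two mapping shapes being compared, written as Python dicts that would be sent as the index-creation body. The type names, field names, and the 1.x-style `string` / `not_analyzed` settings are assumptions for illustration; the exact syntax depends on the Elasticsearch version.

```python
# Hypothetical field definition reused below (exact-match string, stored).
KEYWORD_STORED = {"type": "string", "index": "not_analyzed", "store": True}

# Flat layout: one ES document per (folder path, content) pair,
# _source disabled and only a few fields stored.
flat_mapping = {
    "mappings": {
        "file": {
            "_source": {"enabled": False},
            "properties": {
                "content_hash": dict(KEYWORD_STORED),
                "path": dict(KEYWORD_STORED),
                "body": {"type": "string"},
            },
        }
    }
}

# Nested layout: one ES document per unique content, with every
# folder location held as a nested object. _source stays enabled
# (the default) so that updates/upserts can work.
nested_mapping = {
    "mappings": {
        "content": {
            "properties": {
                "content_hash": dict(KEYWORD_STORED),
                "body": {"type": "string"},
                "locations": {
                    "type": "nested",
                    "properties": {"path": dict(KEYWORD_STORED)},
                },
            }
        }
    }
}
```

The flat layout repeats `body` once per copy, while the nested layout indexes it once but must keep _source, which is where the extra size comes from in our case.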

Say, for the time being, we don't care much about the number of documents in the index or the size of the index. Is doing no deduplication and storing redundant data in ES fine compared to taking one of the nested / parent-child approaches?

P.S. Our overall document size is much larger than the size of the stored fields.

Regards,
Imran

mosiddi · July 7, 2015, 10:32am

Some more data -

I tried the nested format and query performance was 10x slower. The major changes were the format (mapping) and that I started storing _source, because I wanted to support upsert.
I tried a parent-child format but there was a limitation: I now had to fire more than one query, one to match the parents and a second to fetch the related children of those parents.
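
The two-step parent-child pattern looks roughly like the sketch below. It only builds the request bodies and does not call Elasticsearch; the parent type `content`, the field `body`, and the helper names are hypothetical, and the `has_parent` / `ids` syntax shown may differ between Elasticsearch versions.

```python
def parent_query(text):
    # Step 1: match parent documents (one per unique content).
    return {"query": {"match": {"body": text}}}


def extract_ids(response):
    # Pull the matched parent ids out of a search response.
    return [hit["_id"] for hit in response["hits"]["hits"]]


def children_query(parent_ids, parent_type="content"):
    # Step 2: fetch the children (folder locations) of those parents.
    return {
        "query": {
            "has_parent": {
                "parent_type": parent_type,
                "query": {"ids": {"values": list(parent_ids)}},
            }
        }
    }


# Hypothetical response from step 1, trimmed to the fields used here.
fake_response = {"hits": {"hits": [{"_id": "h1"}, {"_id": "h2"}]}}

first = parent_query("quarterly numbers")
second = children_query(extract_ids(fake_response))
```

The second request cannot be built until the first one returns, so every search costs two round trips.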
