Deduplication - Nested, Parent/Child OR None

(Imran Siddique) #1

Hi there,
We have a scenario where the documents we index may be present in multiple folders with the same content. At indexing time we can easily detect these duplicates because we have a content hash. The duplication factor we sometimes see is greater than 20. We don't store _source, and we store only a few fields.
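A minimal sketch of the detection step described above, assuming SHA-256 as the content hash and an in-memory list of (path, content) pairs. The function names and sample paths are hypothetical and only illustrate grouping duplicates by hash before indexing.

```python
import hashlib
from collections import defaultdict


def content_hash(content: bytes) -> str:
    """Return a hex SHA-256 digest of the raw document content."""
    return hashlib.sha256(content).hexdigest()


def group_by_content(files):
    """Group (path, content) pairs by content hash.

    Returns a dict mapping hash -> list of paths, in input order.
    """
    groups = defaultdict(list)
    for path, content in files:
        groups[content_hash(content)].append(path)
    return dict(groups)


# Hypothetical sample data: two folders hold the same content.
files = [
    ("/a/report.txt", b"quarterly numbers"),
    ("/b/report-copy.txt", b"quarterly numbers"),
    ("/c/notes.txt", b"meeting notes"),
]
groups = group_by_content(files)
```

Each key in `groups` is one unique piece of content, and the length of its path list is the duplication factor for that content.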

In the normal case (no duplicates) the size of the index is reasonable. When we move to nested or parent/child and start storing _source, the index grows to more than 5 times that size and query performance also slows down.

Say, for the time being, we don't care much about the number of documents in the index or the size of the index. Is doing no deduplication and storing redundant data in ES fine when compared to taking one of the nested / parent-child approaches?
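For the "no deduplication" option, one way to still return a single hit per unique content at query time is field collapsing. This is only a sketch: it assumes an Elasticsearch version that supports the `collapse` search parameter (5.3 or later), and the field names `body` and `content_hash` are hypothetical, with `content_hash` indexed as a keyword field with doc values.

```python
def build_collapsed_search(query_text, hash_field="content_hash"):
    """Build a search body that returns one top hit per content hash.

    Assumes each duplicate is indexed as its own flat document and
    carries the same value in `hash_field`.
    """
    return {
        "query": {"match": {"body": query_text}},
        # Collapse duplicates: keep only the best hit per hash value.
        "collapse": {"field": hash_field},
    }


search_body = build_collapsed_search("quarterly numbers")
```

The body would be sent to the index's `_search` endpoint; the redundant documents stay in the index, but each content hash appears once in the results.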

P.S. Our overall document size is far larger than the size of the stored fields.


(Imran Siddique) #2

Some more data:

  1. I tried the nested format and query performance was 10x slower. The major changes were the format (mapping) and that I started storing _source, as I wanted to support upsert.
  2. I tried the parent-child format, but there was a limitation: I now had to fire more than one query, one to match the parents and another to fetch the related children of those parents.
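On the second point, a possible way to avoid the extra round trip is to combine the parent query with a `has_child` clause that requests `inner_hits`, so the children come back alongside each matching parent. The sketch below only builds the request body; the child type name `location`, the field name `body`, and the child limit are assumptions, and whether this performs acceptably on a given index is untested here.

```python
def build_parent_with_children_search(query_text,
                                      child_type="location",
                                      max_children=25):
    """Build one search body that matches parents and returns children.

    The `has_child` clause uses match_all, so it only requires that a
    parent has at least one child; `inner_hits` asks Elasticsearch to
    attach up to `max_children` children to each parent hit.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Match on the parent's own content.
                    {"match": {"body": query_text}},
                    # Pull the related children into the same response.
                    {
                        "has_child": {
                            "type": child_type,
                            "query": {"match_all": {}},
                            "inner_hits": {"size": max_children},
                        }
                    },
                ]
            }
        }
    }


search_body = build_parent_with_children_search("quarterly numbers")
```

Parents without any child would not match this query, so it fits only if every unique content document has at least one location child.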
