What to do when flattening data is too expensive (another join question)

Ian_Simpson · July 24, 2024, 8:10pm

We are in the process of brainstorming ways to redesign some of our indices. We have a 3-tier data structure that looks something like this:

Root Entity (light text meta-data with some dates)
|- File Attachments (large text chunks ingested from PDFs, Word, etc)
|- Sub Entity (light text meta with dates and numerical data to aggregate over)
   |- Sub-Sub Entity (light text meta with dates and numerical data to aggregate over)

The total number of Sub and Sub-Sub entities under a Root may add up to more than 10k. However, the number of File Attachments per Root can be limited to fewer than 10k.

We have a requirement to group and filter by what's in the Root Entity, plus filter by what's in the attachments, while aggregating over the numerical data in the Sub or Sub-Sub entity.

My proposed solution is to have separate indices for Root, Sub, and Sub-Sub entities, but include File Attachments as nested documents under the Root. The Sub and Sub-Sub indices will also include nested Root documents that are copies of the Root entity they belong to. This will allow me to filter and group by Root data when looking at Sub and Sub-Sub aggregates.
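To make the layout concrete, here is a rough sketch of what the Sub index mapping and one aggregation request might look like, written as plain Python dicts. All field names (`sub_id`, `amount`, `root.category`, etc.) are made up for illustration; only the mapping types and the `nested` / `reverse_nested` aggregations are standard Elasticsearch features. Because the Root copy is a nested document, grouping by a Root field has to step into the nested context and then back out with `reverse_nested` to reach the Sub-level numbers.

```python
# Hypothetical Sub index mapping: each Sub document carries a nested copy
# of its Root. Field names are assumptions, not taken from a real schema.
SUB_INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "sub_id": {"type": "keyword"},
            "created": {"type": "date"},
            "amount": {"type": "double"},
            "root": {
                "type": "nested",
                "properties": {
                    "root_id": {"type": "keyword"},
                    "category": {"type": "keyword"},
                    "opened": {"type": "date"},
                },
            },
        }
    }
}


def sum_amount_by_root_category(opened_after: str) -> dict:
    """Build a search body that filters on Root data and sums Sub amounts
    per Root category (field names are hypothetical)."""
    return {
        "size": 0,  # only aggregation results are needed
        "query": {
            "bool": {
                "filter": [
                    {
                        "nested": {
                            "path": "root",
                            "query": {"range": {"root.opened": {"gte": opened_after}}},
                        }
                    }
                ]
            }
        },
        "aggs": {
            "root": {
                "nested": {"path": "root"},
                "aggs": {
                    "by_category": {
                        "terms": {"field": "root.category"},
                        "aggs": {
                            # Step back out to the Sub document to reach "amount".
                            "sub": {
                                "reverse_nested": {},
                                "aggs": {"total": {"sum": {"field": "amount"}}},
                            }
                        },
                    }
                },
            }
        },
    }
```

The same shape would apply to the Sub-Sub index, with its own numeric fields in place of `amount`.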

However, I will not be able to include File Attachments as nested docs in the Sub and Sub-Sub indices, because this would require too much disk space. My plan to enable searching over the file data is to impose a two-step process: I search the attachment data first to build a list of Root entity IDs, which are then fed into the search over the Sub data to accomplish the filtering. This is fine, but I will need to use the scroll or pagination features to crawl through all possible hits to handle the situation where my file attachment query exceeds 10k hits.
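The two-step process could be sketched roughly as follows. This is only an outline under several assumptions: `search` is a hypothetical callable standing in for whatever client call is used (it takes an index name and a request body and returns the parsed response), the Root document `_id` is the Root entity ID, and the field names `attachments.content`, `root_id`, and `root.root_id` are invented. It pages through the attachment hits with `search_after`, then splits the IDs into batches, since a `terms` query is capped by `index.max_terms_count` (65,536 by default).

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Hypothetical signature: search(index_name, request_body) -> parsed response.
SearchFn = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def iter_root_ids(search: SearchFn, text: str, page_size: int = 1000) -> Iterator[str]:
    """Step 1: page through every Root whose attachments match `text`."""
    body: Dict[str, Any] = {
        "size": page_size,
        "_source": False,  # only the IDs are needed
        "query": {
            "nested": {
                "path": "attachments",
                "query": {"match": {"attachments.content": text}},
            }
        },
        # search_after needs a deterministic sort; root_id is assumed unique.
        "sort": [{"root_id": "asc"}],
    }
    while True:
        hits = search("root", body)["hits"]["hits"]
        if not hits:
            return
        for hit in hits:
            yield hit["_id"]
        # Resume after the sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split the ID stream into batches small enough for one terms query."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def root_id_filter(root_ids: List[str]) -> Dict[str, Any]:
    """Step 2: filter clause for the Sub index, matching the nested Root copy."""
    return {
        "nested": {
            "path": "root",
            "query": {"terms": {"root.root_id": root_ids}},
        }
    }
```

Each batch from `chunked(iter_root_ids(search, "some phrase"), 10_000)` would go into its own Sub query via `root_id_filter`, with the per-batch results combined on the client. That works for additive metrics such as sums and counts, but metrics like percentiles or cardinality cannot simply be added across batches.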

Given my situation, is this the best solution? That is, is a two-step query with a massive ID array fed into a filter clause the best approach when you need to accomplish a "join" and the data being searched is too large to flatten into whatever data you're aggregating? Also, is including parents as nested docs the best solution when children exceed the 10k nested doc limit?
