I have a question about performance when mapping documents with a parent-child relationship and a large number of children. The parent represents an entity, and the children are sets of time series related to that parent. In total I have 3.5 million parent documents (~4.5GB of storage) and ~400 million child documents (a couple of sets of time series, ~280GB). Users apply a number of facets to the parent documents and then view aggregations of either parent document attributes or time-series child attributes.
~50% of the time, the user only needs to see aggregations over the filtered parent documents. Most importantly, these parent documents are used in a mapping application, so queries, faceting, and aggregations need to be blazing fast.
The other 50% of the time, when users look at the child time-series data, they generally reduce the document set by applying filters to the parent documents until the result set is <1% of the total, then look at various time-series aggregations on the child documents. Importantly, all filters are applied to parent documents only; there are no analyzed string fields or the like in the child documents.
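For concreteness, the documents look roughly like this (a simplified sketch with placeholder field names, not my real schema):

```python
# Simplified sketch of the document shapes (placeholder field names).
# Parent: one entity carrying the attributes users facet/filter on.
parent_doc = {
    "entity_id": "abc-123",
    "region": "north",                      # not_analyzed string facet
    "category": "type-a",                   # not_analyzed string facet
    "location": {"lat": 52.1, "lon": 5.2},  # used by the mapping application
}

# Child: one point in one of the time series attached to that entity.
child_doc = {
    "series": "temperature",
    "timestamp": "2015-03-01T00:00:00Z",
    "value": 17.4,
}
```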
My first approach was unsuccessful. I started by creating a single index and mapping the child documents as the nested type, but soon found that parent-only query/aggregation performance degraded. Fielddata (even using doc_values) and filter cache memory usage were very high, and parent-only aggregations appeared to slow down roughly linearly with the total document count.
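For reference, the first approach looked roughly like this (a minimal sketch using the Python client; index and field names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Approach 1 (sketch): a single index, with the time series mapped as a
# nested field on the parent type.
es.indices.create(index="entities_nested", body={
    "mappings": {
        "entity": {
            "properties": {
                "region":   {"type": "string", "index": "not_analyzed"},
                "category": {"type": "string", "index": "not_analyzed"},
                "timeseries": {
                    "type": "nested",
                    "properties": {
                        "series":    {"type": "string", "index": "not_analyzed"},
                        "timestamp": {"type": "date", "doc_values": True},
                        "value":     {"type": "double", "doc_values": True},
                    },
                },
            },
        },
    },
})
```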
Given these issues, I tried a different approach. I created two indexes on the same node. The first has parent-only mappings and no child documents. Fielddata and filter cache memory is tiny given the small document set, and aggregations, etc. are much faster and more predictable. The second index includes both parent and child documents and is used only when the user actually needs aggregations on the nested documents; it needs almost no fielddata. Once the user selects facets on the parent documents, I simply apply those as filters to the second index (mostly on index=not_analyzed string fields). This does result in a large filter cache on the second index, which makes sense given that each cached bitset grows linearly with the number of documents in the index.
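In other words, the queries now look something like this (again a sketch with placeholder names): the fast path hits the lean parent-only index, and the combined index is queried only when child aggregations are needed, with the user's parent facets re-applied as filters:

```python
# Fast path (~50% of requests): filters and aggregations on the parent-only index.
parent_resp = es.search(index="entities", body={
    "size": 0,
    "query": {"filtered": {"filter": {"term": {"region": "north"}}}},
    "aggs": {"by_category": {"terms": {"field": "category"}}},
})

# Child-aggregation path: the same parent facets are re-applied as filters on
# the combined index, then a nested aggregation runs over the time series.
child_resp = es.search(index="entities_nested", body={
    "size": 0,
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"region": "north"}},
                        {"term": {"category": "type-a"}},
                    ]
                }
            }
        }
    },
    "aggs": {
        "ts": {
            "nested": {"path": "timeseries"},
            "aggs": {"avg_value": {"avg": {"field": "timeseries.value"}}},
        }
    },
})
```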
Hardware-wise, I'm running a single node while in development. I assume I could throw more hardware at the issue, but I am most interested in knowing:
- Is there a more efficient way to guarantee high parent-document-only performance for queries, filters, and aggregations? For example, would it make sense to have a dedicated node just for the parent-only index, so that all resources and even the OS cache are dedicated to the parent documents (a sketch of how I imagine pinning the index to such a node is after this list)? With 3.5 million documents (~4.5GB storage) I assume a node could keep the entire document set in RAM on an 8GB machine.
- Am I correct in assuming my first approach was slower for answering parent-only queries because nested documents are indexed as separate (hidden) Lucene documents, so the first filter bitset immediately has to exclude all 400 million child documents before the parent query is even applied? Even though filters are fast, I assume they still add some overhead vs. an index of only parent documents?
- I can imagine that a parent/child relationship describes this data set more naturally, but since I am more concerned with performance I decided to use the nested relationship (the parent/child mapping I have in mind is sketched after this list). Could this be an issue?
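Regarding the dedicated-node idea in the first question: if that is a sensible direction, my understanding is that shard allocation filtering would do it, along these lines (assuming the dedicated node is started with a custom attribute such as node.box_type: parents in its elasticsearch.yml; names are placeholders):

```python
# Sketch: pin the parent-only index to the dedicated node via shard allocation
# filtering. Assumes that node was started with node.box_type: parents.
es.indices.put_settings(index="entities", body={
    "index.routing.allocation.require.box_type": "parents",
})
```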
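And for the last question, the parent/child alternative I have in mind would be mapped roughly as follows (a sketch with placeholder names; as I understand it, the children live in their own type and reference the parent type via _parent, so the join is resolved at query time rather than by storing the children in the same Lucene block as the parent):

```python
# Sketch of the parent/child alternative: children are a separate type that
# declares the parent type via _parent.
es.indices.create(index="entities_pc", body={
    "mappings": {
        "entity": {
            "properties": {
                "region":   {"type": "string", "index": "not_analyzed"},
                "category": {"type": "string", "index": "not_analyzed"},
            },
        },
        "timeseries_point": {
            "_parent": {"type": "entity"},
            "properties": {
                "series":    {"type": "string", "index": "not_analyzed"},
                "timestamp": {"type": "date", "doc_values": True},
                "value":     {"type": "double", "doc_values": True},
            },
        },
    },
})

# Each child is indexed with its parent's id, which also routes it to the
# parent's shard.
es.index(index="entities_pc", doc_type="timeseries_point", parent="abc-123",
         body={"series": "temperature",
               "timestamp": "2015-03-01T00:00:00Z",
               "value": 17.4})
```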