Recommendations to organize index with Master-detail data to be queried with lots of aggregations?


(Xavi Ametller) #1

Hi everyone,

Disclosure

  • I'm new to ES (one week) so feel free to correct me anywhere.
  • I am running a proof of concept and have done zero performance optimization. I want to check that I'm on the right track before I start performance tuning.

My use case is as follows:

  1. Set of data which can be categorized as master-detail. There are around 4K masters with 10M details, so each master has on average 2500 details.
  2. I want to create a "search" page displaying the available facets to the user. For each facet value (refinement) I want to display the number of master elements that match it.
  3. I may have between 20 and 30 facets.
  4. I want it fast :smile:

I toyed around with ES and started plain and simple: I create one document in ES for each detail (so 10M documents in total). On each document I include both the master and the detail information (so yes, I repeat a lot of information 2500 times).
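The flattening described above can be sketched in Python as follows. The field names (`master_id`, `brand`, `detail_id`, `color`) and the `flatten` helper are hypothetical, chosen only for illustration; the real mapping is not shown in this post.

```python
def flatten(masters, details):
    """Yield one document per detail, with the master's fields copied in."""
    masters_by_id = {m["master_id"]: m for m in masters}
    for detail in details:
        master = masters_by_id[detail["master_id"]]
        # Master fields are repeated on every detail document.
        yield {**master, **detail}


# Hypothetical sample data: one master with two details.
masters = [{"master_id": 1, "brand": "acme"}]
details = [
    {"detail_id": 10, "master_id": 1, "color": "red"},
    {"detail_id": 11, "master_id": 1, "color": "blue"},
]
docs = list(flatten(masters, details))
```

Each resulting document carries its master's id, which is what makes counting distinct masters per facet value possible on a flat index.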
With this structure, I can create a query like this:

GET /{index}/{type}/_search
{
  "size": { ... },
  "sort": { ... },
  "query": { ... },
  "aggs": {
    # This structure is repeated 20-30 times, once for each facet I want to allow filtering by
    "first_filter": {
      "terms": {
        "field": "first_filter_field"
      },
      "aggs": {
        "distinct_master_field_id": {
          "cardinality": {
            "field": "master_field_id"
          }
        }
      }
    },
    "second_filter": {
      "terms": {
        "field": "second_filter_field"
      },
      "aggs": {
        "distinct_master_field_id": {
          "cardinality": {
            "field": "master_field_id"
          }
        }
      }
    },
    ....
  }
}
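Since the same terms-plus-cardinality block is repeated for every facet, the `aggs` section could be generated rather than written by hand. A minimal Python sketch, assuming the facet field names are kept in a list; `build_facet_aggs` and the `<field>_facet` naming are hypothetical and not part of any Elasticsearch client:

```python
import json


def build_facet_aggs(facet_fields, master_id_field="master_field_id"):
    """Build one terms aggregation per facet field, each with a
    cardinality sub-aggregation counting distinct masters per bucket."""
    aggs = {}
    for field in facet_fields:
        aggs[field + "_facet"] = {
            "terms": {"field": field},
            "aggs": {
                "distinct_master_field_id": {
                    "cardinality": {"field": master_id_field}
                }
            },
        }
    return aggs


body = {"aggs": build_facet_aggs(["first_filter_field", "second_filter_field"])}
print(json.dumps(body, indent=2))
```

The resulting dictionary only covers the `aggs` part of the request body; `size`, `sort` and `query` would be added alongside it.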

The thing is, I'm far from the performance I would like to have: queries take seconds and I was aiming for milliseconds. My question is: is this the best (or at least a good) way to organize and query this information for such a use case?

Considerations:

  • [Parent child relationships][1] - This data structure seems to match my scenario, but I am not worried about the cost of modifying the index; what really concerns me is performance. According to the documentation, performance is worse with parent-child structures.
  • Am I abusing cardinality sub-aggregations? I'm always applying the same cardinality sub-aggregation, so maybe there's a better way to organize the query.
  • The option to scale horizontally is always there, but I don't want to go there from the very beginning (although I may have to...).

Thank you all for your time. Any feedback will be appreciated!
Xavi
[1]: https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html

