Sort based on aggregation


(Cristiana Tiugan) #1

Hello

I am having an issue with sorting documents based on aggregation result.
The scenario is the following:
I have an index called messages.
A message can have multiple messages. (one level a message can have only children, not grandsons)
I didn't make a relation child-parent, but I have a column called resourceId that represents the parent.
I want to filter all the parent messages return them based on the number of children.

I tried first to filter the results and then sort them by making a new request with a term aggregation based on resourceId, but what if the result that is most popular isn't in my filtered results? because I also have set a size?

I also tried first to make the aggregation, but again I can't actually filter the results because the filtering is made on parent messages.

Any other ideas that might help me?

PS: I am using elasticsearch 2.3


(Christoph) #2

Hi,

I'm trying to understand your problem, but can you try to explain again why using a terms aggregation on the child documents "resourceId" field is not working? As far as I understand your example, you are trying to answer the question "Give me the N parent messages that have most children". That should work with a terms aggregation using the size parameter. I didn't understand the additional filtering part. Can you give a concrete example, maybe with some small example documents that have a similar structure like the ones you are using?


(Cristiana Tiugan) #3

This is my index structure:

 "messageboard": {
      "mappings": {
         "message": {
            "dynamic": "strict",
            "_all": {
               "enabled": false
            },
            "dynamic_date_formats": [
               "date_hour_minute_second_millis"
            ],
            "properties": {
               "body": {
                  "type": "string",
                  "index": "no",
                  "fields": {
                     "contains": {
                        "type": "string",
                        "analyzer": "index_fulltext_analyzer",
                        "search_analyzer": "search_fulltext_analyzer"
                     }
                  }
               },
               "bodyWordCount": {
                  "type": "integer"
               },
               "contextId": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "resourceId": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               "createdDate": {
                  "type": "date",
                  "format": "strict_date_optional_time||epoch_millis"
               },
               "deletedBy": {
                  "type": "string",
                  "index": "not_analyzed"
               },
               },
               "ownerUserId": {
                  "type": "string",
                  "index": "not_analyzed"
               }
      }
}
}

I need to be able to also filter the messages, let's say return the documents that belong to ownerUserId, ar not deleted.
I identify the parents by not having resourceId and identify the children by having resourceId
I tried this query and it didn't work:

GET messageboard/message/_search 
{
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "must": [
                {
                  "not": {
                    "term": {
                      "authorId": "1"
                    }
                  }
                },
                {
                  "term": {
                    "contextId": "a512db2741cd20693e4b16f19891e72b9ff12cead72761fc5e92d2aaf34740c1"
                  }
                },
                {
                  "not": {
                    "exists": {
                      "field": "resourceId"
                    }
                  }
                }
              ]
            }
          },
          {
            "bool": {
              "must_not": [
                {
                  "term": {
                    "deletedBy": "161"
                  }
                }
              ]
            }
          }
        ]
      }
    },
     "aggs": {
      "message_count": {
        "terms": {
          "field": "resourceId",
          "size": 15
        },
        "aggs": {
          "message_last_top": {
            "top_hits": {
              "size": 2,
              "sort": {
                "createdDate": "desc"
              }
            }
          }
        }
      }
    },
    "size": "30"
}

Because the first query eliminates the documents that have resourceId, and the message_count needs the resourceId


(Christoph) #4

Hi,

I took the liberty to edit your post to have some structure for the json part. You can do that by inserting 3 backticks (`) before and after your code.

From what I can see, the problem in your approach so far is that you haven't really modelled the parent-child relationship but use the "resourceId" field for that. For the aggregation to work you cannot filter out only the documents without the "resourceId". If the "child" documents also have the same criteria you filter on (i.e. authorId, contextId etc.), can you aggregate only on the child-documents (those that have 'resourceId'? That would give you the top N parent messages, which you would still have to fetch but that should be a relatively quick second query.
On a related note, the query part looks a bit complicated, which version of ES are you using? Are you sure the filtering part works correctly?


(Cristiana Tiugan) #5

I know it is a bit complicated, because I am building the queries dynamically, it is on the list to improve the process.
The filtering part works correctly,
Not really, I want the parent messages that have me as author (for example). This doesn't mean that the replies are mine


(Christoph) #6

From all that I see here so far, I would suggest you investigate using the has_child query although this will mean you will have to start modelling your data using a parent-child relationship. Other workarounds will most likely mean you have to use several consecutive queries and combine the results manually.

One such workaround could be:

  • first get the "parten" messages that satisfy your filter criterion
  • then use their Ids as a filter on the "resourceId" on the child messages and aggregate those on "resourceId" to get the top N parent ids
  • maybe you also need to fetch those again

This is a rough sketch, I doubt you can do this with only one query. Even with parent-child relationship you might need two queries, but I'm interested in anybody elses opinion on this.


(Cristiana Tiugan) #7

Yes, the workaround solution I've implemented already, but I wasn't sure about it. I will talk with my team.
If I can't find any other solution, I will think about modelling a parent-child relationship.

Thanks a lot for your help!


(Christoph) #8

Hi,

If you already have a working solution I would also test if modelling parent-child is really worth the effort, maybe things get overly complicated that way and issuing a few queries instead is good enough in terms of performance.


(system) #9