How to generate all permutations of child documents in Elasticsearcherch


(ergo) #1

Hi I have following schema of user objects:

curl -X PUT "localhost:9200/test_example" -H 'Content-Type: application/json' -d'
{
    "mappings": {
        "answer" : {
            "_parent" : {
                "type" : "user"
            }
        }
    }
}
'

curl -X PUT "localhost:9200/test_example/user/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index": {"_id": 1} }
{ "user": 1, "foo": 1}
{ "index": {"_id": 2} }
{ "user": 2, "foo": 2}
{ "index": {"_id": 3} }
{ "user": 3, "foo": 1}
{ "index": {"_id": 4} }
{ "user": 4, "foo": 1, "answers": {"5": [6,3], "6":[6], "7":[6]}, "potential_approach": true}
'

curl -X PUT "localhost:9200/test_example/answer/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index": { "parent": 1}}
{"question":5, "answer":6, "negative": false}
{ "index": { "parent": 1}}
{"question":5, "answer":3, "negative": false}
{ "index": { "parent": 1}}
{"question":6, "answer":6, "negative": false}
{ "index": { "parent": 1}}
{"question":7, "answer":6, "negative": false}
{ "index": { "parent": 2}}
{"question":5, "answer":6, "negative": false}
{ "index": { "parent": 2}}
{"question":5, "answer":3, "negative": false}
{ "index": { "parent": 2}}
{"question":6, "answer":1, "negative": false}
{ "index": { "parent": 2}}
{"question":7, "answer":2, "negative": false}
{ "index": { "parent": 3}}
{"question":5, "answer":6, "negative": false}
{ "index": { "parent": 3}}
{"question":5, "answer":2, "negative": false}
{ "index": { "parent": 3}}
{"question":6, "answer":4, "negative": false}
{ "index": { "parent": 3}}
{"question":7, "answer":3, "negative": false}
'

Now I want to create a count of all users split by their questions answers

So create a breakdown of users and all their question answers permutations:

{Q:1,A:1} and at same time {Q2, A:1}, then {Q:1,A:2} and at same time {Q2, A:1} ... and so on.

Conceptually something like this aggregation:

curl -X POST "localhost:9200/test_example/user/_search?pretty&size=1"
-H 'Content-Type: application/json' -d' {   "query": {},   "aggs": {
    "permutations": {
      "children": {
        "type": "answer"
      },
      "aggs": {
        "filtered": {
          "filter": {
            "bool": {
              "should": [],
              "must_not": [],
              "must": [
                {
                  "term": {
                    "question": 5
                  }
                }
              ]
            }
          },
          "aggs": {
            "5": {
              "terms": {
                "field": "answer",
                "size": 500
              },
              "aggs": {
                "filtered": {
                  "filter": {
                    "bool": {
                      "should": [],
                      "must_not": [],
                      "must": [
                        {
                          "term": {
                            "question": 6
                          }
                        }
                      ]
                    }
                  },
                  "aggs": {
                    "6": {
                      "terms": {
                        "field": "answer",
                        "size": 500
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }   } } '

I understand why this doesn't work, because one bucket excludes the documents that I would like to have in sub bucket, but I'm wondering if there are other approaches to the problem reverse nested aggregation could do this, but I have child documents instead of nested ones.

One approach that I could take would be maybe storing all the answers on user object in form that I added on user 4, then something like that would work fine:

curl -X POST "localhost:9200/test_example/user/_search?pretty&size=1" -H 'Content-Type: application/json' -d'
{
  "query": {},
  "aggs": {
    "filtered": {
      "filter": {
        "bool": {
          "should": [],
          "must_not": [],
          "must": []
        }
      },
      "aggs": {
        "5": {
          "terms": {
            "field": "answers.5",
            "size": 500
          },
          "aggs": {
            "filtered": {
              "filter": {
                "bool": {
                  "should": [],
                  "must_not": [],
                  "must": []
                }
              },
              "aggs": {
                "6": {
                  "terms": {
                    "field": "answers.6",
                    "size": 500
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
'

Following query would generate exactly what I want, my main issue is that there are 2000 question ID's so I'm worried that for few million user rows, Elasticsearch might blow up for me (some users might have answered 500-700 distinct questions). Are there any approaches I could use to work with the schema where answers are child documents? Or maybe there is a way to check if ES can handle documents with potentially many keys in them.


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.