Hi I have following schema of user objects:
curl -X PUT "localhost:9200/test_example" -H 'Content-Type: application/json' -d'
{
"mappings": {
"answer" : {
"_parent" : {
"type" : "user"
}
}
}
}
'
curl -X PUT "localhost:9200/test_example/user/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index": {"_id": 1} }
{ "user": 1, "foo": 1}
{ "index": {"_id": 2} }
{ "user": 2, "foo": 2}
{ "index": {"_id": 3} }
{ "user": 3, "foo": 1}
{ "index": {"_id": 4} }
{ "user": 4, "foo": 1, "answers": {"5": [6,3], "6":[6], "7":[6]}, "potential_approach": true}
'
curl -X PUT "localhost:9200/test_example/answer/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index": { "parent": 1}}
{"question":5, "answer":6, "negative": false}
{ "index": { "parent": 1}}
{"question":5, "answer":3, "negative": false}
{ "index": { "parent": 1}}
{"question":6, "answer":6, "negative": false}
{ "index": { "parent": 1}}
{"question":7, "answer":6, "negative": false}
{ "index": { "parent": 2}}
{"question":5, "answer":6, "negative": false}
{ "index": { "parent": 2}}
{"question":5, "answer":3, "negative": false}
{ "index": { "parent": 2}}
{"question":6, "answer":1, "negative": false}
{ "index": { "parent": 2}}
{"question":7, "answer":2, "negative": false}
{ "index": { "parent": 3}}
{"question":5, "answer":6, "negative": false}
{ "index": { "parent": 3}}
{"question":5, "answer":2, "negative": false}
{ "index": { "parent": 3}}
{"question":6, "answer":4, "negative": false}
{ "index": { "parent": 3}}
{"question":7, "answer":3, "negative": false}
'
Now I want to create a count of all users split by their questions answers
So create a breakdown of users and all their question answers permutations:
{Q:1,A:1} and at same time {Q2, A:1}, then {Q:1,A:2} and at same time {Q2, A:1} ... and so on.
Conceptually something like this aggregation:
curl -X POST "localhost:9200/test_example/user/_search?pretty&size=1"
-H 'Content-Type: application/json' -d' { "query": {}, "aggs": {
"permutations": {
"children": {
"type": "answer"
},
"aggs": {
"filtered": {
"filter": {
"bool": {
"should": [],
"must_not": [],
"must": [
{
"term": {
"question": 5
}
}
]
}
},
"aggs": {
"5": {
"terms": {
"field": "answer",
"size": 500
},
"aggs": {
"filtered": {
"filter": {
"bool": {
"should": [],
"must_not": [],
"must": [
{
"term": {
"question": 6
}
}
]
}
},
"aggs": {
"6": {
"terms": {
"field": "answer",
"size": 500
}
}
}
}
}
}
}
}
}
} } } '
I understand why this doesn't work, because one bucket excludes the documents that I would like to have in sub bucket, but I'm wondering if there are other approaches to the problem reverse nested aggregation could do this, but I have child documents instead of nested ones.
One approach that I could take would be maybe storing all the answers on user object in form that I added on user 4, then something like that would work fine:
curl -X POST "localhost:9200/test_example/user/_search?pretty&size=1" -H 'Content-Type: application/json' -d'
{
"query": {},
"aggs": {
"filtered": {
"filter": {
"bool": {
"should": [],
"must_not": [],
"must": []
}
},
"aggs": {
"5": {
"terms": {
"field": "answers.5",
"size": 500
},
"aggs": {
"filtered": {
"filter": {
"bool": {
"should": [],
"must_not": [],
"must": []
}
},
"aggs": {
"6": {
"terms": {
"field": "answers.6",
"size": 500
}
}
}
}
}
}
}
}
}
}
'
Following query would generate exactly what I want, my main issue is that there are 2000 question ID's so I'm worried that for few million user rows, Elasticsearch might blow up for me (some users might have answered 500-700 distinct questions). Are there any approaches I could use to work with the schema where answers are child documents? Or maybe there is a way to check if ES can handle documents with potentially many keys in them.