Calculate the terms delta between the outputs of 2 terms aggregations


(Zhengshencn) #1

Hi
I'm wondering if it's possible to run 2 terms aggregations and get the top N delta terms (sorted by occurrence count) between their outputs, all within a single query?

I have the following data in my index. It stores the words used (in the array word_list) in each document (identified by doc_id), and each document belongs to a user (identified by user_id):

[
    {
        "user_id": 10,
        "doc_id": 200,
        "word_list": [
            "word1",
            "word2",
            "word3"
        ]
    },
    {
        "user_id": 10,
        "doc_id": 210,
        "word_list": [
            "word1",
            "word5"
        ]
    },
    {
        "user_id": 20,
        "doc_id": 401,
        "word_list": [
            "word2",
            "word5",
            "word6",
            "word7"
        ]
    }
]

I need to get the top N words that occur in documents of users other than user_id 10 and that user_id 10 has never used in any of his/her documents. In other words, a word qualifies if at least one other user has used it and user_id 10 has not. In the above example, the expected final result is [word6, word7].
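To make the expected result concrete, here is a minimal sketch in plain Python that computes it directly from the sample documents above. It does not touch Elasticsearch at all; the function name top_n_unused_words is made up for illustration, and I am assuming "occurrence count" means the number of documents containing the word (which is what a terms aggregation's doc_count reports).

```python
from collections import Counter

# The three sample documents from the post.
docs = [
    {"user_id": 10, "doc_id": 200, "word_list": ["word1", "word2", "word3"]},
    {"user_id": 10, "doc_id": 210, "word_list": ["word1", "word5"]},
    {"user_id": 20, "doc_id": 401, "word_list": ["word2", "word5", "word6", "word7"]},
]


def top_n_unused_words(docs, user_id, n):
    """Top-n words used by other users that `user_id` never used.

    Hypothetical helper: ranks by number of documents containing the word,
    breaking ties alphabetically so the output is deterministic.
    """
    used = set()        # every word user_id has used in any document
    others = Counter()  # word -> number of other users' documents containing it
    for doc in docs:
        words = set(doc["word_list"])  # count each word once per document
        if doc["user_id"] == user_id:
            used |= words
        else:
            others.update(words)
    candidates = [(w, c) for w, c in others.items() if w not in used]
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in candidates[:n]]


result = top_n_unused_words(docs, user_id=10, n=20)
print(result)
```

For the sample data this gives word6 and word7, since word2 and word5 were both used by user_id 10.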

So far I have to run 2 terms aggregations, one to get the full unique word list of user_id 10 and one for all user_ids other than 10, and then compute the delta in Java. The problem is that, to be sure I get the top N, I first have to fetch the result of the aggregation on user_id 10 (instead of running the 2 in parallel), count the unique words belonging to user_id 10 (say there are M of them), and then use M+N as the size limit for the next aggregation on the word list of user_id != 10. The extra M is needed because, in the worst case, all M of user_id 10's words rank at the top of the other users' list and get filtered out. In most cases M can be very large, while N is usually just 20 or 40. It feels like a lot of memory and network traffic is wasted this way (tens of thousands of words in interim results are returned to the client during the process).
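The two-step flow I describe can be sketched roughly as below, in Python rather than Java to keep it short. The request bodies are plain dicts in the Elasticsearch query DSL; the helper names, the aggregation name "words", and the sample bucket lists are all made up for illustration, and I am assuming word_list is indexed as exact (not analyzed) terms. No client call is made here, the bucket lists just mimic the shape of a terms aggregation response.

```python
def user_words_request(user_id):
    """Step 1: all unique words of one user."""
    return {
        "size": 0,  # top-level size=0: return no hits, only aggregations
        "query": {"term": {"user_id": user_id}},
        # size=0 inside a terms aggregation meant "all buckets" on older
        # Elasticsearch versions (it was removed in 5.0).
        "aggs": {"words": {"terms": {"field": "word_list", "size": 0}}},
    }


def other_words_request(user_id, m, n):
    """Step 2: top M+N words of everyone except `user_id`."""
    return {
        "size": 0,
        "query": {"bool": {"must_not": [{"term": {"user_id": user_id}}]}},
        "aggs": {"words": {"terms": {"field": "word_list", "size": m + n}}},
    }


def top_n_delta(user_buckets, other_buckets, n):
    """Drop the user's own words from the others' buckets, keep the first n."""
    used = {b["key"] for b in user_buckets}
    return [b["key"] for b in other_buckets if b["key"] not in used][:n]


# Buckets as they would look for the sample data (illustrative, hand-written).
user_buckets = [
    {"key": "word1", "doc_count": 2},
    {"key": "word2", "doc_count": 1},
    {"key": "word3", "doc_count": 1},
    {"key": "word5", "doc_count": 1},
]
other_buckets = [
    {"key": "word2", "doc_count": 1},
    {"key": "word5", "doc_count": 1},
    {"key": "word6", "doc_count": 1},
    {"key": "word7", "doc_count": 1},
]

m = len(user_buckets)  # M: unique words of user_id 10
second_request = other_words_request(10, m, n=2)
delta = top_n_delta(user_buckets, other_buckets, n=2)
print(delta)
```

The second request cannot be built until the first response is back, because its size depends on M. That dependency is exactly what forces the two round trips.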

By the way, even if I run the 2 aggregations in parallel (with size=0 in the aggs to get the full list of unique words), the overall performance is still not good. I guess that is because too many words are returned to, and processed on, the client side.

Is there any way to fulfill my requirement within a single query? Or is there a more efficient way to do this?
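For reference, one way to express the parallel variant is a single request with two filter aggregations, each wrapping the same terms aggregation. This is only a sketch of the request body as a Python dict: the aggregation names target_user, other_users and words are made up, and size=0 inside the terms aggregation only means "all buckets" on older Elasticsearch versions.

```python
def combined_request(user_id):
    """One request, two filter aggregations, each with an unbounded terms agg."""
    def all_words():
        # Build a fresh dict each time so the two sub-aggregations do not
        # share one mutable object.
        return {"terms": {"field": "word_list", "size": 0}}

    return {
        "size": 0,  # no hits, aggregations only
        "aggs": {
            "target_user": {
                "filter": {"term": {"user_id": user_id}},
                "aggs": {"words": all_words()},
            },
            "other_users": {
                "filter": {"bool": {"must_not": [{"term": {"user_id": user_id}}]}},
                "aggs": {"words": all_words()},
            },
        },
    }


body = combined_request(10)
print(sorted(body["aggs"]))
```

This saves a round trip, but both complete bucket lists still travel to the client, so the delta is still computed on the client side over the full word lists.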

Thank you in advance!

Zheng


(Zhengshencn) #2

Any comments, please?


(jigish thakar) #3

Why can't we just use where user_id != 10?

