Sum distinct numeric values in object datatype


#1

Hello,
I'm now indexing some newspaper documents. I have created some object fields in my index that collect all the entities founded in the text field and its frequencies in the same text field, E.G.:

          "people": [
      {
         "count": 2,
         "value": "Ermanno"
      },
      {
         "count": 2,
         "value": "Anna Finocchiaro"
      },
      {
         "count": 2,
         "value": "Roberto Calderoli"
      },
      {
         "count": 2,
         "value": "Silvio Berlusconi"
      },
      {
         "count": 2,
         "value": "Denis Verdini"
      },
      {
         "count": 2,
         "value": "Paolo Romani"
      },
      {
         "count": 2,
         "value": "Juncker"
      },
      {
         "count": 2,
         "value": "Federica Mogherini"
      },
      {
         "count": 4,
         "value": "Angela Merkel"
      },
      {
         "count": 2,
         "value": "Matteo Renzi"
      },
      {
         "count": 2,
         "value": "Junker"
      },
      {
         "count": 2,
         "value": "Beppe Grillo"
      },
      {
         "count": 4,
         "value": "Giancarlo Galan"
      },
      {
         "count": 2,
         "value": "Myrta Merlino"
      },
      {
         "count": 2,
         "value": "Yara Gambirasio"
      },
      {
         "count": 2,
         "value": "Francesco Dettori"
      },
      {
         "count": 2,
         "value": "John Kerry"
      },
      {
         "count": 2,
         "value": "Obama"
      },
      {
         "count": 2,
         "value": "Putin"
      },
      {
         "count": 2,
         "value": "Kuchma"
      },
      {
         "count": 6,
         "value": "Prandelli"
      },
      {
         "count": 2,
         "value": "Cesare"
      },
      {
         "count": 2,
         "value": "Chiellini"
      },
      {
         "count": 2,
         "value": "Pirlo"
      },
      {
         "count": 2,
         "value": "Balotelli"
      }
   ]

the mapping settings are these ones:

          "people":{
              "properties": {
                          "count": {
                                      "type": "integer",
                                      "doc_values": true,
                                      "index": true
        },
                        "value":{  
                                     "type": "text",
                                     "analyzer": "namedentities_analyzer",
                                    "fielddata": true
        }
                }
          }

Now I would like to make a query to retrieve all the people entity found in all the newspapers of the last two months and order them by the sums of their frequencies count. Something like this:

value: Bergoglio, sum_of_count: 256, doc_count:200
value: Berlusconi, sum_of_count: 239, doc_count: 180,
etc....
How i can do that? I have to change my data structure?


(Xavier Facq) #2

Hi,

In order to use aggregations, I think you'll have to convert your field "value" to not_analyzed ?
After, you should try to make some test queries and play with Aggregations.

bye,
Xavier


#3

HI, I have changed my index mapping according your suggestion, but unfortunately the result doesn't change. I have tried to produce the desired output concatenating two aggregation, the first one a term aggregation on location.value and for the second one I have tried sum aggregation, cardinality, and value_count, but I can't understand the results:

{
"_source": {
"includes": [ "organization.value", "organization.count" ],
"excludes": [ "text","vectterms", "sentence.sentence_text"]
},
"size": 0,
"aggs": {
"group_by_org": {
"terms": {
"field": "organization.value"
} ,
"aggs": {
"count_sum": {
"cardinality": {
"field": "organization.count"
}}}}}}

with cardinality aggregation, I have this strange output, where sometimes the value of the second aggregation is less than the value of doc_count - what is this value?:

"aggregations": {
"group_by_org": {
"doc_count_error_upper_bound": 2,
"sum_other_doc_count": 91,
"buckets": [
{
"key": "Consiglio comunale",
"doc_count": 7,
"count_sum": {
"value": 6
}
},
{
"key": "AA",
"doc_count": 5,
"count_sum": {
"value": 6
}
},
{
"key": "Cassazione",
"doc_count": 5,
"count_sum": {
"value": 6
}
},
{
"key": "Corte dei conti",
"doc_count": 5,
"count_sum": {
"value": 6
}
},
{
"key": "Metroweb",
"doc_count": 5,
"count_sum": {
"value": 6
}
},
{
"key": "Milan",
"doc_count": 5,
"count_sum": {
"value": 6
}
},
{
"key": "Minardi",
"doc_count": 5,
"count_sum": {
"value": 6
}

the other two aggregations tried, sum and value_count, return to me a very high value of this aggregation - In the same way of the previous query, replacing cardinality with sum and value_count, I think this is the number of all the organization entity in all document where appears that entity, but I can't understand well:

output of sum aggregation - I have only 9 documents in my index, and each entity return less than 10 times per document :

"buckets": [
{
"key": "Consiglio comunale",
"doc_count": 7,
"count_sum": {
"value": 332
}
},
{
"key": "AA",
"doc_count": 5,
"count_sum": {
"value": 301
}
},
{
"key": "Cassazione",
"doc_count": 5,
"count_sum": {
"value": 301
}
},
{
"key": "Corte dei conti",
"doc_count": 5,
"count_sum": {
"value": 301

output of value_count:

"buckets": [
{
"key": "Consiglio comunale",
"doc_count": 7,
"count_sum": {
"value": 105
}
},
{
"key": "AA",
"doc_count": 5,
"count_sum": {
"value": 80
}
},
{
"key": "Cassazione",
"doc_count": 5,
"count_sum": {
"value": 80
}
},
{
"key": "Corte dei conti",
"doc_count": 5,
"count_sum": {
"value": 80
}

Can someone help me?


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.