Sub aggregations on bucket key

Michael_Jackson1 · March 24, 2020, 2:05am

Hello everyone, new member here. I have been testing the path_hierarchy tokenizer. Just looking for advice with my situation.

I have a simple index with docs that look like this:

{
    "keyword": "stuff and things",
    "path": "/foo/bar/baz/",
    "views": 14,
    "members": 40,
    ....etc
}

The index contains a little over 1 million documents.
For path depth, there are 15 different level 1 values, ~500 different level 2 values, and the count keeps growing with each level. Paths stop somewhere between 2 and 14 levels deep; most of the data stops around level 7.
The goal is to be able to search through each level, starting at the top 15 groups, and see the document counts as well as aggregate the associated metrics.
The problem I am running into is that when I use path_hierarchy on all the data, I run out of resources because the amount of data is rather large. I was trying to aggregate the top level, select one of the keys, and then use the path_hierarchy tokenizer to aggregate the next level down. So far I have been unable to construct a query that does this, and I suspect I will run into a situation where the higher levels are returned as well.
I am still working on figuring this all out, but was hoping for a little insight from more experienced Elasticsearch users. Any help/guidance would be greatly appreciated, as I am not entirely sure how to approach this. I have thought about adding a level attribute to the documents so I can limit the results to a range of levels as I traverse down the paths. Again, I am not sure if that is the best approach and would love some input.
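
A minimal sketch of the level-attribute idea, in Python, assuming the documents are enriched before indexing. The function names and the fields path_depth and path_ancestors are hypothetical, chosen only for illustration; they are not part of Elasticsearch or of any plugin.

```python
def path_levels(path, separator="/"):
    """Return (depth, ancestors) for a path such as "/foo/bar/baz/"."""
    # Drop empty segments produced by leading/trailing separators.
    parts = [p for p in path.split(separator) if p]
    # Build every prefix of the path, from the top level down to the full path.
    ancestors = [
        separator + separator.join(parts[:i]) for i in range(1, len(parts) + 1)
    ]
    return len(parts), ancestors


def enrich(doc):
    """Copy a document and add the hypothetical path_depth / path_ancestors fields."""
    depth, ancestors = path_levels(doc["path"])
    return {**doc, "path_depth": depth, "path_ancestors": ancestors}


print(path_levels("/foo/bar/baz/"))
# (3, ['/foo', '/foo/bar', '/foo/bar/baz'])
```

With a numeric depth stored on each document, a range filter could limit results to the levels of interest, at the cost of extra fields per document.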
cheers!

spinscale · March 24, 2020, 4:35pm

Hey,

Would it be possible to provide a full reproduction, including index creation/mapping, a few sample documents, and sample queries, along with which documents you expect and do not expect back from those? This would make things a lot simpler to follow.

--Alex

Michael_Jackson1 · March 24, 2020, 5:59pm

Index creation:

PUT file_path
{
  "mappings": {
    "properties": {
      "keyword": {
        "type": "text"
      },
      "search_volume": {
        "type": "integer"
      },
      "cpc": {
        "type": "integer"
      },
      "score": {
        "type": "float"
      },
      "category": {
        "type": "keyword"
      },
      "category_path": {
        "type": "keyword"
      },
      "position":{
        "type": "float"
      },
      "url": {
        "type": "keyword"
      }
    }
  }
}

Some sample documents:

{
        "_index" : "file_path",
        "_type" : "_doc",
        "_id" : "Tgk1-nABTUHKytkMmmyr",
        "_score" : 10.267364,
        "_source" : {
          "keyword" : "mp4ba movies",
          "search_volume" : 20,
          "cpc" : 0,
          "score" : 24.26,
          "category" : "Movies",
          "category_path" : "/Business/Arts_and_Entertainment/Models/Individual/B/Bellucci,_Monica/Movies",
          "position" : 20.0,
          "url" : "http://mysitetester.com/movies"
        }
},
{
        "_index" : "file_path",
        "_type" : "_doc",
        "_id" : "5Ak1-nABTUHKytkMGw6m",
        "_score" : 10.267364,
        "_source" : {
          "keyword" : "marital infidelity movies",
          "search_volume" : 50,
          "cpc" : 0,
          "score" : 39.2635,
          "category" : "Movies",
          "category_path" : "/Arts/Movies/Studios/Warner_Bros./Movies",
          "position" : 17.0,
          "url" : "http://mysitetester.com/movies"
        }
 },
{
        "_index" : "file_path",
        "_type" : "_doc",
        "_id" : "8wk1-nABTUHKytkMGw6n",
        "_score" : 10.267364,
        "_source" : {
          "keyword" : "devotional movies",
          "search_volume" : 480,
          "cpc" : 0,
          "score" : 56.548,
          "category" : "Movies",
          "category_path" : "/Arts/Movies/Studios/Warner_Bros./Movies",
          "position" : 9.0,
          "url" : "http://mysitetester.com/movies"
        }
}

The query I am using to aggregate the top level of the category_path:

GET file_path/_search?size=0
{
  "aggs": {
    "tree": {
      "path_hierarchy": {
        "field": "category_path",
        "separator": "/",
        "max_depth": 0,
        "size": 30
      },
      "aggs": {
        "search_volumes": {
          "avg": {
            "field": "search_volume"
          }
        }
      }
    }
  }
}

And its response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "tree" : {
      "buckets" : [
        {
          "key" : "World",
          "doc_count" : 308153,
          "path" : [ ],
          "search_volumes" : {
            "value" : 53.68258624774057
          }
        },
        {
          "key" : "Arts",
          "doc_count" : 206891,
          "path" : [ ],
          "search_volumes" : {
            "value" : 63.65835149909856
          }
        },
        {
          "key" : "Regional",
          "doc_count" : 107047,
          "path" : [ ],
          "search_volumes" : {
            "value" : 58.696553850177956
          }
        },
        {
          "key" : "Business",
          "doc_count" : 90660,
          "path" : [ ],
          "search_volumes" : {
            "value" : 51.84634899624972
          }
        },
        {
          "key" : "Computers",
          "doc_count" : 82783,
          "path" : [ ],
          "search_volumes" : {
            "value" : 56.790524624621
          }
        },
        {
          "key" : "Society",
          "doc_count" : 64092,
          "path" : [ ],
          "search_volumes" : {
            "value" : 58.82512638082756
          }
        },
        {
          "key" : "Games",
          "doc_count" : 57919,
          "path" : [ ],
          "search_volumes" : {
            "value" : 62.482432362437194
          }
        },
        {
          "key" : "Science",
          "doc_count" : 46828,
          "path" : [ ],
          "search_volumes" : {
            "value" : 55.13837874775775
          }
        },
        {
          "key" : "Reference",
          "doc_count" : 33955,
          "path" : [ ],
          "search_volumes" : {
            "value" : 55.791783242526876
          }
        },
        {
          "key" : "Sports",
          "doc_count" : 22052,
          "path" : [ ],
          "search_volumes" : {
            "value" : 60.563214220932345
          }
        },
        {
          "key" : "Recreation",
          "doc_count" : 18459,
          "path" : [ ],
          "search_volumes" : {
            "value" : 60.06121675063655
          }
        },
        {
          "key" : "Health",
          "doc_count" : 18455,
          "path" : [ ],
          "search_volumes" : {
            "value" : 67.62015713898673
          }
        },
        {
          "key" : "Shopping",
          "doc_count" : 11977,
          "path" : [ ],
          "search_volumes" : {
            "value" : 61.9195123987643
          }
        },
        {
          "key" : "Home",
          "doc_count" : 9212,
          "path" : [ ],
          "search_volumes" : {
            "value" : 64.33239253148068
          }
        },
        {
          "key" : "News",
          "doc_count" : 689,
          "path" : [ ],
          "search_volumes" : {
            "value" : 60.711175616835995
          }
        }
      ]
    }
  }
}

I am using this plugin:
https://github.com/opendatasoft/elasticsearch-aggregation-pathhierarchy

The query above is what I am using to see the top level, by setting max_depth: 0. The goal is to be able to query through each level under a given top level. For example, given these paths:

/Business/Arts_and_Entertainment/Models/Individual/B/Bellucci,_Monica/Movies
/Business/Arts_and_Entertainment/Actors/Individual/H/Hanks,_Tom/Producer
/Business/Financial/Companies/Banks/Wells_Fargo

Going down one level from Business, the buckets returned would be Arts_and_Entertainment and Financial, with their respective document counts and whatever metrics are added to the query (e.g. sum of score, cpc, etc.).
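
One possible way to get only the direct children of a chosen bucket, sketched in Python without the plugin. It assumes a hypothetical keyword field, category_ancestors, that is filled at index time with every prefix of category_path (for example "/Business" and "/Business/Financial"). The query then uses a standard terms aggregation whose include pattern matches exactly one more path segment. The helper names are made up for this example, and the approach has not been tested against an index of this size.

```python
# Characters reserved in Lucene regular expressions (including optional operators).
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape reserved characters so the text is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in text)


def children_query(parent, metric_field="search_volume", size=50):
    """Build a search body that buckets the direct children of `parent`.

    Assumes the hypothetical keyword field `category_ancestors` exists.
    """
    parent = parent.rstrip("/")
    return {
        "size": 0,
        # Only look at documents that live somewhere under the parent path.
        "query": {"term": {"category_ancestors": parent}},
        "aggs": {
            "children": {
                "terms": {
                    "field": "category_ancestors",
                    # Parent path plus exactly one more segment.
                    "include": escape_lucene_regex(parent) + "/[^/]+",
                    "size": size,
                },
                "aggs": {"search_volumes": {"avg": {"field": metric_field}}},
            }
        },
    }


body = children_query("/Business")
print(body["aggs"]["children"]["terms"]["include"])
# /Business/[^/]+
```

The body would be sent to file_path/_search; picking one of the returned keys and passing it back in as parent would walk one level further down.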

I have tried to sub-aggregate a bucket using the bucket key returned, but have not gotten that query to work. What I have found is that I am not able to use _key with bucket_selector and buckets_path.

Let me know if there is anything else that would help explain my situation.
Thanks!

spinscale · March 30, 2020, 2:03pm

Ah, you're using a plugin; that is important information.

I've never used it. Since it is an external plugin, you could either hope for the authors to chime in or possibly ping them in the respective GitHub repo if you do not get an answer here.

system · April 27, 2020, 2:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
