Significant term aggregation on nested query


#1

In my index I have indexed the same text twice. The first field, keywordtext, holds the entire text as a keyword datatype. The second, sentences, is a nested datatype in which the text is split into sentences, each with its own start time and end time. I would like to search for a term in my dataset and then run a significant terms aggregation to retrieve the words correlated with that term. In ES 2.x I could do this by mapping the entire text field as a string datatype with term vectors enabled. With ES 5.x I tried a full-text query against the nested field followed by a significant terms aggregation. The request returns no errors, but it does not seem to work:

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "sentences",
            "query": {
              "match": {
                "sentences.value": {
                  "query": "stato"
                }
              }
            }
          }
        },
        {
          "range": {
            "date": {
              "from": "19/09/1989",
              "to": "14/02/2017",
              "format": "dd/MM/yyyy"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "frequentTerms": {
      "significant_terms": {
        "field": "keywordtext"
      }
    }
  }
}
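
For reference, the mapping described above might look roughly like the sketch below. The type name doc and the start and end field names with their types are assumptions; only keywordtext, sentences, sentences.value and date appear in the request. Note that a keyword field stores the whole text as one single term, so a significant_terms aggregation on keywordtext can only return entire texts, never individual words, which would explain the results seen here.

```json
{
  "mappings": {
    "doc": {
      "properties": {
        "date": { "type": "date" },
        "keywordtext": { "type": "keyword" },
        "sentences": {
          "type": "nested",
          "properties": {
            "value": { "type": "text" },
            "start": { "type": "float" },
            "end": { "type": "float" }
          }
        }
      }
    }
  }
}
```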

How can I do the same thing with this version of Elasticsearch?


(Mark Harwood) #2

Try wrapping the significant_terms agg in a nested agg:

DELETE test
PUT test
{
  "settings": {
	"number_of_shards": 1,
	"number_of_replicas": 0
  }, 
  "mappings": {
	"doc": {
	  "properties": {
		"play": {
		  "type": "keyword"
		},
		"author": {
		  "type": "keyword"
		},
		"sentences":{
		  "type":"nested",
		  "properties": {
			"text":{
			  "type":"text",
			  "fielddata":true
			}
		  }
	  
		}
	  }
	}
  }
}
POST /test/doc/_bulk
{"index": {}}
{"sentences" : [ {"text" : "filler"}]}
{"index": {}}
{"sentences" : [ {"text" : "filler"}]}
{"index": {}}
{"sentences" : [ {"text" : "filler"}]}
{"index": {}}
{"sentences" : [ {"text" : "filler"}]}
{"index": {}}
{"sentences" : [ {"text" : "filler"}]}
{"index": {}}
{"sentences" : [ {"text" : "filler"}]}

POST test/doc
{
  "play": "macbeth",
  "author": "shakespeare",
  "sentences": [
	{
	  "text": "funny - there's a knife missing from the dishwasher"
	},
	{
	  "text": "where'd you get that knife?"
	},
	{
	  "text": "put the knife down - you'll hurt someone you bloody fool"
	}

  ]
}
GET test/_search
{
  "query": {
	"match": {
	  "author": "shakespeare"
	}
  },
  "size": 0,
  "aggs": {
	"shakespeare keywords": {
	  "nested": {
		"path": "sentences"
	  },
	  "aggs": {
		"foo": {
		  "significant_terms": {
			"field": "sentences.text"
		  }
		}
	  }
	}
  }
}
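
Applied to the field names from the original question, the same pattern might look like the sketch below, sent as the body of a _search request. It is an adaptation, not part of the tested example above, and it assumes sentences.value is a text field mapped with fielddata enabled. The extra filter aggregation (the names in_sentences and matching_sentences are made up) restricts the foreground set to the sentences that actually contain the term; without it, every sentence of every matching document is counted.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "sentences",
            "query": { "match": { "sentences.value": "stato" } }
          }
        },
        {
          "range": {
            "date": {
              "gte": "19/09/1989",
              "lte": "14/02/2017",
              "format": "dd/MM/yyyy"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "in_sentences": {
      "nested": { "path": "sentences" },
      "aggs": {
        "matching_sentences": {
          "filter": { "match": { "sentences.value": "stato" } },
          "aggs": {
            "frequentTerms": {
              "significant_terms": { "field": "sentences.value" }
            }
          }
        }
      }
    }
  }
}
```

Because every sentence in the foreground set contains the searched term, expect that term itself among the top buckets.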

#3

Thanks for the answer, but I would like to avoid fielddata because of its RAM consumption. I have to manage a very large dataset.


(Mark Harwood) #4

Funny you should mention that. I'm currently working on adding exactly that in the form of a new significant_text aggregation.

Unlike significant_terms, it does not rely on fielddata and can strip out noisy repeated text that otherwise skews stats.
The bad news for you, I suspect, is that it will not work on nested docs.
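
For reference, this is roughly the request shape the aggregation took when it was released in Elasticsearch 6.0. The sketch assumes a top-level (non-nested) text field, here given the hypothetical name fulltext, with no fielddata enabled. The sampler wrapper limits the analysis to the top-scoring documents on each shard, which matters because significant_text re-analyzes the source text at query time.

```json
{
  "size": 0,
  "query": {
    "match": { "fulltext": "stato" }
  },
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 100 },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "fulltext",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}
```

The filter_duplicate_text option corresponds to the stripping of noisy repeated text mentioned above.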

Cheers
Mark


#5

Thanks Mark, that's good news! Your current work is very relevant to my job, and last week's representation of DBpedia entities on a graph was also very useful. Do you have a blog where I can follow you?


(Mark Harwood) #6

Good to hear. Please add comments to the GitHub issue for significant_text if you have any suggestions.

I don't have a blog, but I have various demos on a YouTube channel.
I hope to add a video on the new significant_text agg once it's merged into the master branch.

