Query hierarchical data

I'm working on creating and simplifying probabilistic topic models for large corpora of data. Every document I index will contain a field with the following format:

...
"topics" : {
"level0" : ["keywords"],
"level1" : ["keywords"],
"level2" : ["keywords"]
}
...

I want to write a query that, given a document id (let us call this doc D), returns the documents (let us call all possible hits H) that are similar based on this field. To get a match, at least one keyword from any level of D has to be present in any level of H. The score of each hit should be higher if the shared keyword is at a lower level in D. It should also be higher if the shared keyword is at a lower level in H.

I'm currently using the following query:

"query": {
	"bool" : {
		"should" : [
			{
				"multi_match" : {
					"boost" : 3,
					"query" : "Keyword_A",
					"fields" : [ "topics.l0", "topics.l1", "topics.l2" ]
				}
			},
			{
				"multi_match" : {
					"boost" : 2,
					"query" : "Keyword_B",
					"fields" : [ "topics.l0", "topics.l1", "topics.l2" ]
				}
			},
			{
				"multi_match" : {
					"query" : "Keyword_C",
					"fields" : [ "topics.l0", "topics.l1", "topics.l2" ]
				}
			}
		]
	}
}
# For now, Keyword_A, Keyword_B and Keyword_C are taken from **D** manually, as I don't know how to fetch these fields directly into a query.

This is used in combination with an index-time boost on each of the fields.

Is there a better way for me to define this query or the score?

Thanks in advance

If the data is held in a hierarchy the "trunk" branches near the root of the tree will appear in the index more frequently than the deeper branches towards the leaf end of the tree (you have to traverse from the root to leaves of the tree after all).
All of this means that Lucene's natural tendency to reward rare terms over common terms (aka IDF) should be taking effect in searches and you shouldn't need to supply boosts for the different levels - it should already know how much they are worth. Is this not the case in your tests?

The data is a hierarchy in the sense that the lower levels of keywords are more representative than the upper levels, but, for now, the structure it follows is the same as shown at the beginning of this post. As we are only interested in a boolean match, in which the keyword is either present or not, IDF is not the type of scoring I'm looking for. Is there a better way to write the query without using both query-time and index-time boosting? Maybe a general multi-term query followed by a scoring function?

Essentially providing true or false decisions for docs then?

So the above is no longer a requirement?

A mix of both. We have three levels, and each of them holds an array of keywords. The problem is ranking the documents that share ANY of the keywords at any level. Sharing a keyword is a boolean match: the document either has a keyword in common or it doesn't. But, in order to rank the documents, the score for each match should take into account the level at which the shared keyword lies in both the source and the target document (remember, the lower the level, the more important a keyword is for the document).
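One way to express this kind of present-or-absent match with level-dependent weights is to generate one `constant_score` clause per (keyword, target field) pair, so that IDF plays no part in the score. The following is only a minimal sketch under several assumptions: the helper name `build_similarity_query` and the weights in `LEVEL_WEIGHTS` are hypothetical, level 0 is assumed to be the most important level, and the `topics.l0` to `topics.l2` fields are assumed to be mapped so that a `term` query matches a whole keyword exactly.

```python
from typing import Dict, List

# Hypothetical weights, assuming index 0 is the most important level.
LEVEL_WEIGHTS = [4.0, 2.0, 1.0]
# Field names as used in the query earlier in this thread.
FIELDS = ["topics.l0", "topics.l1", "topics.l2"]


def build_similarity_query(source_topics: Dict[str, List[str]]) -> dict:
    """Build a bool/should query from the topics of the source document D.

    source_topics maps a level name ("l0", "l1", "l2") to its keywords.
    """
    should = []
    for src_level, src_field in enumerate(FIELDS):
        level_key = src_field.split(".")[1]  # "topics.l0" -> "l0"
        for keyword in source_topics.get(level_key, []):
            for tgt_level, tgt_field in enumerate(FIELDS):
                should.append({
                    "constant_score": {
                        # Filter context: match or no match, no IDF.
                        "filter": {"term": {tgt_field: keyword}},
                        # The boost combines the level in D and the level in H.
                        "boost": LEVEL_WEIGHTS[src_level] * LEVEL_WEIGHTS[tgt_level],
                    }
                })
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    import json
    doc_d = {"l0": ["Keyword_A"], "l1": ["Keyword_B"], "l2": ["Keyword_C"]}
    print(json.dumps(build_similarity_query(doc_d), indent=2))
```

The keywords of D still have to be fetched first, for example with a get request on the document id. Elasticsearch's `terms` query also offers a terms lookup that reads the values from another document, which may remove that step, but it is not used in this sketch.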

As an example, think of three documents A, B and C with the following keyword structure:

 A.topics = { l0=[T1], l1=[T2], l2=[T3] }
 B.topics = { l0=[T2], l1=[T1], l2=[T3] }
 C.topics = { l0=[T1], l1=[T3], l2=[T2] }

When retrieving documents similar to A, even though all documents share the same keywords, doc C should be a better match than B, because the topics it shares with A sit at lower levels.
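The ranking in this example can be checked offline with a small scoring function. This is only a sketch, assuming that `l0` is the most important level and that the contribution of each shared keyword is the product of a weight for its level in the source and a weight for its level in the target. The function name `similarity` and the weights 4, 2, 1 are hypothetical.

```python
# Hypothetical per-level weights, assuming l0 is the most important level.
LEVEL_WEIGHTS = {"l0": 4.0, "l1": 2.0, "l2": 1.0}


def similarity(source: dict, target: dict) -> float:
    """Sum, over every shared keyword, the product of its level weights."""
    score = 0.0
    for src_level, src_keywords in source.items():
        for tgt_level, tgt_keywords in target.items():
            shared = set(src_keywords) & set(tgt_keywords)
            score += len(shared) * LEVEL_WEIGHTS[src_level] * LEVEL_WEIGHTS[tgt_level]
    return score


A = {"l0": ["T1"], "l1": ["T2"], "l2": ["T3"]}
B = {"l0": ["T2"], "l1": ["T1"], "l2": ["T3"]}
C = {"l0": ["T1"], "l1": ["T3"], "l2": ["T2"]}

if __name__ == "__main__":
    print("A vs B:", similarity(A, B))  # 8 + 8 + 1 = 17.0
    print("A vs C:", similarity(A, C))  # 16 + 2 + 2 = 20.0
```

With these weights C scores 20 against 17 for B. With linear weights of 3, 2, 1 both documents would tie at 13, so under this product scheme the weights need to grow faster than linearly for C to rank first.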
