Word Count

Hi,

I am trying to do the following:

  1. Perform a standard search on a series of documents with a given search term. (easy enough)
  2. Count the number of times each of another series of words appears within that result set.

The query is being performed on the "body_text" field.

Example:

Search for the word "cooking" and, within the result set, count the number of times the word "egg" appears.

Any ideas?

One obvious solution is to perform two separate queries. But I am wondering if it's possible to do this in one call.

Thanks.

Hi shampoo,
You can see how any number of arbitrary queries overlap using the adjacency_matrix aggregation.
A visualization of the results might look like this:

[Image: Kibana-32]

The circles and lines are sized by the number of documents with at least one occurrence (not the number of repeated occurrences within documents).

The query that provides the information behind this:

GET reviews/_search
{
  "size": 0,
  "timeout": "30s",
  "query": {
	"bool": {
	  "must": [
		{
		  "match": {
			"comments": "cooking"
		  }
		},
		{
		  "bool": {
			"should": [
			  {
				"match": {
				  "comments": "egg"
				}
			  },
			  {
				"match": {
				  "comments": "chips"
				}
			  },
			  {
				"match": {
				  "comments": "ham"
				}
			  }
			]
		  }
		}
	  ]
	}
  },
  "aggs": {
	"my_food_matrix": {
	  "adjacency_matrix": {
		"filters": {
		  "cooking": {
			"match": {
			  "comments": "cooking"
			}
		  },
		  "egg": {
			"match": {
			  "comments": "egg"
			}
		  },
		  "chips": {
			"match": {
			  "comments": "chips"
			}
		  },
		  "ham": {
			"match": {
			  "comments": "ham"
			}
		  }
		}
	  }
	}
  }
}
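
To read the result, here is a minimal Python sketch. It assumes the usual adjacency_matrix response shape: a "buckets" list of "key"/"doc_count" pairs, with intersection keys joined by the default "&" separator. The helper name pair_doc_count and the numbers in sample_response are made up for illustration; they are not part of any client library or real output.

```python
def pair_doc_count(response, agg_name, a, b, separator="&"):
    """Return the doc_count of the bucket where filters a and b intersect.

    Returns 0 when no such bucket is present in the response.
    """
    buckets = response["aggregations"][agg_name]["buckets"]
    wanted = {a, b}
    for bucket in buckets:
        # Compare as sets so the order of names inside the key does not matter.
        if set(bucket["key"].split(separator)) == wanted:
            return bucket["doc_count"]
    return 0


# Hand-written sample with invented counts, shaped like a search response body.
sample_response = {
    "aggregations": {
        "my_food_matrix": {
            "buckets": [
                {"key": "chips", "doc_count": 7},
                {"key": "chips&cooking", "doc_count": 7},
                {"key": "cooking", "doc_count": 40},
                {"key": "cooking&egg", "doc_count": 12},
                {"key": "egg", "doc_count": 12},
            ]
        }
    }
}

print(pair_doc_count(sample_response, "my_food_matrix", "cooking", "egg"))
```

Note that these are still document counts: a document mentioning "egg" five times contributes 1 to the "cooking&egg" bucket.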

Hi,

Thanks so much for the reply. If I understand correctly, this would return the number of documents in which the query finds a match. I would need to know the actual word count within those documents.

Thanks again

J

That's more expensive and generally not something we offer: the count could be skewed heavily by a single spammy document that does keyword stuffing.
That said, the information is stored in the index, and if you want to deep-dive on it you can use the explain API to get the TF (term frequency) for a word in a document, among other scoring factors.
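
As a rough sketch of what that deep-dive could look like in Python: the explain API is called per document, and its response carries a nested "explanation" tree of "value" / "description" / "details" nodes. The helper below walks such a tree and collects the values of nodes that look like a term frequency. The description prefixes it matches ("termFreq=" and "freq,") are an assumption about how the frequency node is labelled; the wording differs between versions and similarity settings, so check it against a real response. The function name and the sample tree are invented for illustration.

```python
def term_frequencies(explanation):
    """Collect term-frequency values from an explain tree, depth-first."""
    found = []
    description = explanation.get("description", "")
    # Assumed labels for the frequency node; adjust to match your responses.
    if description.startswith(("termFreq=", "freq,")):
        found.append(explanation["value"])
    for child in explanation.get("details", []):
        found.extend(term_frequencies(child))
    return found


# Hand-written sample tree with invented values, not real output.
sample_explanation = {
    "value": 1.2,
    "description": "weight(comments:egg in 0), result of:",
    "details": [
        {
            "value": 1.2,
            "description": "score(freq=3.0), computed from:",
            "details": [
                {"value": 0.5, "description": "idf, computed from:", "details": []},
                {
                    "value": 0.7,
                    "description": "tf, computed as freq / (freq + ...) from:",
                    "details": [
                        {
                            "value": 3.0,
                            "description": "freq, occurrences of term within document",
                            "details": [],
                        }
                    ],
                },
            ],
        }
    ],
}

print(term_frequencies(sample_explanation))
```

Summing these values over every document in the result set means one explain call per document, which is why this gets expensive quickly.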