Determine Top Keywords from the fetched list of documents

This is a continuation of a topic I posted months ago; I was no longer able to comment there because the post got locked.

First, I would like to thank sir @Dadoonet for helping me out last time. I was able to make good use of the ES explain feature: I wrote a recursive function to fetch the specific "frequencies" array index from the ES response.

This time my specific requirement is to determine which of my keywords has the highest number of occurrences.

    "query_string" => [
        "default_field" => $content,
        "query" => $keywords

the $keywords variable contains:

("MCU" OR "Marvel" OR "Spiderman")

The problem with the explain feature is that it counts how often these words occur in EACH AND EVERY document, but NOT across the WHOLE LIST OF DOCUMENTS returned by the search query.


Since I'm using Laravel (a PHP-based framework), I've thought of looping through the ['hits']['hits'] part of the response.

But what if I get 10,000 documents (the default maximum size)? It would be too memory-intensive for the server to loop through ['hits']['hits'] and run my recursive function just to sum all the frequencies.

Is there an easier way for me to determine the top keywords from my list of fetched documents?

Check out the ‘significant text’ aggregation.

You can give it words of interest in the ‘include’ clause or let it discover the interesting words in your result set (the usual use case)
For each word of interest it will give:

  • the “foreground” count (number of docs matching the query with the word) and
  • the background count (number of docs in the index with the word)
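A minimal request might look like this (a sketch only, reusing the index and field names that come up later in this thread; `"size": 0` skips returning hits since only the aggregation is needed):

```json
GET web/_search
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "(\"MCU\" OR \"Marvel\" OR \"Spiderman\")"
    }
  },
  "size": 0,
  "aggs": {
    "KEYWORDS": {
      "significant_text": { "field": "content" }
    }
  }
}
```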

Awesome! Thank you so much for your response sir @Mark_Harwood!

However, the response of the significant_text aggs also includes other "significant" terms, but to me as the developer, and to the client, they're not significant at all.

And when I included the $keywords, it displays nothing



Is there any other option, sir?

Can you share the query? (Preferably formatted JSON, not images)

I'm now using a different user together with his saved filters; please refer to the aggs portion, sir:

{"index":"web","type":"index","body":{"query":{"bool":{"must":[{"query_string":{"default_field":"content","query":"(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))"}},{"range":{"pub_date":{"from":1569283200,"to":1569888000}}},{"bool":{"must_not":[{"terms":{"mty_id":[14,15,16]}}]}},{"bool":{"should":[{"bool":{"must_not":[{"nested":{"path":"blk","query":{"bool":{"filter":{"match":{"blk.cli_id":"4599"}}}}}},{"nested":{"path":"blk","query":{"bool":{"filter":{"match":{"blk.kgp_id":1738}}}}}}]}}]}},{"bool":{"must_not":[{"terms":{"pub_id":["35209","35232","35268","35270","35296","35297","35298","35299","35300","35319","35374","35375","35376","35377","35382","35383","45208"]}}]}}]}},"aggs":{"KEYWORDS":{"significant_text":{"field":"content","filter_duplicate_text":true,"include":"(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))"}}},"sort":{"pub_date":{"order":"desc"}}}}

Thanks for that. Your "include" clause will need fixing.
It acts as a filter on the indexed tokens found in the documents (so these will typically be lowercased words and possibly stemmed depending on your choice of analyzer).

If you pass a string in the include clause it is interpreted as a regex.
If you pass an array of strings they are interpreted as exact-match.
It looks like you are passing a single, long, mixed case string which won't match any of the words.

You may want to use the analyze API to convert your search terms like "Isko Moreno" into individual tokens to use in this filter.
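Concretely, the aggregation's `include` clause could be rewritten as an array of exact (lowercased) tokens — a sketch, assuming the `content` field uses an analyzer that lowercases, so these are the actual indexed terms:

```json
"aggs": {
  "KEYWORDS": {
    "significant_text": {
      "field": "content",
      "filter_duplicate_text": true,
      "include": ["isko", "moreno", "vico", "sotto", "manila", "pasig"]
    }
  }
}
```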

1 Like

Many of these keywords look to have come from the same press release

The filter_duplicate_text setting is designed to help tune that sort of stuff out.
Another significant_text configuration that can help the quality and speed of keyword discovery is using a parent sampler aggregation to look at only a sample of high-quality matches. With large result sets it also helps limit RAM use.
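Wrapping significant_text in a sampler might look like this (a sketch; the `shard_size` of 200 is just an illustrative value meaning "consider the top 200 matches per shard"):

```json
"aggs": {
  "SAMPLE": {
    "sampler": { "shard_size": 200 },
    "aggs": {
      "KEYWORDS": {
        "significant_text": {
          "field": "content",
          "filter_duplicate_text": true
        }
      }
    }
  }
}
```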

1 Like

Again, I would like to thank you for your heartfelt willingness to help; I've been stuck on this problem for quite some time now.

So do you mean, sir, that my $keywords variable should look like an array of strings? Like this? ["isko", "moreno"]

The problem with my $keywords variable is that the keywords actually come from a MySQL database, where they're called client_keywords. In this app I just collect these client_keywords and concatenate them into a string so I can search for them with the query_string query. Is my understanding correct, sir @Mark_Harwood?

What makes this harder is that our legacy app ran on a Sphinxsearch full-text server; we're upgrading and migrating to Elasticsearch little by little. This client's client_keyword looks like this:


It's based on the boolean search syntax of Sphinx.

That's why we made a function that converts these keywords into an Elasticsearch-readable boolean format, from the screenshot above into this:

(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))

Pass them via the analyze API like this:

GET /myindex/_analyze
{
  "field": "content",
  "text": "My qUerY striNg"
}
This will give you the response:

  "tokens" : [
	  "token" : "my",
	  "start_offset" : 0,
	  "end_offset" : 2,
	  "type" : "<ALPHANUM>",
	  "position" : 0
	  "token" : "query",
	  "start_offset" : 3,
	  "end_offset" : 8,
	  "type" : "<ALPHANUM>",
	  "position" : 1
	  "token" : "string",
	  "start_offset" : 9,
	  "end_offset" : 15,
	  "type" : "<ALPHANUM>",
	  "position" : 2

You can take each of these tokens and use them in your aggregations.
I've said before to look at significant_text aggregation but if you're not interested in background index stats or discovering new related terms then maybe the adjacency_matrix or filters aggregation will give you the information you need:

GET myindex/_search
{
  "query": {
    "query_string": {
      "query": "My qUerY striNg"
    }
  },
  "aggs": {
    "keywordUsage": {
      "adjacency_matrix": {
        "filters": {
          "my":     { "term": { "content": "my" } },
          "query":  { "term": { "content": "query" } },
          "string": { "term": { "content": "string" } }
        }
      }
    }
  }
}

This is sad — I think the analyze API isn't available in the PHP API of Elasticsearch... :frowning:

I'm using Laravel (a PHP based framework) for the app sir

Up — is there a feature similar to explain but not at document level? I mean, an explain feature that returns detailed stats for the whole query instead of for every document?

I would like to thank sir Mark_Harwood for helping me out, but the procedure he's introducing is quite tedious, as the process of fetching the $keywords already takes too much memory on the server.

The adjacency matrix approach in my last post shouldn’t use much memory?
If you can’t use the analyze API to get the tokens, I imagine splitting the search string on whitespace and using a ‘match’ query rather than a term query in the adjacency matrix filters would work fine.
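Since match queries analyze their input themselves, the raw mixed-case terms can be used directly without calling _analyze first. A sketch based on the earlier "My qUerY striNg" example:

```json
"aggs": {
  "keywordUsage": {
    "adjacency_matrix": {
      "filters": {
        "my":     { "match": { "content": "My" } },
        "query":  { "match": { "content": "qUerY" } },
        "string": { "match": { "content": "striNg" } }
      }
    }
  }
}
```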

Hello sir @Mark_Harwood, here's my problem

I'm still trying to pursue the significant_text aggs


The 1st output is the list of my $keywords that are used for the query_string query

The 2nd output is my converted version: I exploded the string into an array and lowercased the terms.

The 3rd output shows a summary of my query

Now, what's happening here is that significant_text can't find my 1st or 2nd set of $keywords.


And P.S., I can't see the foreground count that you mentioned in your replies above, sir.

They are phrases not words?
Analysers typically tokenise input strings into words so these phrases won’t be in your index.

That’s the ‘doc_count’
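Roughly, each bucket in the significant_text response looks like the fragment below (illustrative numbers only): `doc_count` is the foreground count and `bg_count` is the background count mentioned earlier.

```json
{
  "key" : "isko",
  "doc_count" : 42,
  "score" : 0.21,
  "bg_count" : 130
}
```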

1 Like

Reading your example though it looks to me like you want to understand the frequency of the different entities mentioned in the query, not single words.
By that I mean the 2 mayors and the 2 locations. I’d steer you towards adjacency matrix again which can take arbitrary clauses as filters.
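Because adjacency_matrix accepts arbitrary query clauses as named filters, each mayor and location can be its own filter; the matrix also reports intersections such as `isko&manila`. A sketch, using the field names from earlier in the thread:

```json
"aggs": {
  "entityUsage": {
    "adjacency_matrix": {
      "filters": {
        "isko":   { "match_phrase": { "content": "Mayor Isko Moreno" } },
        "vico":   { "match_phrase": { "content": "Mayor Vico Sotto" } },
        "manila": { "match": { "content": "Manila" } },
        "pasig":  { "match": { "content": "Pasig" } }
      }
    }
  }
}
```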

1 Like

Oh no, if significant_text only accepts words and not phrases, this would not be the ideal process for our users, because the goal of our app is to let users enter the keywords they want to search in our database as free text.

Is there any other way around this, sir? Still testing the adjacency matrix.

Yes, that's correct! You're understanding it right, sir! I just tested the adjacency_matrix; the bad thing about it is that it counts frequency based on doc_count, NOT on word occurrences.

I am looking for the same behavior as the _explain API: it counts the frequency of keyword occurrences PER DOCUMENT, but my goal is to determine keyword occurrences across ALL the documents retrieved by my query :frowning:

Is that not typically the more useful measure? A single doc may be spammy and have a lot of repetition/keyword stuffing.
What's the end objective? How does this information get used or benefit the end user?

We will be displaying the Top Keywords entered by the user for reporting purposes, sir.

For example:

The user entered:

  • Donald Trump
  • Border
  • Mexico

With these keywords we use the query_string query to look them up in the documents (articles) in our index and then display the results to the user...

Sooner or later a user will have a lot of keywords, and we would like to offer a report feature that shows which of his/her keywords appears most often in the fetched articles.
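For that report, one way (a sketch, following Mark's earlier pointer to the filters aggregation; index and field names are assumptions) is to add one named filter per user keyword and rank the buckets by doc_count. Note the caveat discussed above still applies: this counts documents containing each keyword, not total occurrences.

```json
GET web/_search
{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "(\"Donald Trump\" OR \"Border\" OR \"Mexico\")"
    }
  },
  "size": 0,
  "aggs": {
    "topKeywords": {
      "filters": {
        "filters": {
          "Donald Trump": { "match_phrase": { "content": "Donald Trump" } },
          "Border":       { "match": { "content": "Border" } },
          "Mexico":       { "match": { "content": "Mexico" } }
        }
      }
    }
  }
}
```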