This is a continuation of a post I made months ago; I was no longer able to comment there because the post got locked.
First, I would like to thank sir @Dadoonet for helping me out last time. I was able to make full use of the ES explain feature: I wrote a recursive function that fetches the specific "frequencies" array index from the ES response.
Now, this time my specific requirement is to determine which of my keywords has the highest number of occurrences.
The problem with the explain feature is that it counts the frequency of these words in EACH AND EVERY document, but NOT across the WHOLE LIST OF DOCUMENTS returned by the search query.
SOLUTIONS I'VE TRIED
SOLUTION 1:
Since I'm using Laravel (a PHP-based framework), I thought of looping through the ['hits']['hits'] part of the response.
But what if I get 10,000 documents (the default maximum size)? It would be too memory-intensive for the server to loop through ['hits']['hits'] and run my recursive function just to sum up all the frequencies.
Is there an easier way for me to determine the top keywords from my list of fetched documents?
Have a look at the significant_text aggregation. You can give it words of interest in the "include" clause or let it discover the interesting words in your result set (the usual use case).
For each word of interest it will give:
the "foreground" count (number of docs matching the query with the word) and
the background count (number of docs in the index with the word)
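A minimal sketch of such a request (the index name web, the field content, and the example query are placeholder assumptions for illustration):

    GET web/_search
    {
      "size": 0,
      "query": {
        "query_string": { "default_field": "content", "query": "Manila OR Pasig" }
      },
      "aggs": {
        "KEYWORDS": {
          "significant_text": { "field": "content" }
        }
      }
    }

Each bucket in the response then carries a doc_count (the foreground count) and a bg_count (the background count).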
Awesome! Thank you so much for your response sir @Mark_Harwood!
However, the response of the significant_text aggs also includes other "significant terms" (as the response labels them), but to me as the developer, and to the client, they're not significant at all.
I'm now using a different user together with his saved filters; please refer to the aggs portion, sir:
{"index":"web","type":"index","body":{"query":{"bool":{"must":[{"query_string":{"default_field":"content","query":"(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))"}},{"range":{"pub_date":{"from":1569283200,"to":1569888000}}},{"bool":{"must_not":[{"terms":{"mty_id":[14,15,16]}}]}},{"bool":{"should":[{"bool":{"must_not":[{"nested":{"path":"blk","query":{"bool":{"filter":{"match":{"blk.cli_id":"4599"}}}}}},{"nested":{"path":"blk","query":{"bool":{"filter":{"match":{"blk.kgp_id":1738}}}}}}]}}]}},{"bool":{"must_not":[{"terms":{"pub_id":["35209","35232","35268","35270","35296","35297","35298","35299","35300","35319","35374","35375","35376","35377","35382","35383","45208"]}}]}}]}},"aggs":{"KEYWORDS":{"significant_text":{"field":"content","filter_duplicate_text":true,"include":"(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))"}}},"sort":{"pub_date":{"order":"desc"}}}}
Thanks for that. Your "include" clause will need fixing.
It acts as a filter on the indexed tokens found in the documents (so these will typically be lowercased words and possibly stemmed depending on your choice of analyzer).
If you pass a string in the include clause it is interpreted as a regex.
If you pass an array of strings they are interpreted as exact-match.
It looks like you are passing a single, long, mixed case string which won't match any of the words.
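In concrete terms, the two accepted forms would look like this (the lowercased tokens here are an assumption about what your analyzer produces):

    "include": "isko|moreno|vico|sotto"              <- a string, treated as a regex over tokens
    "include": ["isko", "moreno", "vico", "sotto"]   <- an array, treated as exact token matches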
You may want to use the analyze API to convert your search terms like Isko Moreno into individual terms to use in this filter.
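For example (a sketch against the web index from your query; with a standard analyzer this should return the tokens mayor, isko, and moreno, which are the forms to put in the include array):

    GET web/_analyze
    {
      "field": "content",
      "text": "Mayor Isko Moreno"
    }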
The filter_duplicate_text setting is designed to help tune that sort of noise (e.g. boilerplate text repeated across documents) out.
Another significant_text configuration that can help the quality and speed of keyword discovery is a parent sampler aggregation, so that only a sample of the highest-quality matches is examined. With large result sets it's also necessary for limiting RAM use.
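A sketch of that combination (the shard_size of 200 is just an example value; index, field, and query are assumptions as before):

    GET web/_search
    {
      "size": 0,
      "query": {
        "query_string": { "default_field": "content", "query": "Manila OR Pasig" }
      },
      "aggs": {
        "SAMPLE": {
          "sampler": { "shard_size": 200 },
          "aggs": {
            "KEYWORDS": {
              "significant_text": {
                "field": "content",
                "filter_duplicate_text": true
              }
            }
          }
        }
      }
    }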
Again, I would like to thank you for your heartfelt willingness to help; I've been stuck on this problem for quite some time now.
So do you mean, sir, that my $keywords variable should look like an array of strings? Like this? ["isko", "moreno"]
The problem with my $keywords variable is that its values actually come from a MySQL database, in a table called client_keywords. In this particular app, I just collect these client_keywords and concatenate them into one string so I can search for them with the query_string query. Is my understanding correct, sir @Mark_Harwood?
And what makes this harder is that our legacy app ran on a Sphinx full-text search server; we're upgrading and migrating to Elasticsearch, so we're transferring things little by little. The client_keyword of this client looks like this:
It's based on the boolean search syntax of Sphinx, so we made a function that converts these keywords from the screenshot above into an Elasticsearch-readable boolean format, like this:
(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))
You can take each of these tokens and use them in your aggregations.
I've said before to look at the significant_text aggregation, but if you're not interested in background index stats or in discovering new related terms, then maybe the adjacency_matrix or filters aggregation will give you the information you need:
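For instance, a filters aggregation using your four entities as clauses might look like this (a sketch; the bucket names are arbitrary, and each bucket comes back with its own doc_count):

    GET web/_search
    {
      "size": 0,
      "query": {
        "query_string": {
          "default_field": "content",
          "query": "((\"Mayor Isko Moreno\" OR \"Mayor Vico Sotto\") AND (\"Manila\" OR \"Pasig\"))"
        }
      },
      "aggs": {
        "KEYWORDS": {
          "filters": {
            "filters": {
              "isko_moreno": { "match_phrase": { "content": "Mayor Isko Moreno" } },
              "vico_sotto":  { "match_phrase": { "content": "Mayor Vico Sotto" } },
              "manila":      { "match": { "content": "Manila" } },
              "pasig":       { "match": { "content": "Pasig" } }
            }
          }
        }
      }
    }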
Bump. Is there a feature similar to explain but not at document level? I mean, an explain feature that returns detailed stats for the whole query instead of for every document?
I would like to thank sir Mark_Harwood for helping me out, but the procedure he's introducing is quite tedious, as the process of fetching the $keywords already takes too much memory on the server.
The adjacency matrix approach in my last post shouldn't use much memory?
If you can't use the analyze API to get the tokens, I imagine splitting the search string on whitespace and using a "match" query rather than a term query in the adjacency matrix filters would work fine.
Reading your example though it looks to me like you want to understand the frequency of the different entities mentioned in the query, not single words.
By that I mean the 2 mayors and the 2 locations. I'd steer you towards adjacency matrix again, which can take arbitrary clauses as filters.
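Reusing the clauses from the filters sketch above, the aggs section would become (a sketch):

    "aggs": {
      "ENTITIES": {
        "adjacency_matrix": {
          "filters": {
            "isko_moreno": { "match_phrase": { "content": "Mayor Isko Moreno" } },
            "vico_sotto":  { "match_phrase": { "content": "Mayor Vico Sotto" } },
            "manila":      { "match": { "content": "Manila" } },
            "pasig":       { "match": { "content": "Pasig" } }
          }
        }
      }
    }

Besides one bucket per filter, the response also contains intersection buckets such as isko_moreno&manila, each with its own doc_count.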
Oh no, if significant_text only accepts words and not phrases, this would not be the ideal process for our users, because the goal of our app is to let users input free-text keywords that they want to search for in our database.
Any other way around this, sir? Still testing the adjacency matrix.
Yes yes! That's correct! You're understanding it right, sir! I just tested the adjacency_matrix; the bad thing about it is that it counts frequency as a doc_count, NOT as the number of word occurrences.
I am looking for the same behavior as the _explain API: it counts the frequency of keyword occurrences PER DOCUMENT, but my goal here is to determine the keyword occurrences across all the documents retrieved by my query.
Is that not typically the more useful measure? A single doc may be spammy and have a lot of repetition/keyword stuffing.
What's the end objective? How does this information get used or benefit the end user?
We will be displaying the Top Keywords given by the user for reporting purposes, sir.
For example:
The user entered:
Donald Trump
Border
Mexico
Among these keywords, we'll use the query_string query to look up documents (articles) in our index and then display them to the user...
Now, sooner or later the user will have a lot of keywords, and we would like to provide a report feature that shows which of his/her keywords appears most often in the fetched articles.
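For illustration, that lookup could be a query_string request like this (a sketch; how the keywords get joined is a simplifying assumption):

    GET web/_search
    {
      "query": {
        "query_string": {
          "default_field": "content",
          "query": "\"Donald Trump\" OR \"Border\" OR \"Mexico\""
        }
      }
    }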