Determine Top Keywords from the fetched list of documents

If you're only echoing back the popularity of what users typed, I'd argue that's of limited value.
It's often more useful to give them things other than what they typed to open new lines of enquiry or refine the query.

Here's a real example of your query on news data using significant_text

Note that useful phrases like "border agents" are indexed because I used an analyzer that produces single-word and two-word "shingles".
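For anyone wanting to try this, here is a minimal sketch of what such a setup might look like in PHP array form. It is not the exact configuration behind the example above: the field name content is borrowed from later in this thread, the names shingle_1_2 and content_shingles are made up for illustration, and the match query is only a placeholder for whatever the user typed.

```php
<?php
// Index settings: an analyzer that emits single words plus two-word shingles.
// "shingle_1_2" and "content_shingles" are hypothetical names.
$indexBody = [
    'settings' => [
        'analysis' => [
            'filter' => [
                'shingle_1_2' => [
                    'type' => 'shingle',
                    'min_shingle_size' => 2,
                    'max_shingle_size' => 2,
                    'output_unigrams' => true, // keep the single words too
                ],
            ],
            'analyzer' => [
                'content_shingles' => [
                    'type' => 'custom',
                    'tokenizer' => 'standard',
                    'filter' => ['lowercase', 'shingle_1_2'],
                ],
            ],
        ],
    ],
    'mappings' => [
        'properties' => [
            // Assumed field name; adjust to your own mapping.
            'content' => ['type' => 'text', 'analyzer' => 'content_shingles'],
        ],
    ],
];

// Search body: significant_text on a sample of the best-matching documents.
$searchBody = [
    'size' => 0,
    'query' => ['match' => ['content' => 'border wall']], // placeholder query
    'aggs' => [
        'sample' => [
            'sampler' => ['shard_size' => 200],
            'aggs' => [
                'keywords' => [
                    'significant_text' => [
                        'field' => 'content',
                        'filter_duplicate_text' => true,
                    ],
                ],
            ],
        ],
    ],
];
```

Wrapping significant_text in a sampler keeps the analysis focused on the top-scoring documents and limits the cost of re-analyzing text at query time.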

Another useful technique is to extract names into a structured field using something like Rosette or spaCy, and then use the annotated_text field type to allow drill-downs into the text to show where they were mentioned. Here, for example, we use significant_terms on the people field (to focus on the significant rather than the popular) and discover the CBP commissioner:
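A rough sketch of that kind of request, assuming people is a keyword field holding the extracted names and reusing a placeholder query on an assumed content field:

```php
<?php
// Assumptions: "people" is a keyword field filled by an entity extractor,
// and "content" is the analyzed text field. The query is a placeholder.
$searchBody = [
    'size' => 0,
    'query' => ['match' => ['content' => 'border wall']],
    'aggs' => [
        'significant_people' => [
            // Ranks names that are unusually frequent in the matching
            // documents compared with the index as a whole.
            'significant_terms' => [
                'field' => 'people',
                'size' => 10,
            ],
        ],
    ],
];
```

Swapping significant_terms for a plain terms aggregation would instead return the most popular names, which tends to surface the same well-known people for every query.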

These people's names, or significant text like "wall" or "barrier", can then typically be added to the query.

Copy! Got it sir.

I'm now leaning more towards doc_count than the keyword frequency per document.

My problem now is with the adjacency_matrix aggs: my keywords are stored as a string data type. Since we're using Laravel (a PHP-based framework) with MySQL and Sphinx as the legacy database and search server, the stored keywords don't follow a consistent pattern, so I can't find a way to separate the main keywords from the negated ones, i.e. the ones in the NOT part of the query_string query. (We already have roughly 25,000+ records.)

Going back to the adjacency_matrix aggs, here's my scenario:

I have this string as the user's keyword:

"(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))"

Following the significant_text approach from the earlier replies, I found a way to explode the string and convert it into an array:

image

Now I have added it to the adjacency_matrix aggs

image

And the result:

image

The problem I'm seeing:
The adjacency_matrix appears to join all 4 significant phrases from my keywords into one, counting only the documents in which all 4 of them appear. Logically, though, the keyword only requires one of "Mayor Isko Moreno" OR "Mayor Vico Sotto" AND one of "Manila" OR "Pasig".
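For comparison, here is a minimal sketch of an adjacency_matrix aggregation in which each phrase is its own named filter. The filter names (isko, vico, manila, pasig) are made up for illustration, content is assumed to be the analyzed text field, and match_phrase is my choice here rather than what the screenshots show.

```php
<?php
// Assumptions: "content" is an analyzed text field; filter names are hypothetical.
$searchBody = [
    'size' => 0,
    'query' => [
        'query_string' => [
            'default_field' => 'content',
            'query' => '(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))',
        ],
    ],
    'aggs' => [
        'KEYWORDS' => [
            'adjacency_matrix' => [
                'filters' => [
                    'isko'   => ['match_phrase' => ['content' => 'Mayor Isko Moreno']],
                    'vico'   => ['match_phrase' => ['content' => 'Mayor Vico Sotto']],
                    'manila' => ['match_phrase' => ['content' => 'Manila']],
                    'pasig'  => ['match_phrase' => ['content' => 'Pasig']],
                ],
            ],
        ],
    ],
];
```

With separate named filters, adjacency_matrix returns one bucket per filter plus one per pair of filters that share documents (keys such as isko&manila, using the default & separator), and it omits empty buckets. It only reports pairwise intersections, so it should not by itself produce a single bucket requiring all four phrases.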

We were able to reproduce this concept with the explain feature of the Elasticsearch search API.

a sample screenshot of the search API response
image

then here's a sample screenshot of the contents of the explain object

And that's how we were able to display the keyword frequency PER HIT / DOCUMENT
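Since the screenshots may not be visible, a minimal sketch of the kind of request body involved, assuming the content field and the query_string from earlier in the thread:

```php
<?php
// Assumption: "content" is the searched text field.
$searchBody = [
    'explain' => true, // ask Elasticsearch to explain the score of every hit
    'query' => [
        'query_string' => [
            'default_field' => 'content',
            'query' => '(("Mayor Isko Moreno" OR "Mayor Vico Sotto") AND ("Manila" OR "Pasig"))',
        ],
    ],
];

// Each entry in hits.hits then carries an "_explanation" object with
// "value", "description" and nested "details"; the term frequencies
// appear inside those nested details.
```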

The problem with explain is that it appears only at the hit level, meaning it is reported separately for every hit.

Is there a way, or a feature similar to explain, that appears at the top level of the response body so I can get a summarized result?

P.S. This all happens within the search API alone; I didn't need any other calls or endpoints.

Latest R&D update: my aggregation query now looks like this:

"aggs" => [
    "KEYWORDS" => [
        "filters" => [
            "filters" => [
                "term1" => [
                    "term" => [
                        'content' => "isko"
                    ]
                ],
                "term2" => [
                    "term" => [
                        'content' => "manila"
                    ]
                ]
            ]
        ]
    ]
]

and here's the response:

image

I'm close to my expected output. However, when I use the full phrases / keywords that I actually want, Elasticsearch does not recognize them:

"aggs" => [
    "KEYWORDS" => [
        "filters" => [
            "filters" => [
                "term1" => [
                    "term" => [
                        'content' => "mayor isko moreno"
                    ]
                ],
                "term2" => [
                    "term" => [
                        'content' => "mayor vico sotto"
                    ]
                ]
            ]
        ]
    ]
]

response
image

Does this mean Elasticsearch can't recognize phrases?
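One likely explanation, assuming content is an analyzed text field: a term query is not analyzed, so it looks for a single indexed token that equals the whole string "mayor isko moreno". A standard analyzer indexes "mayor", "isko" and "moreno" as separate tokens, which is why the single-word terms matched and the phrases did not. A sketch of the same aggregation using match_phrase, which analyzes the text and checks that the words appear in order:

```php
<?php
// Assumption: "content" is an analyzed text field, as in the snippets above.
$aggs = [
    'aggs' => [
        'KEYWORDS' => [
            'filters' => [
                'filters' => [
                    'term1' => [
                        // match_phrase instead of term: the input is analyzed
                        // and the tokens must appear next to each other, in order.
                        'match_phrase' => [
                            'content' => 'mayor isko moreno',
                        ],
                    ],
                    'term2' => [
                        'match_phrase' => [
                            'content' => 'mayor vico sotto',
                        ],
                    ],
                ],
            ],
        ],
    ],
];
```

Each bucket's doc_count would then be the number of matching documents containing that phrase, which is still a per-document count rather than a total number of occurrences.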
