I am working on a news app for the gaming industry, and I would like to be able to identify headlines/articles whose titles are about the same subject.
One thing to note is that games and platforms have many alternative names (PS5/PlayStation 5, WoW/World of Warcraft, LoL/League of Legends, etc.).
So the question is: what process/query would give me the best results for finding similar titles?
Option 1: More like this
I started with the MLT query with a stopword filter, but it produces a lot of false positives and misses titles because of the variety of names.
So I thought of a few other options.
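For reference, a minimal sketch of the kind of MLT request body I mean, built as a plain Python dict. The field name `title`, the stop word list and the `minimum_should_match` value are assumptions for illustration, not my real settings. `min_term_freq` and `min_doc_freq` are lowered on purpose: the defaults (2 and 5) tend to discard most terms of a short title.

```python
import json

# Hypothetical field name; adjust to the real mapping.
TITLE_FIELD = "title"


def build_mlt_query(title, stop_words):
    """Build a more_like_this request body for one incoming title."""
    return {
        "query": {
            "more_like_this": {
                "fields": [TITLE_FIELD],
                "like": title,
                # Titles are short, so a term that appears once must still count.
                "min_term_freq": 1,
                "min_doc_freq": 1,
                "max_query_terms": 12,
                "stop_words": stop_words,
                # Assumed threshold; this is the main knob against false positives.
                "minimum_should_match": "60%",
            }
        }
    }


body = build_mlt_query("PS5 restock expected this week", ["the", "a", "this"])
print(json.dumps(body, indent=2))
```

The dict can be sent as the body of a search request with whichever client is in use.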
Option 2: Synonyms
Adding synonym definitions, but they would need to be constantly updated with new games and their alternative names. I believe this could help when searching for news articles about a game, but maybe not for matching news articles to each other.
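A rough sketch of what the synonym setup could look like, again as a Python dict. The filter name `game_synonyms`, the analyzer name `title_search` and the field name `title` are made up for the example. It uses a `synonym_graph` filter applied at search time only, so multi-word names such as "world of warcraft" are handled as phrases.

```python
def build_index_body(synonym_rules):
    """Build index settings and mappings with a search-time synonym analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Hypothetical filter name.
                    "game_synonyms": {
                        "type": "synonym_graph",
                        "synonyms": synonym_rules,
                    }
                },
                "analyzer": {
                    # Hypothetical analyzer name, used only at search time.
                    "title_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "game_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "standard",
                    "search_analyzer": "title_search",
                }
            }
        },
    }


# Each rule lists equivalent names for one game or platform.
rules = [
    "ps5, playstation 5",
    "wow, world of warcraft",
    "lol, league of legends",
]
index_body = build_index_body(rules)
```

The maintenance problem stays the same: every new game or nickname means a new rule in `rules`.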
Option 3: Entity recognition
I am not familiar with the entity recognition module in ES, so I did it in a different way.
To recognize games mentioned in titles I have implemented a percolate query that matches games by their possible names.
I was thinking of running that query on each news title, replacing the game names it finds with their entity IDs, and then indexing the rewritten title.
From there I would run a tuned match query, but it means running the whole process on every incoming news item.
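To make the idea concrete, here is a small sketch of the replace-then-match step. The alias table stands in for the result of my percolate query (it is not how the percolator itself works), and the entity IDs, the field name `title_entities` and the `minimum_should_match` value are invented for the example.

```python
import re

# Hypothetical alias table; in practice the matches come from the percolate query.
ALIASES = {
    "game_ps5": ["PS5", "PlayStation 5"],
    "game_wow": ["WoW", "World of Warcraft"],
    "game_lol": ["LoL", "League of Legends"],
}


def build_matcher(aliases):
    """Return a compiled pattern and a lowercase name -> entity ID lookup."""
    lookup = {name.lower(): eid for eid, names in aliases.items() for name in names}
    # Longest names first so "World of Warcraft" wins over any shorter alias.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    return pattern, lookup


def normalize_title(title, pattern, lookup):
    """Replace every known game name in the title with its entity ID."""
    return pattern.sub(lambda m: lookup[m.group(1).lower()], title)


def build_match_query(normalized_title):
    """Match query against the field that stores normalized titles."""
    return {
        "query": {
            "match": {
                "title_entities": {  # hypothetical field name
                    "query": normalized_title,
                    "minimum_should_match": "70%",  # assumed tuning value
                }
            }
        }
    }


pattern, lookup = build_matcher(ALIASES)
normalized = normalize_title("WoW and World of Warcraft on PlayStation 5", pattern, lookup)
query_body = build_match_query(normalized)
```

With this, two titles that use different names for the same game end up sharing the same token, which is the whole point of the option.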
Option 4: Significant text with shingles
Following the discussion "Find trending article", I see the possibility of finding the trending shingles, but I am not sure where to go from there.
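As far as I understand it, the request would look roughly like the sketch below. The date field `published_at`, the shingled sub-field `title.shingles`, the time window and the sizes are all assumptions. Because the aggregated field is a sub-field, `source_fields` points the aggregation at the original `title` text.

```python
def build_trending_shingles_request(window="now-1d", top_n=10):
    """Build a significant_text request over recently published titles."""
    return {
        "size": 0,  # only the aggregation result is needed
        "query": {
            # Hypothetical date field; the foreground set is the recent news.
            "range": {"published_at": {"gte": window}}
        },
        "aggs": {
            "recent_sample": {
                # Sampling keeps the re-analysis of text affordable.
                "sampler": {"shard_size": 200},
                "aggs": {
                    "trending_shingles": {
                        "significant_text": {
                            "field": "title.shingles",  # hypothetical sub-field
                            "source_fields": ["title"],
                            "size": top_n,
                            "filter_duplicate_text": True,
                        }
                    }
                },
            }
        },
    }


request_body = build_trending_shingles_request()
```

This only gets me a list of significant shingles for the time window; it does not yet group the titles themselves.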
Maybe the answer is a mix of those options?
Thank you in advance for your feedback!