I am trying to build a semantic search engine on top of ES, and below is my idea. I was wondering if this makes sense to you, especially the ranking part. Feel free to be critical.
Simply put, if someone inputs a keyword such as "sport", I will traverse the ontology/graph that I already have to find related keywords such as "tennis" and "football", each with a different weight. Then I use "sport" along with the related words to form a new query to ES. Once ES returns the results, I will add the weights to the relevance scores and re-rank the results.
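To make the idea concrete, here is a minimal sketch of the expansion and re-ranking steps. The `ONTOLOGY` dict, the rule that weights multiply along a path, the `max_depth`/`min_weight` cut-offs, and the `(doc_id, score, matched_term)` hit shape are all assumptions made for illustration, not something ES provides.

```python
from collections import deque

# Hypothetical ontology: each edge carries a relatedness weight in (0, 1].
ONTOLOGY = {
    "sport": {"tennis": 0.9, "football": 0.8},
    "tennis": {"racket": 0.7},
    "football": {},
    "racket": {},
}


def expand(keyword, graph, max_depth=2, min_weight=0.5):
    """Breadth-first traversal from keyword.

    A term's weight is the product of the edge weights on the best
    path found to it (an assumed scoring rule).
    """
    weights = {keyword: 1.0}
    queue = deque([(keyword, 0)])
    while queue:
        term, depth = queue.popleft()
        if depth == max_depth:
            continue
        for neighbor, edge_weight in graph.get(term, {}).items():
            weight = weights[term] * edge_weight
            # Keep only strong enough terms, and only the best weight per term.
            if weight >= min_weight and weight > weights.get(neighbor, 0.0):
                weights[neighbor] = weight
                queue.append((neighbor, depth + 1))
    return weights


def rerank(hits, weights):
    """Add the ontology weight of the matched term to the ES score, then sort.

    hits is a list of (doc_id, es_score, matched_term) tuples, a
    simplified stand-in for a real ES response.
    """
    rescored = [(doc_id, score + weights.get(term, 0.0)) for doc_id, score, term in hits]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Adding a weight in [0, 1] to an unbounded relevance score means the weight matters less as scores grow, so multiplying or normalizing the scores first may be worth trying.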
You should index all related terms together with a term. Then you don't need to traverse and re-rank at search time, which is bad for performance, and does not scale.
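As a rough sketch of index-time expansion done on the client before sending documents to ES: the `RELATED` map and the field names `body` and `related_terms` are hypothetical, and the whitespace tokenization is only a placeholder for a real analyzer.

```python
# Hypothetical map from a term to the related terms it should also be findable by.
RELATED = {"tennis": ["sport"], "football": ["sport"]}


def enrich(doc, related, text_field="body", target_field="related_terms"):
    """Return a copy of doc with related terms collected in a separate field."""
    extra = []
    for token in doc[text_field].lower().split():
        for term in related.get(token, []):
            if term not in extra:  # keep each related term once, in first-seen order
                extra.append(term)
    return {**doc, target_field: extra}
```

A document whose body mentions "tennis" then also carries "sport" in `related_terms`, so a plain query for "sport" against both fields can find it with no expansion at search time.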
Yes, I have got some ideas through reading literature, and I am working on some of them.
If what you want is just synonyms, you can use the out-of-the-box synonym support in ES or the plugin Jprante developed.
But if you need something more complex, like what I described in my question, you can:
do latent semantic analysis (LSA) on your documents,
or use an existing ontology (WordNet for general purposes),
or build your own ontology in your own way (this is what I am working on: discovering semantic relations from user behavior).
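For the plain-synonym case, a minimal sketch of index settings that use the built-in `synonym` token filter could look like the following. The helper name `synonym_settings` and the filter and analyzer names are made up here, and the exact options should be checked against the ES version in use.

```python
def synonym_settings(groups, filter_name="ontology_synonyms", analyzer_name="ontology_analyzer"):
    """Build an index-settings body with a synonym token filter.

    Each group of equivalent words becomes one rule of the form "a, b, c".
    """
    rules = [", ".join(group) for group in groups]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    filter_name: {"type": "synonym", "synonyms": rules},
                },
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", filter_name],
                    },
                },
            }
        }
    }


# This body would be sent when creating the index.
body = synonym_settings([["football", "soccer"], ["tv", "television"]])
```

Note that rules like these treat every word in a group as fully equivalent, so they cannot express graded relatedness.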
As you may know, semantic search is still under active research, and there is no off-the-shelf tool you can use. Let me know if you have any ideas.
I am also working on a similar problem. For now I have only key-value pairs as my ontology. I am thinking about using Redis or ES to index this semantic/synonymy data. Then, for every search keyword, I would first query this index to fetch all similar keywords, and then query the actual data index with the weights applied in a should clause.
What challenges do you foresee in my approach, if you have thought along these lines?
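Roughly, the second step would build something like the query below. This is only a sketch: the `SYNONYMS` dict stands in for the Redis/ES lookup, and the field name `body` and the `max_expansions` cap are assumptions.

```python
# Hypothetical result of the first lookup: keyword -> {similar keyword: weight}.
SYNONYMS = {"sport": {"tennis": 0.9, "football": 0.8}}


def build_query(keyword, synonyms, field="body", max_expansions=10):
    """Build a bool query where each related keyword is a boosted should clause."""
    related = sorted(synonyms.get(keyword, {}).items(), key=lambda kv: kv[1], reverse=True)
    # The original keyword gets full weight; related terms get their own weight as boost.
    terms = [(keyword, 1.0)] + related[:max_expansions]
    should = [
        {"match": {field: {"query": term, "boost": weight}}}
        for term, weight in terms
    ]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}
```

Capping the number of expansions keeps the should list short, and putting the weight in `boost` lets ES fold it into the score so no separate re-ranking pass is needed.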
@Sriharsha_Pothukuchi that is what my reference plugin is doing: fetching a list of variants and indexing them at index time together with the main form of a word.
I do not recommend query expansion on the server side by a plugin. It will add a lot of load to ES. Query expansion is better done on the client side. Note that a large number of should clauses leads to slow queries.
I actually ended up with pretty much the same approach as you. What @jprante suggests is a good approach when your ontology/synonym set is static.
But if your ontology keeps changing or growing (e.g. you are mining knowledge from massive user search behavior), query-time expansion is probably the right direction to go; otherwise you have to re-index everything each time your ontology grows. Also, an index-time plugin usually assumes that all associated words are equivalent. That becomes problematic when the similarity between A and B is somewhere in between, say 0.8.
Yes, I assume re-indexing is cheap. A reference dictionary of ~10 million docs with ~40 million variant forms in the docs, with daily changes, can be indexed in ~10 minutes here.
I can create a conceptual search with LDA/LSA + cosine similarity, and I believe it should give better results than a synonym ontology (especially when searching for long documents). Is there a way (presumably not) to apply this approach to ES?
@sacherus, I know someone is doing this for Solr, but it might be a bit hard to do with ES. An alternative approach is to store the keyword similarities in ES after performing LDA/LSA. When a keyword A comes in, we first find the N most related keywords, and then use them to create a semantic boost query.
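A small sketch of the offline part, assuming the LDA/LSA step has already produced one vector per keyword. The `VECTORS` values are invented, and the `keyword`/`related`/`similarity` field names are just one possible shape for the documents stored in ES.

```python
import math

# Hypothetical keyword vectors, e.g. rows of an LSA term-topic matrix.
VECTORS = {
    "sport": [0.9, 0.1, 0.0],
    "tennis": [0.8, 0.2, 0.1],
    "football": [0.7, 0.0, 0.3],
    "water": [0.0, 0.9, 0.4],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_related(keyword, vectors, n=2):
    """Return the n keywords most similar to keyword as (term, similarity) pairs."""
    query_vec = vectors[keyword]
    scored = [
        (other, cosine(query_vec, vec))
        for other, vec in vectors.items()
        if other != keyword
    ]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:n]


def to_similarity_docs(keyword, related):
    """Shape the pairs as documents that could be stored in a similarity index."""
    return [
        {"keyword": keyword, "related": term, "similarity": round(score, 4)}
        for term, score in related
    ]
```

At query time the stored similarities for keyword A can be read back and used directly as boosts for the related keywords.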