I'm working on querying across multiple indices in Elasticsearch. Each data source has its own index and its own type. I then run a should/match query against each relevant field, hitting every index through a single alias that spans them all. Based on what I'm seeing, there is a scoring discrepancy: one larger index of ### million records appears to overshadow a smaller index of ## million records.
Even if the same information matches in both indices on a term query, the 50-100 results returned are typically all from the bigger index. Is there any way to baseline the relevancy scoring while keeping these indices separate and using an alias to query both? The other option (and the title of this post) that I'm considering, and would like feedback on, is lumping all the data together into one index, which would theoretically eliminate the problem described below in this quote from the Relevance is Broken! documentation page:
The more times it appears, the more relevant is this document. The inverse document frequency takes into account how often a term appears as a percentage of all the documents in the index. The more frequently the term appears, the less weight it has.
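To make the quoted point concrete, here is a minimal sketch of how the same term can get a different inverse document frequency depending on which index it is scored in. It assumes the classic Lucene TF/IDF formula, idf = 1 + ln(docCount / (docFreq + 1)); the cluster may be using a different similarity (for example BM25), and by default the statistics are local to each shard rather than each index. All document counts below are made-up illustration values, not the real index sizes.

```python
import math


def classic_idf(doc_count, doc_freq):
    """Classic Lucene-style IDF: 1 + ln(docCount / (docFreq + 1)).

    Assumption: the classic TF/IDF similarity is in use. Other
    similarities use a different formula but behave similarly.
    """
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical numbers for illustration only: the same term matches
# 1,000 documents in each index, but the indices differ in size.
big_index = {"docs": 100_000_000, "doc_freq": 1_000}
small_index = {"docs": 10_000_000, "doc_freq": 1_000}

idf_big = classic_idf(big_index["docs"], big_index["doc_freq"])
idf_small = classic_idf(small_index["docs"], small_index["doc_freq"])

# With global statistics (one combined index, or a DFS search type),
# every matching document is scored with the same IDF.
idf_combined = classic_idf(
    big_index["docs"] + small_index["docs"],
    big_index["doc_freq"] + small_index["doc_freq"],
)

print(f"IDF in big index:   {idf_big:.3f}")
print(f"IDF in small index: {idf_small:.3f}")
print(f"IDF with combined statistics: {idf_combined:.3f}")
```

Under these assumed numbers the term is rarer, as a percentage, in the bigger index, so it gets a higher IDF there and matching documents from that index tend to score higher. Whether this is the actual cause of the discrepancy depends on the real term statistics.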
Would this method of using one index instead of multiple have any baselining scoring effect on the records stored? Thanks in advance!
You can experiment with a different search type, DFS Query Then Fetch (dfs_query_then_fetch), but note that it adds a round trip from the coordinating node to each involved node, so it has an impact on performance. There is also a blog post about the differences between the search types.
You could also consider merging your indices into one, but then you would likely need to increase the number of shards, and that is probably a very large change with significant performance implications. So I'd start by changing the search type and see whether it fits your needs.
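As a rough sketch of what changing the search type looks like: the search type is passed as the search_type URL parameter on the _search endpoint, and the target can be an alias. The host, alias name, field name and query text below are placeholders, and the should/match body is only an assumption about what the query looks like. The snippet only builds the request; sending it is left commented out.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_dfs_search(base_url, alias, field, text):
    """Build a _search request that uses dfs_query_then_fetch.

    base_url, alias, field and text are hypothetical placeholders;
    substitute the values for your own cluster and mapping.
    """
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{base_url}/{alias}/_search?{params}"
    # Assumed query shape: a bool query with a single should/match clause.
    body = {"query": {"bool": {"should": [{"match": {field: text}}]}}}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_dfs_search("http://localhost:9200", "alias1", "name", "smith")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually run the search against a live cluster:
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     print(json.load(resp)["hits"]["total"])
```

Comparing the top hits with and without the search_type parameter should show whether globally computed term statistics change which index dominates the results.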
I don't believe DFS Query Then Fetch works across multiple indices, does it? When I attempt this as a GET call, Elasticsearch throws this error:
"type": "illegal_argument_exception",
"reason": "Alias [alias1] has more than one indices associated with it [[index1 ,index2 ,index3 ,index4]], can't execute a single index op"
It seems like this is what combining everything into one index would be good for. Is there a multi-index operation that functions similarly to dfs_query_then_fetch?