Baselining relevancy scoring by creating one massive index instead of separate indices

I'm working on querying across multiple indices in Elasticsearch. Each data source has its own index and its own type. I then run a should match query against each relevant field, hitting every index through one alias that spans them all. Based on what I'm seeing, there is a scoring discrepancy: one larger index of ### million records appears to overshadow a smaller index of ## million records.

Even if the same information matches in both indices on a term query, the first 50-100 results are typically all from the bigger index. Is there any way to baseline the relevancy scoring while keeping these indices separate and using an alias to query both? The other option I'm considering (and the title of this post), and would like feedback on, is lumping all the data together into one index, which would theoretically eliminate the problem described in the quote below, from the "Relevance is Broken!" documentation page:

The more times it appears, the more relevant is this document. The inverse document frequency takes into account how often a term appears as a percentage of all the documents in the index. The more frequently the term appears, the less weight it has.

Would using one index instead of several have any baselining effect on how the stored records are scored? Thanks in advance!

Hi @tdicken73,

You can experiment with a different search type, DFS Query Then Fetch (search_type=dfs_query_then_fetch), but note that it adds a roundtrip from the coordinating node to each involved node, so it has an impact on performance. There is also a blog post about the differences between search types.
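To illustrate why the search type matters here, the following is a rough sketch of the inverse document frequency calculation, assuming the classic Lucene TF/IDF formula (1 + ln(numDocs / (docFreq + 1))) that is the default similarity in the 2.x line. The index names and document counts are made up for illustration, and each index is treated as a single shard, which is a simplification. It compares IDF computed from local statistics with IDF computed from statistics summed across indices, which is roughly what the extra DFS roundtrip collects.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene TF/IDF inverse document frequency (assumed formula)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term; names and numbers are illustrative only.
stats = {
    "big-index": {"num_docs": 100_000_000, "doc_freq": 1_000},
    "small-index": {"num_docs": 10_000_000, "doc_freq": 1_000},
}


def local_idfs(stats):
    """IDF as each index would compute it from its own statistics."""
    return {
        name: classic_idf(s["doc_freq"], s["num_docs"])
        for name, s in stats.items()
    }


def global_idf(stats):
    """IDF computed from statistics summed over all indices (the DFS idea)."""
    total_docs = sum(s["num_docs"] for s in stats.values())
    total_doc_freq = sum(s["doc_freq"] for s in stats.values())
    return classic_idf(total_doc_freq, total_docs)


local = local_idfs(stats)
shared = global_idf(stats)
for name, value in local.items():
    print(f"{name}: local idf = {value:.3f}")
print(f"all indices: global idf = {shared:.3f}")
```

With these made-up numbers, the same term gets a higher local IDF in the larger index, so otherwise identical matches score higher there. With the summed statistics, every index scores the term with the same IDF.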

You could also consider merging your indices into one, but then you would likely need to increase the number of shards, and this is probably a very large change with significant performance implications. So I'd start by changing the search type and see whether it fits your needs.

Daniel

Hey @danielmitterdorfer,

I don't believe DFS Query Then Fetch works for multiple indices, does it? When I attempt this as a GET call, Elasticsearch throws this error:

"type": "illegal_argument_exception",
"reason": "Alias [alias1] has more than one indices associated with it [[index1 ,index2 ,index3 ,index4]], can't execute a single index op"

It seems like this is what combining everything into one index would be good for. Is there a multi-index operation that functions similarly to dfs_query_then_fetch?

Hi @tdicken73,

Can you please describe in more detail how you tested this?

All of the following worked for me on Elasticsearch 2.3.5 with one master node and two data nodes:

POST /index-1/user/1
{
    "name": "Foo"
}

POST /index-2/user/1
{
    "name": "Bar"
}

GET /index*/_search?search_type=dfs_query_then_fetch
{
    "query": {
        "match_all": {}
    }
}

POST /_aliases
{
    "actions": [
       {
          "add": {
             "index": "index-1",
             "alias": "my-indices"
          }
       },
       {
          "add": {
             "index": "index-2",
             "alias": "my-indices"
          }
       }
    ]
}


GET /my-indices/_search?search_type=dfs_query_then_fetch
{
    "query": {
        "match_all": {}
    }
}

Besides, you could also use index boost (indices_boost), but this can get brittle.
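As a minimal sketch of the index boost option, the snippet below builds a search body with per-index boost factors. It assumes the object form of indices_boost used in the 2.x line (later versions prefer an array form). The helper name build_boosted_search, the queried field, and the boost values are hypothetical and chosen only for illustration.

```python
import json


def build_boosted_search(query_text, field, boosts):
    """Build a search body with per-index boosts (hypothetical helper).

    Assumes the object form of indices_boost: {"index-name": factor}.
    """
    return {
        "query": {"match": {field: query_text}},
        "indices_boost": dict(boosts),
    }


# Illustrative boost factors only; real values would need tuning per data set.
body = build_boosted_search("foo", "name", {"index-1": 1.0, "index-2": 1.5})
print(json.dumps(body, indent=2))
```

The printed body could be sent to the alias's _search endpoint. Because the factors are hand-picked, they would presumably need retuning as the indices grow at different rates, which may be part of why this approach gets brittle.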

Daniel

I believe it was due to a copy/paste error from the docs on my part. It seems to work now. I'll try this out, appreciate it.