Hi everyone,
I just got started with Elasticsearch and it's really awesome! I am, however, struggling to implement one specific use case. Let's say I have different logical entities like books and authors, and I want to implement a free-text search using Elasticsearch. Authors have the attributes firstname and lastname, and books have the attributes title and subtitle. Intuitively, I would create two different indices. When a user types a search, I would like to run a multi_match query on both of these indices and return the union, ranked by a single score. Running one query on each index doesn't yield satisfying results, as the scores from two different indices (or rather shards) can't really be compared (probably mainly because of inverse document frequency).
What would be the best practice for implementing this use case? Should I use search_type=dfs_query_then_fetch? Should I instead create one denormalized index that contains both books and authors and store the type in an additional attribute? Or should I try to change the scoring algorithm?
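To illustrate the IDF concern raised above, here is a minimal sketch using the IDF formula from Lucene's BM25 similarity. The document counts and frequencies are made-up numbers for illustration only, and in a real cluster these statistics are gathered per shard and per field rather than per index.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term of BM25 as used by Lucene: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics: the term "john" occurs in 50 documents in each
# index, but the indices differ greatly in size.
authors_idf = bm25_idf(doc_count=1_000, doc_freq=50)
books_idf = bm25_idf(doc_count=100_000, doc_freq=50)

print(f"authors idf: {authors_idf:.3f}")
print(f"books idf:   {books_idf:.3f}")
```

With these assumed numbers the same term gets a noticeably higher IDF in the larger index, which is one reason scores computed separately are hard to compare.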
Following up on your suggestion to go with the second option, I would like to get a better understanding of what this could look like and what the implications are.
For the sake of argument, let's say I also want to store the book description. I assume that I would end up with a sparse index, i.e. an index with the attributes firstname, lastname, title, subtitle, description and type.
For books all attributes are populated, while for authors I would only populate type, firstname and lastname. If the user searches for, say, John in a multi_match query on all these attributes, I see the following problem:
Books by any author whose first name or last name is John, and that also contain the word John anywhere in the description, title or subtitle, might rank higher than the authors themselves. Because of IDF, I could even see some books that are not by any author named John scoring higher, simply because they contain John somewhere in the description.
Hey @stelitz, take a look at these docs. You can search across multiple indices by providing a comma-separated list of index names in the URL, which in your case might look like:
GET /books,authors/_search
You'll want to update your query to specify which fields from each index to search against ("first_name", "last_name", "title", "subtitle"). With that comma-separated list of index names in the path, Elasticsearch will search both indices in parallel, and the results will be merged and ranked against each other by score in a single response.
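As a rough sketch of what such a request could look like, the snippet below builds the request path and a multi_match body as plain Python data, without contacting a cluster. The helper name `build_search` is made up for illustration, and the field names are the ones mentioned above; adjust them to your actual mappings.

```python
import json


def build_search(text: str) -> dict:
    """Build a multi_match request body covering author and book fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Fields from both indices; each index only has some of them.
                "fields": ["first_name", "last_name", "title", "subtitle"],
            }
        }
    }


# Comma-separated index names in the path, as in the request above.
path = "/books,authors/_search"
body = build_search("John")

print("GET", path)
print(json.dumps(body, indent=2))
```

The printed path and JSON body can be pasted into any HTTP client. The search_type=dfs_query_then_fetch option from the original question would go on the same path as a URL parameter.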
Hi @SeriouslyAwesome,
Thank you for your answer, and sorry for the late reply. I will test it out. One more question, however. Let's say I now also want to add the attributes (first_name, last_name) to my books index (because I also want an author's books to appear if someone enters the name of that author). Will Elasticsearch calculate IDF based on the union of both sets or for each index individually? I.e., are the scores comparable?
Yes, it is common (and often recommended) to denormalize your data like that. You might consider adding an author field to your books index, which itself can have its own nested fields for first_name and last_name (and any other author properties you might want to use when searching books). In your query, you can then match against author.first_name. Take a look at this documentation on nesting objects in your documents.
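Here is a minimal sketch of that idea, expressed as Python dictionaries rather than live API calls. It assumes a plain object field called author (not the dedicated `nested` field type, which would need a `nested` query) and the typeless mapping format of recent Elasticsearch versions. The helper name `author_aware_query` is hypothetical.

```python
import json

# Assumed mapping for the books index with a denormalized author object.
books_mapping = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "subtitle": {"type": "text"},
            "author": {
                "properties": {
                    "first_name": {"type": "text"},
                    "last_name": {"type": "text"},
                }
            },
        }
    }
}


def author_aware_query(text: str) -> dict:
    """Match the text against book fields and the embedded author fields."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [
                    "title",
                    "subtitle",
                    "author.first_name",
                    "author.last_name",
                ],
            }
        }
    }


print(json.dumps(books_mapping, indent=2))
print(json.dumps(author_aware_query("John"), indent=2))
```

With an object field like this, the author properties are addressed with dot notation (author.first_name) in the same multi_match query as the book fields.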