I've indexed PDFs page by page. To maintain the book (parent) / page (child) relationship, what is the advantage of using joins?
The URL of the book, as stored in the file system, is also one of the fields in my mappings,
so whenever a user accesses a document and needs the whole book, I can return the URL.
Is this the correct way of achieving it, or does it have any disadvantages? And would adding a parent-child relation bring any advantage, such as better performance or relevancy?
Generally speaking, it is always good if you can avoid parent-child relations or nested documents. They always come with a performance penalty and usually require more complex query structures for lookup.
That sounds like a good approach for the PDF use case. Storing large documents in individual chunks (e.g. pages), each with a pointer to the page number and to the document itself (such as a URL or ID), is usually good practice.
One thing to be aware of, though, is that phrase queries across page boundaries won't work: if one page ends with "New" and the next starts with "York", you won't get any matches for the phrase "New York". I think professional typesetters mostly try to avoid these cases. A solution would be to store interleaved pages as well, i.e. not only pages 1, 2, 3, 4 but also pages 1.5-2.5, 2.5-3.5, etc. Going down that path you will need much more storage space, though, so I'm not sure it is justified just to solve a relatively rare problem with phrases across page boundaries.
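The interleaving idea can be sketched roughly as follows. This is only an illustration in Python, assuming each page is available as a plain string and that "half a page" can be approximated by splitting on whitespace; the function name `interleave_pages` and the label format are made up for this example.

```python
def interleave_pages(pages):
    """Return (label, text) chunks: every page, plus one window per page boundary.

    Each window joins the second half of a page with the first half of the
    next page, so a phrase that straddles the boundary lands in one chunk.
    """
    chunks = []
    for i, text in enumerate(pages):
        chunks.append((str(i + 1), text))
        if i + 1 < len(pages):
            left = text.split()
            right = pages[i + 1].split()
            # Second half of this page followed by first half of the next one.
            window = left[len(left) // 2:] + right[:(len(right) + 1) // 2]
            chunks.append((f"{i + 1}.5-{i + 2}.5", " ".join(window)))
    return chunks


chunks = interleave_pages(["I moved to New", "York last year"])
for label, text in chunks:
    print(label, "|", text)
```

Here the boundary window labelled "1.5-2.5" contains "New York", which neither page contains on its own. Each window roughly adds one extra page worth of text to the index, which is the storage cost mentioned above.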
Everything makes sense. Thank you very much, this is very helpful.
Regarding the page boundaries, I had never given a thought to what you said. I was trying NLP sentence segmentation but couldn't get it to work. I will definitely give your suggestion a try; it consumes more space, but that's fine.
In addition, when we index the PDFs page by page, the most relevant document usually only appears among the 3rd to 6th JSON docs in the Elasticsearch response. This is mainly because the pages are too long, 50-60 lines each, so we are currently working on dividing pages into small paragraphs of fixed size.
A 50-60 line page will now be divided into 5-6 paragraphs, with one paragraph per JSON doc. We are doing this to increase relevancy. Does that make sense?
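For illustration, the fixed-size split could look something like the sketch below. It assumes a page is a newline-separated string and uses 10 lines per paragraph; the field names (`book_url`, `page`, `paragraph`, `text`) and the example URL are placeholders, not the actual mapping.

```python
def split_page(lines, lines_per_chunk=10):
    """Group a page's lines into consecutive chunks of at most lines_per_chunk lines."""
    return [
        "\n".join(lines[start:start + lines_per_chunk])
        for start in range(0, len(lines), lines_per_chunk)
    ]


def page_to_docs(book_url, page_number, page_text, lines_per_chunk=10):
    """Build one JSON-style doc per paragraph, each pointing back to its page and book."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]  # drop blank lines
    return [
        {"book_url": book_url, "page": page_number, "paragraph": i + 1, "text": chunk}
        for i, chunk in enumerate(split_page(lines, lines_per_chunk))
    ]


# A made-up 55-line page becomes 6 docs: five with 10 lines and one with 5.
page_text = "\n".join(f"line {n}" for n in range(1, 56))
docs = page_to_docs("file:///books/example.pdf", 12, page_text)
print(len(docs), docs[-1]["paragraph"])
```

Every paragraph doc keeps the page number and the book URL, so the whole book can still be returned without a parent-child relation.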
We are trying to achieve at least 80% accuracy. As of now, as mentioned, the relevant document is most of the time among docs 3-6. Also, please share any suggestions you can think of to increase relevancy; we will be very happy to try them.
The way you describe it makes sense to me. It might not always be the right strategy but I've seen things like this quite often.
I don't understand how you define "accuracy". The term is used differently in different areas. In my case, I use "accuracy" mostly as it is used in binary classification, where it roughly means how well a classifier predicts a binary outcome. In information retrieval (IR), I have often seen precision as the thing most people mean when they talk about top-K search results, but there are other common [evaluation metrics](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)).
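To make the difference concrete, here is a small sketch of two common IR metrics, precision at K and reciprocal rank, computed for a single query. The document IDs and the relevance judgments are invented for the example; in practice they would come from your own labelled queries.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (missing results count as misses)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


ranked = ["p7", "p3", "p9", "p1", "p4"]   # IDs in the order the search returned them
relevant = {"p3", "p4", "p8"}             # IDs judged relevant for this query

print(precision_at_k(ranked, relevant, 5))  # 2 of the top 5 are relevant
print(reciprocal_rank(ranked, relevant))    # first relevant hit is at rank 2
```

Averaging the reciprocal rank over many queries gives mean reciprocal rank, which is sensitive to whether the first relevant hit sits at position 1 or further down the list.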
Giving general advice on how to increase relevancy is difficult because it is very domain-specific and depends on your users' needs. There is a whole chapter on relevance tuning in "Elasticsearch: The Definitive Guide" that just covers the basics, but it's a good start if you need to dive deeper into that area. I hope you find it useful and can follow up with more concrete questions on this in another thread in the forum, where I'm sure someone will be happy to answer.
Doug Turnbull and John Berryman’s book “Relevant Search” is fantastic. You might want to check out the slides from the last two Haystack conferences too.
Half-baked idea: if it won’t blow out your index, you might experiment with adding a field that includes the page before, the actual page and the page after. Construct a query that requires a hit on the target page but boosts the relevance score, with a higher boost for the target page and a lower boost for the context pages.
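One way this could be expressed is a `bool` query with a required `match` on the page itself and an optional, lower-boosted `match` on the context field. The sketch below only builds the request body as a Python dict; the field names `page_text` and `context_text` and the boost values are assumptions to be tuned, not recommendations.

```python
import json


def build_context_query(user_query, page_field="page_text",
                        context_field="context_text",
                        page_boost=2.0, context_boost=0.5):
    """Require a match on the target page; let surrounding pages add a smaller score."""
    return {
        "query": {
            "bool": {
                # must: the document is only returned if the page itself matches.
                "must": [
                    {"match": {page_field: {"query": user_query, "boost": page_boost}}}
                ],
                # should: a match in the before/actual/after field only raises the score.
                "should": [
                    {"match": {context_field: {"query": user_query, "boost": context_boost}}}
                ],
            }
        }
    }


body = build_context_query("new york subway")
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the body of a search request. Because the context clause sits under `should`, it never excludes a page on its own.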
This looks interesting. Will try to see what it does in comparison to the Ranking Evaluation API we introduced in Elasticsearch 6.2, especially its UI support which we are currently still lacking in Elasticsearch.
Very helpful as always. Many thanks for the suggestions; I will go through them. And yes, constructing a query as you described makes complete sense. Special thanks for that.
I'd like to know whether the requirement below can be achieved with Elasticsearch.
For the search results Elasticsearch returns in response to a user query, suppose the user is not satisfied with them (we get a "NO" response from the user). Then, whenever any other user runs the same query, we should not return the document that Elasticsearch returned last time, because the previous user indicated it is not relevant.
Assumption: users have a complete understanding of the data, so if a user is not satisfied, the document is certainly not relevant and we should not return it next time.
This sounds like a very strange requirement to me, at least from the perspective of a search engine. This is also why we don't implement anything like this in Elasticsearch. I would add this kind of query blacklist somewhere in your application logic, storing all queries for which you don't want to return anything anymore. This way you can return fast and don't even have to run those queries in ES.
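As a rough sketch of what such application-side logic might look like: the class below is a per-query variant that remembers which document IDs were rejected for a given query and filters them out of later results, rather than blocking the whole query. It is plain Python with an in-memory dict; the class name, the normalization rule and the assumption that hits are dicts carrying an `_id` key are all illustrative, and a real application would need persistent storage.

```python
class RejectedResults:
    """Remember (query, document) pairs that users marked as not relevant."""

    def __init__(self):
        self._rejected = {}  # normalized query -> set of rejected doc IDs

    @staticmethod
    def _normalize(query):
        # Treat queries that differ only in case or spacing as the same query.
        return " ".join(query.lower().split())

    def mark_rejected(self, query, doc_id):
        self._rejected.setdefault(self._normalize(query), set()).add(doc_id)

    def filter_hits(self, query, hits):
        """Drop hits whose _id was rejected earlier for this query."""
        rejected = self._rejected.get(self._normalize(query), set())
        return [hit for hit in hits if hit["_id"] not in rejected]


store = RejectedResults()
store.mark_rejected("New  York subway", "doc-42")

hits = [{"_id": "doc-42"}, {"_id": "doc-7"}]
filtered = store.filter_hits("new york subway", hits)
print([hit["_id"] for hit in filtered])
```

Note that a single "NO" permanently hides a document for that query in this sketch, which only makes sense under the assumption stated in the question that users always judge relevance correctly.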