Hello,
I'm writing to you because I can't find the answer myself. I've searched everywhere and others have the same problem, but it seems that there is no solution in older Elasticsearch versions. I'm currently using Elasticsearch 6.7.1.
Here is my problem: I would like to query nested documents as if they were main documents (sorted, paginated and highlighted like root types). Let me explain.
I'm working with legal documentation. Each document is cut into several paragraphs (using nested type). Here is a short summary of my mapping.
Mapping:
"mappings" : {
  "document" : {
    "dynamic" : "false",
    "properties" : {
      "dateStart" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss" },
      "dateEnd" : { "type" : "date", "format" : "yyyy-MM-dd HH:mm:ss" },
      "title" : { "type" : "text" },
      "paragraph" : {
        "type" : "nested",
        "properties" : {
          "title" : { "type" : "text" },
          "htmlContent" : { "type" : "text" }
        }
      }
    }
  }
}
Easy enough...
Here comes the tricky part. Sometimes my clients search for documents (they try to find which document contains the laws that applied at a specific time), and sometimes they search for paragraphs (as Google would).
Is it possible to query Elasticsearch to get the paragraphs sorted by score, as a result I can paginate over? I need the ten best paragraphs (across all documents, not grouped by document), then the next ten best paragraphs, and so on.
Expected result (I'm not showing a real Elasticsearch result, only the paragraph IDs):
{
  {score: 8, id: "document_1_paragraph_5", "highlighted best part htmlContent"},
  {score: 6, id: "document_7_paragraph_9", "highlighted best part htmlContent"},
  {score: 3, id: "document_3_paragraph_4", "highlighted best part htmlContent"},
  {score: 1, id: "document_1_paragraph_9", "highlighted best part htmlContent"}
}
As you can see, document 1 has two matching paragraphs in the result, and they are not grouped by document.
Currently, what I'm able to achieve is this (example: a user searching for laws applied between 1995 and 2008 about balconies):
POST reef4i_19074_1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "(dateStart:[\"1995-05-01 00:00:00\" TO *]) AND (dateEnd:[* TO \"2008-09-30 00:00:00\"])"
          }
        },
        {
          "nested": {
            "path": "paragraph",
            "query": {
              "query_string": {
                "query": "paragraph.htmlContent:(ballustrade size on balcony)"
              }
            },
            "inner_hits": {
              "_source": [ "paragraph.title", "paragraph.htmlContent" ]
            }
          }
        }
      ]
    }
  },
  "_source": "ONLY_INNER_HITS"
}
Result:
{
  "document_1": {
    {score: 8, id: "document_1_paragraph_5", "highlighted best part htmlContent"},
    {score: 1, id: "document_1_paragraph_9", "highlighted best part htmlContent"}
  },
  "document_7": {
    {score: 6, id: "document_7_paragraph_9", "highlighted best part htmlContent"}
  },
  "document_3": {
    {score: 3, id: "document_3_paragraph_4", "highlighted best part htmlContent"}
  }
}
The problem with the above result is that I can only find the ten best paragraphs within the ten best documents (I need to retrieve up to 10 x 10 = 100 paragraphs to make sure I have the 10 best). And when I want to paginate, I need to search for the 20 best paragraphs in the 20 best documents (which retrieves up to 400 paragraphs), and so on.
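To make the cost concrete, the client-side merge described above can be sketched roughly as follows in Python. This is only an illustration, not my real code: the function names are made up, the response is assumed to be the parsed JSON of the search above (inner hits under the "paragraph" key, each with a "_score" and a "_nested" offset), and the paragraph ID format is the one from my expected result. As far as I understand, merging like this is only safe if the document score reflects its best paragraph (for example "score_mode": "max" on the nested query), which is an assumption here.

```python
from typing import Any, Dict, List


def best_paragraphs(response: Dict[str, Any], page: int, size: int = 10,
                    path: str = "paragraph") -> List[Dict[str, Any]]:
    """Merge the inner hits of every returned document and slice out one page."""
    merged = []
    for doc in response["hits"]["hits"]:
        inner = doc.get("inner_hits", {}).get(path, {}).get("hits", {}).get("hits", [])
        for hit in inner:
            merged.append({
                "score": hit["_score"],
                # The nested offset is the paragraph's position inside its parent document.
                "id": f'{doc["_id"]}_paragraph_{hit["_nested"]["offset"]}',
            })
    # Sort globally by score so paragraphs are no longer grouped by document.
    merged.sort(key=lambda p: p["score"], reverse=True)
    start = (page - 1) * size
    return merged[start:start + size]


def paragraphs_to_fetch(page: int, size: int = 10) -> int:
    """Worst-case number of paragraphs pulled from Elasticsearch to serve one page."""
    wanted = page * size  # best N documents, each with its best N paragraphs
    return wanted * wanted
```

With this approach, paragraphs_to_fetch(1) is 100 and paragraphs_to_fetch(2) is 400, so the work grows quadratically with the page number, which is exactly what I would like to avoid.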
One solution would be to have two separate indexes (one for clients searching documents, one for clients searching paragraphs): one index with documents and nested paragraphs (like my example), and another index where the paragraph is the root type and the document properties are duplicated.
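For the second index, the flattening step could look something like the Python sketch below. It is only a sketch under assumptions: the function name and the "documentId" and "documentTitle" fields are hypothetical, the source document is assumed to follow the mapping above, and the paragraph position in the ID is zero-based (the same convention Elasticsearch uses for nested offsets).

```python
from typing import Any, Dict, Iterator


def flatten_document(doc_id: str, source: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one root-level paragraph document per nested paragraph."""
    for position, paragraph in enumerate(source.get("paragraph", [])):
        yield {
            # Hypothetical ID scheme: parent ID plus the paragraph position.
            "_id": f"{doc_id}_paragraph_{position}",
            # Document properties duplicated onto every paragraph,
            # so date filters still work without a nested query.
            "documentId": doc_id,
            "documentTitle": source.get("title"),
            "dateStart": source.get("dateStart"),
            "dateEnd": source.get("dateEnd"),
            # The paragraph's own fields become root fields.
            "title": paragraph.get("title"),
            "htmlContent": paragraph.get("htmlContent"),
        }
```

Each yielded dictionary would then be indexed as its own document, so an ordinary query with from/size could sort and paginate paragraphs directly. The price is the duplicated data and keeping both indexes in sync.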
But maybe you have another solution; maybe it's possible to achieve what I need with a bucket, an aggregation, etc.
Maybe I could execute two queries: one to find the 10 best paragraphs, the other to highlight the content.
If you have a solution where I don't have to create two indexes, that would be great! Even if highlighting doesn't work (I already have a good Lucene highlighter to display data coming from a MySQL database).
Thanks for helping me. I don't need a full working query, only a hint on where to look in your documentation (what words should I type into Google to find the solution?).