I have this data in my index:
{
"filekey": "book1",
"word": "carlos",
"page": 1
},
{
"filekey": "book1",
"word": "license",
"page": 1
},
{
"filekey": "book1",
"word": "carlos",
"page": 2
},
{
"filekey": "book1",
"word": "testament",
"page": 2
}
But now I need a new index with the following structure:
{
"filekey": "book1",
"page": 1,
"words": [
{
"word": "carlos"
},
{
"word": "license"
}
]
},
{
"filekey": "book1",
"page": 2,
"words": [
{
"word": "carlos"
},
{
"word": "testament"
}
]
}
That is, the words are grouped by filekey and then by page.
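As a minimal sketch of the intended transformation (plain Python, no Elasticsearch client involved), the flat documents can be regrouped into one document per (filekey, page). The helper name `group_by_page` is hypothetical, and the sketch assumes the documents arrive sorted by filekey and page, which for a large index would have to come from a sorted scan rather than an in-memory sort:

```python
from itertools import groupby
from operator import itemgetter


def group_by_page(docs):
    """Yield one document per (filekey, page).

    Assumes `docs` is already sorted by (filekey, page), because
    groupby only merges adjacent items with the same key.
    """
    key = itemgetter("filekey", "page")
    for (filekey, page), group in groupby(docs, key=key):
        yield {
            "filekey": filekey,
            "page": page,
            "words": [{"word": d["word"]} for d in group],
        }


# Sample data copied from the current index structure.
flat_docs = [
    {"filekey": "book1", "word": "carlos", "page": 1},
    {"filekey": "book1", "word": "license", "page": 1},
    {"filekey": "book1", "word": "carlos", "page": 2},
    {"filekey": "book1", "word": "testament", "page": 2},
]

# Sorting in memory is only for this small demo.
sorted_docs = sorted(flat_docs, key=itemgetter("filekey", "page"))
pages = list(group_by_page(sorted_docs))
```

Because the function is a generator, it never holds more than one page's words in memory at a time, which matters for an index of this size.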
I have already looked at reindex and aggregations, but I'm still not sure how to achieve this. I also think I need to map words as nested, because I will need to filter the elements inside the words array so that I don't get the full array back: it could contain hundreds of words and I need only the ones that match.
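A possible mapping for the new index, written as a Python dict that could be sent as the body of an index-creation request. This is only a sketch: the field types (`keyword` for filekey, `integer` for page, `text` for word) are assumptions, and the variable name `page_index_body` is hypothetical. The relevant part is `"type": "nested"` on words:

```python
import json

# Hypothetical index body; the field types are assumptions, not taken
# from the real mapping.
page_index_body = {
    "mappings": {
        "properties": {
            "filekey": {"type": "keyword"},
            "page": {"type": "integer"},
            "words": {
                # nested keeps each word object as its own hidden document,
                # so individual words can be matched and returned separately.
                "type": "nested",
                "properties": {
                    "word": {"type": "text"},
                },
            },
        }
    }
}

print(json.dumps(page_index_body, indent=2))
```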
The problem:
In the project we currently search for words in files. The text is split into individual words because each word carries its "topography" on the page (its coordinates), so we can visually highlight the specific text the person is looking for.
Right now we run a basic query that filters the words against the phrase the user is looking for in a page, and then we collapse the results per page so we don't get duplicate pages. This messes up the score given by ES, because it scores the words individually, while what the user needs is the best match across the whole page, not per word. I haven't been able to come up with a query that groups the words and then filters on the grouped words with my current structure, and even if I could, it would probably perform badly, as we have 800 GB of words stored in the ES index.
The query would be more natural if I changed the structure of the index so that words are grouped by page.
For context, this is the query I'm currently running:
"query": {
"bool": {
"should": [
{
"match": {
"word": "testament salomon"
}
}
]
}
},
"collapse": {
"field": "page",
"inner_hits": {
"name": "pages",
"collapse": {
"field": "filekey.keyword"
},
"_source": ["word", "page"],
"size": 0
}
}
And this is the query I would run with the new structure:
"query": {
"match": {
"words.word": {
"query": "testament salomon"
}
}
}
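One caveat with this query: if words is mapped as nested, a plain match on words.word at the top level will not find anything, because nested objects are indexed as separate hidden documents and have to be reached through a nested query. A hedged sketch of what that could look like, built as a Python dict (the helper `build_page_query` and its `max_words` parameter are hypothetical). `inner_hits` returns only the matching words instead of the full array, and `score_mode` set to `sum` adds up the scores of the matching words so that a page matching more of the phrase should rank higher:

```python
def build_page_query(phrase, max_words=10):
    """Build a nested query body for the page-level index.

    Assumes `words` is mapped as a nested field with a `word` subfield.
    """
    return {
        "query": {
            "nested": {
                "path": "words",
                "query": {
                    "match": {"words.word": {"query": phrase}},
                },
                # Combine the scores of all matching words on the page.
                "score_mode": "sum",
                # Return only the matching word objects, up to max_words.
                "inner_hits": {"size": max_words},
            }
        }
    }


query = build_page_query("testament salomon")
```

Scoring is still computed per word and then combined per page, so this may not behave exactly like a single match over the whole page text.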