I need to index book texts (mostly extracted from page images via OCR) and run full-text search over those pages.
TL;DR: Should I
- index each part as a separate document,
- or index records as documents with a dynamic field for pages?
Records are books, and parts are pages that contain text.
Record A
    page1
    page2
    ...
    pageN
Record B
    page1
    page2
    ...
    pageN
I need to search the value text field, aggregate by record_id, and display the best N hits sorted by score descending.
My model so far is to index each page as a separate document and then aggregate by record_id.
Current mapping
    "mappings": {
      "properties": {
        "part_id": { "type": "integer" },
        "record_id": { "type": "integer" },
        "ri_published": { "type": "boolean" },
        "rp_visible": { "type": "boolean" },
        "value": { "type": "text" }
      }
    }
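For illustration, a single page document under this mapping might look like the sketch below. All values are made up, and the text is shortened:

```json
{
  "part_id": 17,
  "record_id": 42,
  "ri_published": true,
  "rp_visible": true,
  "value": "... john wayne rode into town ..."
}
```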
Requirements
- sort results by relevance/score, descending (phrase, lucene dismax, ...)
- need to exclude some parts (part_id) if they are not visible (rp_visible=false)
- need to page results
- (later) need to filter by some metadata (not yet in the index)
- this metadata filter is needed only sometimes
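A minimal sketch of how the visibility requirement could be expressed, assuming rp_visible is set on every page document. The filter clause does not affect scoring, so the phrase match alone determines relevance:

```json
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "value": { "query": "john wayne" } } }
      ],
      "filter": [
        { "term": { "rp_visible": true } }
      ]
    }
  }
}
```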
Solutions that I use
Query 1 (older)
    {
      "_source": { "excludes": ["value"] },
      "query": {
        "bool": {
          "must": [
            { "match_phrase": { "value": { "query": "john wayne" } } }
          ]
        }
      },
      "size": 0,
      "aggs": {
        "by_record": {
          "terms": {
            "field": "record_id",
            "order": { "by_score_max": "desc" }
          },
          "aggs": {
            "by_top_records": {
              "top_hits": {
                "size": 3,
                "highlight": {
                  "pre_tags": ["[ii]"],
                  "post_tags": ["[/ii]"],
                  "fields": {
                    "value": {
                      "number_of_fragments": 5,
                      "fragment_size": 100,
                      "type": "unified",
                      "order": "none",
                      "no_match_size": 100
                    }
                  }
                },
                "_source": { "excludes": ["value"] }
              }
            },
            "by_score_max": {
              "max": { "script": { "source": "_score" } }
            }
          }
        }
      }
    }
Query 2 (composite), currently in use
    {
      "_source": { "excludes": ["value"] },
      "query": {
        "bool": {
          "must": [
            { "match_phrase": { "value": { "query": "john wayne" } } }
          ]
        }
      },
      "size": 0,
      "aggs": {
        "by_record": {
          "composite": {
            "size": 200,
            "sources": {
              "record_id": { "terms": { "field": "record_id" } }
            }
          },
          "aggs": {
            "by_record_top": {
              "top_hits": {
                "size": 3,
                "highlight": {
                  "pre_tags": ["[ii]"],
                  "post_tags": ["[/ii]"],
                  "fields": {
                    "value": {
                      "number_of_fragments": 2,
                      "fragment_size": 100,
                      "type": "unified",
                      "order": "none",
                      "no_match_size": 200
                    }
                  }
                },
                "_source": { "excludes": ["value"] }
              }
            },
            "by_record_max": {
              "max": { "script": { "source": "_score" } }
            },
            "agg_bucket_limit": {
              "bucket_sort": {
                "from": 0,
                "size": 20,
                "sort": { "by_record_max": { "order": "desc" } }
              }
            }
          }
        }
      }
    }
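For the paging requirement, a composite aggregation can continue from the after_key returned in the previous response. A sketch of just the composite part, assuming the previous response returned an after_key with record_id 1234 (a made-up value):

```json
{
  "composite": {
    "size": 200,
    "sources": [
      { "record_id": { "terms": { "field": "record_id" } } }
    ],
    "after": { "record_id": 1234 }
  }
}
```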